Scrape, Enrich, Score, Act: The Pipeline Behind Every Tool I Build
Jun 3, 2026 · 6 min read
There's a four-step pattern behind every commercial automation I've built — from CRM pipelines to creator discovery. Once you see it, you'll use it everywhere.
The pattern I keep rebuilding
Every meaningful automation I've built at Dansu follows the same four steps: scrape, enrich, score, act.
I didn't design it as a framework. I noticed it after building three separate systems — a CRM pipeline for B2B festival partnerships, a creator discovery engine, and a partner-scoring tool — and realising they all had the same skeleton. The surface details varied. The underlying structure did not.
If you're building any system that touches leads, contacts, creators, partners, or candidates, you're probably doing some version of this already. The difference between pipelines that work reliably and ones that collapse under pressure comes down to how deliberately each step is designed.
Step 1 — Scrape
Every pipeline starts with raw data collection. The source doesn't matter much: Instagram accounts, festival websites, LinkedIn profiles, directory listings, inbound email threads. What matters is that the scrape is structured from the first moment.
At Dansu, for the B2B festival pipeline, that meant scraping festival sites for contact information, pulling Instagram accounts for engagement signals, and parsing those results into clean rows with consistent fields. For the creator engine, it meant hitting RocketAPI and pulling reels data — views, likes, posting velocity, category tags.
The temptation at this stage is to just pull everything and sort it out later. That's a mistake. Unstructured scrapes create downstream mess that compounds at every subsequent step. I always define the schema before I write the first scraper: what fields do I need, what format should they arrive in, what's mandatory versus optional. That constraint is what makes the rest of the pipeline tractable.
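A minimal sketch of what defining the schema first could look like. The class name `FestivalRow`, the field names, and the helper `parse_row` are illustrative assumptions, not the actual Dansu schema. The idea is only that mandatory and optional fields are declared once, and a raw scrape either fits or is rejected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FestivalRow:
    # Mandatory: a row without these is rejected at scrape time.
    name: str
    source_url: str
    # Optional: filled in when the page exposes them, otherwise None.
    contact_email: Optional[str] = None
    instagram_handle: Optional[str] = None


def parse_row(raw: dict) -> FestivalRow:
    """Build a FestivalRow from a raw scraped dict, or raise ValueError."""
    missing = [f for f in ("name", "source_url") if not str(raw.get(f) or "").strip()]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    handle = raw.get("instagram_handle")
    return FestivalRow(
        name=raw["name"].strip(),
        source_url=raw["source_url"].strip(),
        contact_email=raw.get("contact_email") or None,
        # Normalise handles so "@TrailFest" and "trailfest" arrive in one format.
        instagram_handle=handle.strip().lstrip("@").lower() if handle else None,
    )
```

Rejecting a bad row at this point is cheaper than discovering an empty field during enrichment or scoring.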
Practically: scrapers break. Websites change structure. Rate limits kick in. The scrape layer needs to be fault-tolerant — retries with backoff, status tracking per record, and a way to resume from the last successful point rather than starting over.
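The three properties above (retries with backoff, per-record status, resumability) can be sketched in a few lines. This is an illustration under assumptions: `fetch` is a hypothetical function that returns page content or raises, and `status` is a plain dict standing in for a persistent status table.

```python
import time
from typing import Callable, Dict, Iterable


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    status: Dict[str, str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL with retries, recording a per-record status.

    `status` is meant to persist between runs, so URLs already marked
    "done" are skipped when the job is resumed.
    """
    results: Dict[str, str] = {}
    for url in urls:
        if status.get(url) == "done":
            continue  # resume: skip work finished in an earlier run
        for attempt in range(max_attempts):
            try:
                results[url] = fetch(url)
                status[url] = "done"
                break
            except Exception:
                status[url] = "failed"
                if attempt < max_attempts - 1:
                    # Exponential backoff: base_delay, then 2x, then 4x, ...
                    sleep(base_delay * 2 ** attempt)
    return results
```

Passing `sleep` as a parameter is a design choice for testability; in a real run the default `time.sleep` applies.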
Step 2 — Enrich
Raw scraped data is never enough to make a decision. The enrich step fills in the gaps.
For the festival CRM, enrichment meant taking a scraped company name and finding the details attached to it: job titles, direct emails, and social handles where available. For creators, it meant calculating engagement rate, pulling historical post frequency, and classifying content category from caption signals.
This is where LLMs started earning their keep in my workflows. Structured field extraction from messy text is something a well-prompted model handles well — pulling a contact name from a block of "About" page copy, classifying a creator's niche from a short bio, or summarising a festival's scale and type from scraped text. The key is giving the model strict output schemas and only the information you've actually collected. No invented details.
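One way to enforce a strict output schema is to validate the model's reply before it touches the record. The sketch below assumes the model was asked to return JSON with exactly two keys. The key names, the niche labels, and `parse_model_output` are hypothetical, and the substring check is a deliberately crude stand-in for "no invented details".

```python
import json
from typing import Optional

ALLOWED_NICHES = {"running", "outdoor", "lifestyle", "other"}  # hypothetical labels
REQUIRED_KEYS = {"contact_name", "niche"}


def parse_model_output(raw_json: str, source_text: str) -> Optional[dict]:
    """Accept the model's reply only if it matches the schema and is grounded.

    Returns the parsed dict, or None if the reply should be rejected.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None  # extra or missing keys: reject rather than guess
    niche = data["niche"]
    if not isinstance(niche, str) or niche not in ALLOWED_NICHES:
        return None
    name = data["contact_name"]
    if name is not None:
        # Grounding check: an extracted name must literally appear in the input.
        if not isinstance(name, str) or name.lower() not in source_text.lower():
            return None
    return data
```

A rejected reply can be retried or left blank. Either is safer than storing a plausible-looking guess.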
Enrichment also handles deduplication. Before a record hits the scoring step, I check it against what's already in Supabase. Dedupe keys — unique identifiers per entity, not just per scrape run — prevent the same lead being processed twice when the scraper reruns.
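A dedupe key per entity could be derived roughly like this. The normalisation rules and the function names are assumptions, and the in-memory `seen_keys` set stands in for a lookup against the stored records.

```python
import hashlib
from typing import Dict, Iterable, List, Set


def dedupe_key(entity_type: str, identifier: str) -> str:
    """Stable key per entity: the same creator or festival maps to the same key on every run."""
    normalized = identifier.strip().lower().lstrip("@").rstrip("/")
    return hashlib.sha256(f"{entity_type}:{normalized}".encode("utf-8")).hexdigest()


def new_records(records: Iterable[Dict[str, str]], seen_keys: Set[str]) -> List[Dict[str, str]]:
    """Return only records whose key is not already known, and remember them."""
    fresh = []
    for rec in records:
        key = dedupe_key(rec["type"], rec["identifier"])
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append({**rec, "dedupe_key": key})
    return fresh
```

Because the key depends only on the entity and not on the scrape run, rerunning the scraper produces the same keys. In a Postgres-backed store, a unique constraint on the key column would enforce the same rule at the database level.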
Step 3 — Score
Enriched records are now comparable. Scoring turns a pile of data into a ranked list.
For festival partnerships, the scoring model weighted four things: relevance of the festival to the Dansu customer profile (outdoor, lifestyle, running-adjacent), estimated attendance size, past partnership signals, and contact quality. A high-quality festival with a direct contact email scored well. A niche event with only a generic info address scored poorly.
For creators, the model weighted engagement velocity (rising accounts, not plateauing ones), posting consistency, content-brand fit, and the absence of red flags such as a follower count that looks inflated relative to engagement.
The score doesn't have to be sophisticated to be useful. A simple weighted sum beats manual gut-feel because it's consistent and auditable. You can look at why a record scored the way it did, adjust the weights if your intuition was off, and rerun the batch. You can't do that with vibes.
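A weighted sum that stays auditable might look like the sketch below. The weights are made-up numbers, and each signal is assumed to be normalised to the range 0 to 1 before scoring. Neither reflects the real model.

```python
from typing import Dict, Tuple

# Hypothetical weights; each signal is assumed to be pre-normalised to 0..1.
WEIGHTS: Dict[str, float] = {
    "relevance": 0.4,
    "attendance": 0.2,
    "past_partnerships": 0.1,
    "contact_quality": 0.3,
}


def score(signals: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> Tuple[float, Dict[str, float]]:
    """Return (total, per-signal contributions) so every score can be audited."""
    contributions = {
        # Missing signals count as 0; out-of-range values are clamped to 0..1.
        name: weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in weights.items()
    }
    return round(sum(contributions.values()), 4), contributions
```

Returning the per-signal contributions alongside the total is what makes the "why did this record score that way" question answerable. Changing a weight and rerunning the batch is a one-line edit.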
The scoring threshold is also where the human gate lives. Records above a threshold go into the act queue automatically. Records in a grey zone get flagged for review. Records below threshold are parked — not discarded, because signals change over time, but not acted on now.
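The three-way split can be expressed as a small routing function. The threshold values and the queue names here are placeholders chosen for illustration.

```python
def route(score: float, act_threshold: float = 0.7, review_threshold: float = 0.4) -> str:
    """Map a score to a queue: 'act', 'review' (grey zone), or 'parked'."""
    if review_threshold > act_threshold:
        raise ValueError("review_threshold must not exceed act_threshold")
    if score >= act_threshold:
        return "act"
    if score >= review_threshold:
        return "review"  # grey zone: a human decides
    return "parked"  # kept for later re-scoring, not discarded
```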
Step 4 — Act
The act step is where the pipeline produces something in the real world. This is also where the most damage can happen if the earlier steps were sloppy.
For the CRM pipeline, act meant triggering personalised outreach — email drafts generated from the enriched fields, using only the data the pipeline had actually collected about that contact. Not invented context. Not generic openers. Two subject line variants, a three-sentence opener, and a clear CTA.
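One way to keep drafts tied to collected data is to build the drafting prompt from a whitelist of non-empty fields and nothing else. The field names and both helpers below are hypothetical. The prompt wording is only an example of the constraint, not the prompt actually used.

```python
from typing import Dict, Optional

# Hypothetical field names; only these may reach the drafting prompt.
ALLOWED_FIELDS = ("contact_name", "festival_name", "festival_type")


def collected_facts(record: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep whitelisted fields that actually hold a non-empty value."""
    facts = {}
    for field in ALLOWED_FIELDS:
        value = record.get(field)
        if isinstance(value, str) and value.strip():
            facts[field] = value.strip()
    return facts


def build_prompt(record: Dict[str, Optional[str]]) -> str:
    """Build a drafting prompt that lists only collected facts."""
    facts = collected_facts(record)
    if "festival_name" not in facts:
        raise ValueError("not enough collected data to personalise a draft")
    fact_lines = "\n".join(f"- {key}: {value}" for key, value in facts.items())
    return (
        "Write two subject line variants, a three-sentence opener, and a clear CTA.\n"
        "Use only the facts below. Do not add details that are not listed.\n"
        f"{fact_lines}"
    )
```

Empty fields are dropped rather than passed through, so the model is never handed a blank to fill in.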
The rule I follow: the riskier the action, the more human approval it needs. Sending a cold email to a festival organiser is low enough stakes that I'll automate the send once the draft looks clean. Reaching out to a potential B2B anchor partner is not — that goes into a review queue regardless of score.
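As a sketch, that rule reduces to a check that runs before anything is sent. The action names and the threshold are invented for the example.

```python
# Hypothetical action names: anything listed here is reviewed regardless of score.
REVIEW_ALWAYS = {"anchor_partner_outreach"}


def needs_human_approval(action: str, score: float, act_threshold: float = 0.7) -> bool:
    """Return True when a person must approve the action before it runs."""
    if action in REVIEW_ALWAYS:
        return True  # brand-sensitive: score cannot bypass the gate
    return score < act_threshold
```

Keeping the always-review list separate from the score means a tuning change to the weights can never quietly switch off the gate.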
The act step also closes the feedback loop. Every send, every reply, every bounce gets written back to Supabase. That data feeds future scoring iterations. The pipeline improves as it runs.
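The write-back can be illustrated with an in-memory SQLite table standing in for the real database. The table layout, event names, and helper functions are assumptions made for this example only.

```python
import sqlite3

VALID_EVENTS = {"sent", "replied", "bounced"}


def open_store() -> sqlite3.Connection:
    """Create an in-memory outcomes table (a stand-in for the real store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outcomes ("
        "record_id TEXT NOT NULL, "
        "event TEXT NOT NULL, "
        "logged_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def log_outcome(conn: sqlite3.Connection, record_id: str, event: str) -> None:
    """Append one outcome event for a record."""
    if event not in VALID_EVENTS:
        raise ValueError(f"unknown event: {event}")
    conn.execute("INSERT INTO outcomes (record_id, event) VALUES (?, ?)", (record_id, event))
    conn.commit()


def reply_rate(conn: sqlite3.Connection) -> float:
    """Share of contacted records that replied; one possible input to re-scoring."""
    sent = conn.execute(
        "SELECT COUNT(DISTINCT record_id) FROM outcomes WHERE event = 'sent'"
    ).fetchone()[0]
    replied = conn.execute(
        "SELECT COUNT(DISTINCT record_id) FROM outcomes WHERE event = 'replied'"
    ).fetchone()[0]
    return replied / sent if sent else 0.0
```

A figure like `reply_rate`, broken down by score band, is the kind of signal that can show whether the weights are pointing at the right records.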
Why this pattern scales
I've now applied this skeleton to three distinct contexts and it holds in all of them. The reason it scales is that each step has a clean interface to the next. Scrape produces structured rows. Enrichment adds fields to those rows. Scoring reads those fields and appends a score. Act consumes records above a threshold.
This means each layer can be improved independently. I can upgrade the enrichment logic without touching the scraper. I can tune the scoring model without changing how outreach is generated. I can swap the email tool without affecting anything upstream.
The other reason it scales: it's observable. Every record has a status in Supabase: queued, enriched, scored, sent, replied, or failed. At any point I can see where the pipeline is, where it's stuck, and what failed. Silent failures are the enemy of any automation. A pipeline that silently enriches only half a batch is worse than one that fails loudly.
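A status summary over the records could be as simple as the sketch below. The six status values come from the list above. The `Status` enum and `summarize` are illustrative names, and records are assumed to be dicts with a `status` field.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class Status(str, Enum):
    QUEUED = "queued"
    ENRICHED = "enriched"
    SCORED = "scored"
    SENT = "sent"
    REPLIED = "replied"
    FAILED = "failed"


def summarize(records: Iterable[dict]) -> Dict[str, int]:
    """Count records per status; an unknown status raises ValueError."""
    counts = Counter(Status(rec["status"]).value for rec in records)
    # Report every status, including those with zero records.
    return {status.value: counts.get(status.value, 0) for status in Status}
```

Raising on an unknown status is deliberate: a typo in a status string should fail loudly rather than hide records from the summary.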
What to borrow
If you're building something similar, these are the design decisions that saved me the most debugging time:
- Define the schema before writing the scraper. Don't collect data and figure out structure later.
- Dedupe at the enrich step, not the act step. Catching duplicates late costs more than catching them early.
- Use scoring thresholds, not binary filters. A grey zone that routes to human review is almost always worth the complexity.
- Gate risky actions. Automate the boring parts aggressively. Gate the brand-sensitive parts deliberately.
- Close the feedback loop. Write outcomes back into the same database the pipeline reads from.
The pipeline is only as good as the loop it creates. If outcomes don't feed back into scoring, you're flying blind after the first run.