The problem this solves
Most AI-driven market research demos work brilliantly once and fall apart at scale. Three failure modes consistently kill them:
- Runaway API spend: an LLM extraction loop runs without a cap and burns $5,000 in an afternoon
- Silent data loss: a SERP retrieval times out, the row is dropped, and no one notices for two weeks
- Irreproducible outputs: re-running the same analysis gives a different answer, and nobody can explain why
The fix is a deterministic harness wrapped around every pipeline step. Without it, AI research is a demo. With it, AI research is infrastructure.
What the harness does
Five mechanisms enforce determinism, safety, and reproducibility across the entire Theia pipeline:
01 — Guards
Assertion functions called at every checkpoint that costs money. They raise a typed error (PipelineGuardError) and halt execution before damage is done.
| Guard | Purpose |
|---|---|
| assert_elbow_applied | Verify selection actually reduced candidates before scraping |
| assert_max_items | Hard cap (e.g. 150 URLs to Labs, 2000 keywords to SERP) |
| assert_scoped | Refuse unscoped runs (no "process everything") |
| assert_country_set | Country parameter required at every entry point |
| assert_budget | Block if estimated cost exceeds the cap (default $50/batch) |
The assert_budget guard alone has prevented at least four would-be incident reports that we know of across consumer brand deployments. A typo in a keyword universe used to cost a junior analyst their afternoon and the team a five-figure cloud invoice. Now it raises an error.
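A minimal sketch of what three of these guards might look like. Only the guard names, PipelineGuardError, and the default $50 cap come from the table above; the function signatures, the RuntimeError base class, and the error messages are assumptions made for illustration.

```python
class PipelineGuardError(RuntimeError):
    """Raised when a guard refuses to let a costly step proceed."""


def assert_max_items(items, max_items, label="items"):
    """Hard cap on how many items may be sent to a paid API."""
    if len(items) > max_items:
        raise PipelineGuardError(
            f"{label}: {len(items)} exceeds the cap of {max_items}"
        )


def assert_budget(estimated_cost, cap=50.0):
    """Block the step if the estimated cost exceeds the cap (default $50/batch)."""
    if estimated_cost > cap:
        raise PipelineGuardError(
            f"estimated cost ${estimated_cost:.2f} exceeds cap ${cap:.2f}"
        )


def assert_country_set(country):
    """Require a non-empty country parameter (signature is an assumption)."""
    if not country or not country.strip():
        raise PipelineGuardError("country parameter is required")


# Usage: a keyword universe that is too large halts before any API call is made.
keywords = [f"kw{i}" for i in range(2500)]
try:
    assert_max_items(keywords, max_items=2000, label="SERP keywords")
except PipelineGuardError as err:
    guard_message = str(err)
    print(guard_message)  # SERP keywords: 2500 exceeds the cap of 2000
```

Raising a dedicated exception type, rather than returning a boolean, means a caller cannot ignore the guard by accident: the step stops unless the error is handled explicitly.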
02 — Run tracking
Every pipeline run gets a unique run_id. Every step within the run is recorded with status (pending / running / completed / failed), timestamps, and detail. The state is persisted (Supabase) so it survives crashes.
On resume, the flow checks tracker.is_step_done() before each step and skips completed ones. A 12-step pipeline that crashes at step 8 resumes from step 8 — not from scratch.
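The resume behaviour can be sketched as follows. This is an illustration under stated assumptions, not the Theia tracker: only run_id, the four statuses, and is_step_done() come from the text. The RunTracker class, its run_step method, and the in-memory dict standing in for the persisted Supabase state are all hypothetical.

```python
import uuid
from datetime import datetime, timezone


class RunTracker:
    """In-memory stand-in for a persisted step tracker (hypothetical API)."""

    def __init__(self, run_id=None, store=None):
        self.run_id = run_id or uuid.uuid4().hex
        # `store` stands in for the persisted table, keyed by (run_id, step).
        self.store = store if store is not None else {}

    def _record(self, step, status, detail=""):
        self.store[(self.run_id, step)] = {
            "status": status,
            "at": datetime.now(timezone.utc).isoformat(),
            "detail": detail,
        }

    def is_step_done(self, step):
        row = self.store.get((self.run_id, step))
        return row is not None and row["status"] == "completed"

    def run_step(self, step, fn):
        """Run `fn` unless the step already completed; record the outcome."""
        if self.is_step_done(step):
            return False  # skipped on resume
        self._record(step, "running")
        try:
            fn()
        except Exception as exc:
            self._record(step, "failed", detail=repr(exc))
            raise
        self._record(step, "completed")
        return True


def crash():
    raise RuntimeError("simulated crash")


# First attempt: "collect" completes, "enrich" crashes.
store, calls = {}, []
tracker = RunTracker(run_id="demo", store=store)
tracker.run_step("collect", lambda: calls.append("collect"))
try:
    tracker.run_step("enrich", crash)
except RuntimeError:
    pass

# Resume with the same run_id and store: completed steps are skipped.
resumed = RunTracker(run_id="demo", store=store)
resumed.run_step("collect", lambda: calls.append("collect"))  # skipped
resumed.run_step("enrich", lambda: calls.append("enrich"))    # runs again
print(calls)  # ['collect', 'enrich']
```

Because the step status lives outside the process, the resumed tracker needs nothing from the crashed one except the run_id.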
03 — Structured polling results
API polling returns a typed dataclass, not a bare list:
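from dataclasses import dataclass
from typing import List

@dataclass
class RetrievalResult:
    completed: List[str]   # tasks that returned results
    failed: List[str]      # tasks that failed
    timed_out: List[str]   # tasks that timed out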
Failed and timed-out tasks are visible. They can be retried, escalated, or logged. They never silently disappear.
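A hedged sketch of a polling loop that fills such a result. The RetrievalResult fields match the dataclass above, repeated here with default factories so the block runs on its own. The poll_tasks function, its parameters, and the get_status callable with its three status strings are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class RetrievalResult:
    completed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    timed_out: List[str] = field(default_factory=list)


def poll_tasks(task_ids, get_status, max_polls=5, interval=0.0):
    """Poll each task until it finishes or the poll budget runs out.

    `get_status(task_id)` is a hypothetical callable returning
    "pending", "completed", or "failed".
    """
    result = RetrievalResult()
    pending = list(task_ids)
    for _ in range(max_polls):
        still_pending = []
        for task_id in pending:
            status = get_status(task_id)
            if status == "completed":
                result.completed.append(task_id)
            elif status == "failed":
                result.failed.append(task_id)
            else:
                still_pending.append(task_id)
        pending = still_pending
        if not pending:
            break
        time.sleep(interval)
    # Anything still pending is reported as timed out, never dropped.
    result.timed_out.extend(pending)
    return result


# Usage with a fake status source: one task of each kind.
statuses = {"t1": "completed", "t2": "failed", "t3": "pending"}
result = poll_tasks(["t1", "t2", "t3"], statuses.get, max_polls=3)
print(result)
# RetrievalResult(completed=['t1'], failed=['t2'], timed_out=['t3'])
```

Every input task ID ends up in exactly one of the three lists, so the caller can check that the counts add up to the number of tasks submitted.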
04 — BQ insertion error tracking
Every batch insert to BigQuery is wrapped. Failures are stored in an _failed_inserts list with the rows, the error response, and the timestamp. The pipeline checks this list after each step and either retries with exponential backoff or escalates to the analyst.
Before this discipline, BQ streaming-buffer hiccups silently dropped 0.1% to 2% of rows on every run. With it, a dropped row is either retried successfully or visible in the failure list.
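One way the wrapper could be arranged, as a sketch only. The _failed_inserts list and its contents (rows, error response, timestamp) follow the description above. The TrackedInserter class, its method names, and the insert_fn callable are hypothetical, and this version retries inline before recording a failure, which is just one possible ordering. insert_fn is assumed to return a list of error entries that is empty on success.

```python
import time
from datetime import datetime, timezone


class TrackedInserter:
    """Wraps a batch-insert callable and records every failure."""

    def __init__(self, insert_fn, max_retries=3, base_delay=0.5, sleep=time.sleep):
        self.insert_fn = insert_fn      # hypothetical: rows -> list of errors
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.sleep = sleep              # injectable so tests need not wait
        self._failed_inserts = []

    def insert(self, rows):
        """Insert a batch; return True on success, False if recorded as failed."""
        errors = []
        for attempt in range(self.max_retries + 1):
            try:
                errors = self.insert_fn(rows)
            except Exception as exc:
                errors = [{"exception": repr(exc)}]
            if not errors:
                return True
            if attempt < self.max_retries:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                self.sleep(self.base_delay * (2 ** attempt))
        self._failed_inserts.append({
            "rows": rows,
            "errors": errors,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return False

    def assert_no_failed_inserts(self):
        """Escalate after a step so failed batches cannot pass silently."""
        if self._failed_inserts:
            raise RuntimeError(f"{len(self._failed_inserts)} batch insert(s) failed")


# Usage: a fake insert that fails twice, then succeeds.
attempts = {"n": 0}

def flaky_insert(rows):
    attempts["n"] += 1
    return [] if attempts["n"] >= 3 else [{"reason": "backendError"}]

inserter = TrackedInserter(flaky_insert, sleep=lambda seconds: None)
ok = inserter.insert([{"url": "https://example.com"}])
print(ok, attempts["n"])  # True 3
```

A batch that still fails after the last retry stays in _failed_inserts with its rows attached, so it can be replayed or handed to an analyst rather than lost.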
05 — Cumulative-CTR selection (not LLM judgement)
Where the pipeline selects URLs to scrape or keywords to process, the selection is mathematical and deterministic — not an LLM "pick the most relevant" call.
For example: select URLs covering 65% of cumulative search-visibility CTR. The same input always produces the same output. The selection is interpretable ("we stopped at 65% because diminishing returns set in") and bounded (capped by max_labs_urls).
LLM-based selection at this stage would be cheaper to implement and dramatically less reproducible. The deterministic approach is the harder engineering call and the right one.
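The 65% rule can be sketched in a few lines. The coverage threshold and the max_labs_urls cap come from the text above; the function name, the input shape (a mapping from URL to estimated CTR), and the tie-breaking rule are assumptions made for illustration.

```python
def select_by_cumulative_ctr(url_ctr, coverage=0.65, max_labs_urls=150):
    """Pick the top URLs whose CTR adds up to `coverage` of the total.

    `url_ctr` maps URL -> estimated CTR. Ties are broken by URL so the
    ordering, and therefore the output, is fully deterministic.
    """
    total = sum(url_ctr.values())
    if total <= 0:
        return []
    # Highest CTR first; alphabetical URL order among equal CTR values.
    ranked = sorted(url_ctr.items(), key=lambda kv: (-kv[1], kv[0]))
    selected, running = [], 0.0
    for url, ctr in ranked:
        if running >= coverage * total or len(selected) >= max_labs_urls:
            break
        selected.append(url)
        running += ctr
    return selected


# Usage: the first two URLs already cover 70% of the total, so selection stops.
ctr = {"a.com": 0.40, "b.com": 0.30, "c.com": 0.20, "d.com": 0.10}
chosen = select_by_cumulative_ctr(ctr)
print(chosen)  # ['a.com', 'b.com']
```

The explicit tie-break matters for reproducibility: without it, two URLs with equal CTR could swap places depending on input order, and the cut-off could fall between them.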
Why this matters strategically
Three things change once the harness is in place:
01 — Cost stops being a wildcard. Every Theia engagement comes in on budget because the budget is asserted, not estimated. Consumer brand deployments at £1,500/month/country don't surprise the CFO at month-end.
02 — Outputs become defensible.
"Why does the deck say 47% market share?" gets a real answer: the exact run_id, the exact source URLs, the exact aggregation, the exact harmonisation step. Strategy that can't be defended at this level doesn't earn the right to be acted on.
03 — Pipelines become composable. Once each step is bounded, resumable, and reproducible, you can chain them safely. The 5-stage Theia pipeline (Collect → Enrich → Structure → Strategise → Converse) only works because each stage has the harness. Without it, you'd be rebuilding from scratch every Monday.
Why most AI research tools don't have this
Three reasons:
01 — Harness work is invisible. A guard that prevents a £10K mistake doesn't show up in a demo. A tracker that enables resume is boring. Vendors don't market what doesn't sell.
02 — Harness discipline is expensive at the start. Every pipeline step has to be wrapped. Every API client has to return structured results. Every cost path has to be asserted. This is 30% of pipeline engineering effort, paid up-front, returned over 18-24 months.
03 — The market doesn't yet penalise the absence. A research firm can ship an unverifiable LLM-generated deck today and the client won't know to ask "what was the run_id?" Until that question becomes standard, vendors that built the harness will be invisibly out-engineering vendors that didn't.
Strategic implication
For any brand evaluating an AI-driven market research platform in 2026, three questions matter more than the dashboard demo:
- "Show me a budget assertion in the codebase."
- "Show me a run that was resumed after a crash."
- "Show me two runs against the same inputs and tell me why they produced identical outputs."
Vendors that can't answer are selling demos. Vendors that can are selling infrastructure.