What is RAG snippets?

The primary storage unit of Theia's intelligence layer. Every extracted feature, benefit, use-case, comparison, and sentiment from every source ends up as a snippet — atomically queryable, source-traceable, harmonised.

RAG snippets

What it is

rag_snippets is a single BigQuery table that holds every piece of extracted intelligence. Reviews, YouTube transcripts, web articles, Reddit posts, BazaarVoice, AI Overviews — all converge here, atomically.

Each row is one snippet: one feature mention, one benefit claim, one use-case observation, one comparison, or one product sentiment. With source ID, sentiment score, harmonised canonical property, and timestamp.

This is the single primary table for analytical queries. Strategy agents read from it. The chatbot reads from pre-computed tables that read from it. Almost every interesting question about a market resolves to a query against rag_snippets joined with label_harmonisation.

Schema (simplified)

Column	Type	Notes
`snippet_id`	STRING	Unique
`source_type`	STRING	review, youtube, web_article, reddit, bazaarvoice, ai_overview
`source_id`	STRING	Joins back to review_id, video_id, article_id
`product_name`	STRING	Canonical (post-resolver)
`snippet_type`	STRING	feature, benefit, use_case, comparison, product_sentiment
`category_label`	STRING	Raw extracted label, source-language
`property_value`	STRING	e.g. "4K 60fps", "2 hours"
`sentiment_score`	FLOAT	-1.0 to 1.0
`sentiment_label`	STRING	positive / neutral / negative
`snippet_text`	STRING	Exact quote — never modified
`country`	STRING	lowercase: uk, de, fr, it, us
`language`	STRING	source language
`created_at`	TIMESTAMP	Source publication date, not pipeline time

Why this design works

Three properties matter:

01 — Atomic. One row = one observation. No nested arrays of properties per source. This makes aggregation trivial and joins predictable.

02 — Source-traceable. Every snippet carries source_type + source_id. Click through to the review, the video, the article. Strategy agents include source citations because the data is structured to support them.

03 — Harmonised on read. category_label stays in the source language. The label_harmonisation join provides the canonical property at query time. This preserves nuance while enabling cross-language analysis.

How it gets populated

The intelligence pipeline writes to rag_snippets:

Reviews: Oxylabs scrape → Claude extraction → snippet rows
YouTube: Oxylabs transcript → Claude extraction → snippet rows
Web articles: SERP discovery → Oxylabs scrape → Claude extraction → snippet rows
Reddit: Reddit API → Claude extraction → snippet rows
AI Overviews: DataForSEO LLM Mentions → parsed → snippet rows

Every write goes through product_resolver.resolve_to_canonical() first, so product_name is always a canonical model name.

Scale

Production deployments:

Canon Consumer (UK/DE/FR/IT): 737K snippets
Canon B2B (US/DE/JP/KR): 90K+ snippets, growing
Bose Germany: 90K snippets from a 10-day pilot
Principality (UK FS): 90,107 snippets in 6 weeks

The largest tables are 4M+ snippets. Query latency on a single-product look-up is sub-second; full-category aggregation is single-digit seconds.

What it isn't

It's not a vector store. There's no embedding column on rag_snippets. Retrieval uses structured queries against the canonical property (fixed entity architecture), not cosine similarity to a question.

Vector search has its place (chatbot fallback for ambiguous queries) but the primary retrieval surface is structured. This is faster, more deterministic, and more explainable than vector RAG for the kinds of questions market research actually asks.

Related terms