
RAG snippets

The primary storage unit of Theia's intelligence layer. Every extracted feature, benefit, use-case, comparison, and sentiment from every source ends up as a snippet — atomically queryable, source-traceable, harmonised.

What it is

rag_snippets is a single BigQuery table that holds every piece of extracted intelligence. Reviews, YouTube transcripts, web articles, Reddit posts, BazaarVoice, AI Overviews — all converge here, atomically.

Each row is one snippet: one feature mention, one benefit claim, one use-case observation, one comparison, or one product sentiment. Every row carries a source ID, a sentiment score, and a timestamp, and maps to a harmonised canonical property through label_harmonisation.

This is the single primary table for analytical queries. Strategy agents read from it. The chatbot reads from pre-computed tables that read from it. Almost every interesting question about a market resolves to a query against rag_snippets joined with label_harmonisation.

Schema (simplified)

| Column | Type | Notes |
| --- | --- | --- |
| snippet_id | STRING | Unique |
| source_type | STRING | review, youtube, web_article, reddit, bazaarvoice, ai_overview |
| source_id | STRING | Joins back to review_id, video_id, article_id |
| product_name | STRING | Canonical (post-resolver) |
| snippet_type | STRING | feature, benefit, use_case, comparison, product_sentiment |
| category_label | STRING | Raw extracted label, source-language |
| property_value | STRING | e.g. "4K 60fps", "2 hours" |
| sentiment_score | FLOAT | -1.0 to 1.0 |
| sentiment_label | STRING | positive / neutral / negative |
| snippet_text | STRING | Exact quote, never modified |
| country | STRING | lowercase: uk, de, fr, it, us |
| language | STRING | source language |
| created_at | TIMESTAMP | Source publication date, not pipeline time |
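
To make the row shape concrete, here is a minimal Python sketch of one snippet as a typed record. The field names and allowed values come from the schema; the Snippet class, the validation checks, and the example values (product name, IDs, quote) are illustrative assumptions, not Theia's actual code or data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Allowed values copied from the schema's Notes column.
SOURCE_TYPES = {"review", "youtube", "web_article", "reddit", "bazaarvoice", "ai_overview"}
SNIPPET_TYPES = {"feature", "benefit", "use_case", "comparison", "product_sentiment"}
SENTIMENT_LABELS = {"positive", "neutral", "negative"}


@dataclass(frozen=True)
class Snippet:
    """One row of rag_snippets: a single atomic observation."""
    snippet_id: str
    source_type: str
    source_id: str
    product_name: str
    snippet_type: str
    category_label: str
    property_value: str
    sentiment_score: float
    sentiment_label: str
    snippet_text: str
    country: str
    language: str
    created_at: datetime

    def __post_init__(self) -> None:
        # Assumed checks, inferred from the schema notes (not the real pipeline's validation).
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source_type: {self.source_type!r}")
        if self.snippet_type not in SNIPPET_TYPES:
            raise ValueError(f"unknown snippet_type: {self.snippet_type!r}")
        if self.sentiment_label not in SENTIMENT_LABELS:
            raise ValueError(f"unknown sentiment_label: {self.sentiment_label!r}")
        if not -1.0 <= self.sentiment_score <= 1.0:
            raise ValueError("sentiment_score must be within [-1.0, 1.0]")
        if self.country != self.country.lower():
            raise ValueError("country must be lowercase")


# Illustrative row; every value here is made up for the example.
example = Snippet(
    snippet_id="snip-0001",
    source_type="review",
    source_id="review-101",
    product_name="Example Camera X",
    snippet_type="feature",
    category_label="video resolution",
    property_value="4K 60fps",
    sentiment_score=0.8,
    sentiment_label="positive",
    snippet_text="Shoots 4K 60fps without overheating.",
    country="uk",
    language="en",
    created_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
```

Freezing the dataclass mirrors the "never modified" rule on snippet_text: once a row is built, nothing downstream can edit the quote in place.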

Why this design works

Three properties matter:

01 — Atomic. One row = one observation. No nested arrays of properties per source. This makes aggregation trivial and joins predictable.

02 — Source-traceable. Every snippet carries source_type + source_id. Click through to the review, the video, the article. Strategy agents include source citations because the data is structured to support them.

03 — Harmonised on read. category_label stays in the source language. The label_harmonisation join provides the canonical property at query time. This preserves nuance while enabling cross-language analysis.
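
The three properties can be seen working together in a small sketch. It uses Python's built-in sqlite3 as a local stand-in for BigQuery, and it assumes label_harmonisation has the columns category_label, language, and canonical_property; the source does not describe that table's layout, so treat those names, and all the sample rows, as hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for BigQuery
conn.executescript("""
CREATE TABLE rag_snippets (
    snippet_id TEXT PRIMARY KEY,
    product_name TEXT,
    category_label TEXT,   -- stays in the source language
    language TEXT,
    sentiment_score REAL
);
-- Hypothetical layout: the real label_harmonisation columns are not documented here.
CREATE TABLE label_harmonisation (
    category_label TEXT,
    language TEXT,
    canonical_property TEXT
);
""")

conn.executemany("INSERT INTO rag_snippets VALUES (?, ?, ?, ?, ?)", [
    ("s1", "Example Camera X", "battery life", "en", 0.5),
    ("s2", "Example Camera X", "Akkulaufzeit", "de", -0.5),
    ("s3", "Example Camera X", "autonomie", "fr", 0.75),
    ("s4", "Example Camera X", "Bildqualität", "de", 1.0),
])
conn.executemany("INSERT INTO label_harmonisation VALUES (?, ?, ?)", [
    ("battery life", "en", "battery_life"),
    ("Akkulaufzeit", "de", "battery_life"),
    ("autonomie", "fr", "battery_life"),
    ("Bildqualität", "de", "image_quality"),
])

# Because each row is one observation, aggregation is a plain GROUP BY.
# The canonical property is attached by the join at query time, not stored on the row.
rows = conn.execute("""
    SELECT h.canonical_property,
           COUNT(*) AS mentions,
           AVG(s.sentiment_score) AS avg_sentiment
    FROM rag_snippets AS s
    JOIN label_harmonisation AS h
      ON h.category_label = s.category_label
     AND h.language = s.language
    WHERE s.product_name = ?
    GROUP BY h.canonical_property
    ORDER BY h.canonical_property
""", ("Example Camera X",)).fetchall()

for canonical_property, mentions, avg_sentiment in rows:
    print(canonical_property, mentions, avg_sentiment)
```

The English, German, and French battery labels collapse into one canonical property in the result, while the stored rows keep their original wording.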

How it gets populated

The intelligence pipeline writes to rag_snippets:

  1. Reviews: Oxylabs scrape → Claude extraction → snippet rows
  2. YouTube: Oxylabs transcript → Claude extraction → snippet rows
  3. Web articles: SERP discovery → Oxylabs scrape → Claude extraction → snippet rows
  4. Reddit: Reddit API → Claude extraction → snippet rows
  5. AI Overviews: DataForSEO LLM Mentions → parsed → snippet rows

Every write goes through product_resolver.resolve_to_canonical() first, so product_name is always a canonical model name.
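
A minimal sketch of that write-time guarantee, in Python. Only the name resolve_to_canonical comes from the text above; its signature, the alias table, the build_row helper, and the choice to skip unresolved names are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical alias table. The real resolver's data and matching logic are not described here.
_ALIASES = {
    "example cam x": "Example Camera X",
    "examplecam x mk1": "Example Camera X",
}


def resolve_to_canonical(raw_name: str) -> Optional[str]:
    """Stand-in for product_resolver.resolve_to_canonical(): normalise, then look up."""
    key = " ".join(raw_name.lower().split())  # lowercase and collapse whitespace
    return _ALIASES.get(key)


def build_row(extracted: dict) -> Optional[dict]:
    """Return a copy of the extracted snippet with a canonical product_name.

    Assumption: mentions that cannot be resolved are skipped rather than
    written with a raw name, so the table only ever holds canonical names.
    """
    canonical = resolve_to_canonical(extracted["product_name"])
    if canonical is None:
        return None
    row = dict(extracted)          # copy, so the extractor's output is left untouched
    row["product_name"] = canonical
    return row
```

Resolving before the write, rather than cleaning up afterwards, means every downstream query can filter on product_name with an exact match.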

Scale

Production deployments:

  • Canon Consumer (UK/DE/FR/IT): 737K snippets
  • Canon B2B (US/DE/JP/KR): 90K+ snippets, growing
  • Bose Germany: 90K snippets from a 10-day pilot
  • Principality (UK FS): 90,107 snippets in 6 weeks

The largest tables hold 4M+ snippets. A single-product lookup returns in under a second; a full-category aggregation takes single-digit seconds.

What it isn't

It's not a vector store. There's no embedding column on rag_snippets. Retrieval uses structured queries against the canonical property (fixed entity architecture), not cosine similarity to a question.

Vector search has its place, as a chatbot fallback for ambiguous queries, but the primary retrieval surface is structured. Structured retrieval is faster, more deterministic, and more explainable than vector RAG for the kinds of questions market research actually asks.
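
As a rough illustration of structured retrieval, the Python sketch below answers "what do people say about this product's battery life?" with an exact-match query instead of a similarity search. It uses sqlite3 as a local stand-in for BigQuery; the find_snippets helper, the label_harmonisation columns, and all sample data are hypothetical.

```python
import sqlite3


def find_snippets(conn, product_name, canonical_property, limit=5):
    """Exact-match lookup: the same inputs return the same rows in the same order."""
    return conn.execute("""
        SELECT s.snippet_text, s.source_type, s.source_id
        FROM rag_snippets AS s
        JOIN label_harmonisation AS h
          ON h.category_label = s.category_label
        WHERE s.product_name = ?
          AND h.canonical_property = ?
        ORDER BY s.created_at DESC, s.snippet_id
        LIMIT ?
    """, (product_name, canonical_property, limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rag_snippets (
    snippet_id TEXT PRIMARY KEY,
    source_type TEXT,
    source_id TEXT,
    product_name TEXT,
    category_label TEXT,
    snippet_text TEXT,
    created_at TEXT
);
-- Hypothetical layout for the harmonisation table.
CREATE TABLE label_harmonisation (
    category_label TEXT,
    canonical_property TEXT
);
""")
conn.executemany("INSERT INTO rag_snippets VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("s1", "review", "review-101", "Example Camera X", "battery life",
     "Battery lasts all day", "2024-03-01"),
    ("s2", "youtube", "video-7", "Example Camera X", "Akkulaufzeit",
     "Der Akku hält ewig", "2024-05-01"),
    ("s3", "review", "review-102", "Example Camera X", "image quality",
     "Great image quality", "2024-04-01"),
])
conn.executemany("INSERT INTO label_harmonisation VALUES (?, ?)", [
    ("battery life", "battery_life"),
    ("Akkulaufzeit", "battery_life"),
    ("image quality", "image_quality"),
])

hits = find_snippets(conn, "Example Camera X", "battery_life")
for text, source_type, source_id in hits:
    print(f'"{text}" [{source_type}:{source_id}]')
```

Every hit comes back with its source_type and source_id, so an answer built from these rows can cite exactly where each quote came from, and there is no similarity threshold to tune or explain.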