The problem this solves
Multi-market consumer research has a translation problem.
Most tools translate everything to English first, then extract. This is catastrophic:
- A German reviewer doesn't say "battery life". They say "Akkulaufzeit", and the nuance is different
- French Reddit threads use distinct vocabulary that doesn't survive Google Translate
- Korean parenting forum slang is structurally untranslatable
- Translation collapses 80% of the signal and amplifies the rest as noise
Theia solves this with native extraction + canonical mapping.
The two-step process
Step 1: Extract in source language.
German reviews are processed by a German-aware extraction pass: the Claude model is prompted in German and returns German labels, and sentiment is scored against German distributions.
The same applies to French, Italian, Japanese, and Korean.
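As a rough illustration of Step 1, the sketch below routes each review to a prompt written in its own language and refuses to fall back to translation when no native pass exists. The names `PROMPTS` and `build_extraction_prompt`, and the prompt wording itself, are assumptions made for this example; they are not Theia's actual prompts or API, and the call to the model is left out.

```python
# Hypothetical prompt templates, one per source language (illustrative wording only).
PROMPTS = {
    "de": (
        "Extrahiere die in dieser Rezension erwähnten Produkteigenschaften "
        "und gib die Labels auf Deutsch zurück:\n\n{text}"
    ),
    "fr": (
        "Extrais les caractéristiques du produit mentionnées dans cet avis "
        "et renvoie les labels en français :\n\n{text}"
    ),
    "it": (
        "Estrai le caratteristiche del prodotto menzionate in questa recensione "
        "e restituisci le etichette in italiano:\n\n{text}"
    ),
}


def build_extraction_prompt(lang: str, text: str) -> str:
    """Return a prompt in the source language; never translate as a fallback."""
    try:
        template = PROMPTS[lang]
    except KeyError:
        raise ValueError(f"no native extraction pass for language {lang!r}") from None
    return template.format(text=text)


# The review text stays in German and is wrapped in a German prompt.
prompt = build_extraction_prompt("de", "Die Akkulaufzeit ist hervorragend.")
```

Raising on an unsupported language, instead of quietly translating to English, keeps the failure visible. That is one possible design choice, not a documented behaviour.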
Step 2: Map to canonical properties.
After extraction, raw labels are mapped to a controlled vocabulary:
| Raw label | Source | Canonical property |
|---|---|---|
| battery life | UK review | BATTERY_LIFE |
| akkulaufzeit | DE review | BATTERY_LIFE |
| autonomie batterie | FR review | BATTERY_LIFE |
| autonomia | IT review | BATTERY_LIFE |
| battery | YouTube UK | BATTERY_LIFE |
Now you can ask "how does BATTERY_LIFE sentiment compare across UK / DE / FR / IT?" — and the answer is real, not a translation artefact.
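A minimal sketch of that cross-market question, assuming the mapping is available as a plain dictionary. The raw labels and the canonical property come from the table above; the sentiment scores, the `extractions` list, and the function name `sentiment_by_market` are invented for illustration and are not real data.

```python
from collections import defaultdict
from statistics import mean

# Raw label -> canonical property, taken from the table above.
CANONICAL = {
    "battery life": "BATTERY_LIFE",
    "akkulaufzeit": "BATTERY_LIFE",
    "autonomie batterie": "BATTERY_LIFE",
    "autonomia": "BATTERY_LIFE",
    "battery": "BATTERY_LIFE",
}

# (market, raw label, sentiment score). Scores are made up for the example.
extractions = [
    ("UK", "battery life", 0.5),
    ("UK", "battery", 0.25),
    ("DE", "Akkulaufzeit", -0.25),
    ("FR", "autonomie batterie", 0.5),
    ("IT", "autonomia", 0.75),
]


def sentiment_by_market(rows, prop):
    """Average sentiment per market for one canonical property."""
    buckets = defaultdict(list)
    for market, raw_label, score in rows:
        # casefold() lets "Akkulaufzeit" match the lower-cased vocabulary key.
        if CANONICAL.get(raw_label.casefold()) == prop:
            buckets[market].append(score)
    return {market: mean(scores) for market, scores in buckets.items()}


result = sentiment_by_market(extractions, "BATTERY_LIFE")
# {'UK': 0.375, 'DE': -0.25, 'FR': 0.5, 'IT': 0.75}
```

The comparison happens on the canonical property, so the German and French labels are never translated on the way in.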
Why this matters
Three reasons consumer brands and research firms care about this:
01 — Multi-market launches need real perception comparison. If you're launching in 6 markets, you need to know which feature is the strongest sell in each. Translation-based tools give you 6 identical answers.
02 — Source-language vocabulary reveals positioning gaps. The German term "umweltbewusst" (environmentally aware) appears in Bose German reviews 4× more than the English equivalent. A translation-first pipeline would miss this entirely.
03 — Editorial and influencer extraction depends on it. The French YouTube creator ecosystem talks about cameras differently from the UK one. Native extraction preserves the difference; harmonisation makes it comparable.
Implementation
The canonical vocabulary is stored in a label_harmonisation table. Every snippet stored in rag_snippets carries its raw label and links to the canonical property via this mapping.
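One way the two tables could relate is sketched below, using an in-memory SQLite database. Only the table names `label_harmonisation` and `rag_snippets` come from the text; every column name, the composite key, and the choice of SQLite are assumptions for the sake of a runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# Hypothetical schema: column names are illustrative, not the real ones.
conn.executescript("""
CREATE TABLE label_harmonisation (
    raw_label          TEXT NOT NULL,
    source_lang        TEXT NOT NULL,
    canonical_property TEXT NOT NULL,
    PRIMARY KEY (raw_label, source_lang)
);
CREATE TABLE rag_snippets (
    id           INTEGER PRIMARY KEY,
    snippet_text TEXT NOT NULL,
    raw_label    TEXT NOT NULL,
    source_lang  TEXT NOT NULL,
    FOREIGN KEY (raw_label, source_lang)
        REFERENCES label_harmonisation (raw_label, source_lang)
);
""")

conn.executemany(
    "INSERT INTO label_harmonisation VALUES (?, ?, ?)",
    [
        ("akkulaufzeit", "de", "BATTERY_LIFE"),
        ("autonomie batterie", "fr", "BATTERY_LIFE"),
    ],
)
conn.execute(
    "INSERT INTO rag_snippets (snippet_text, raw_label, source_lang) VALUES (?, ?, ?)",
    ("Die Akkulaufzeit ist hervorragend.", "akkulaufzeit", "de"),
)

# Each snippet keeps its raw label; the canonical property is resolved by join.
rows = conn.execute("""
    SELECT s.snippet_text, h.canonical_property
    FROM rag_snippets AS s
    JOIN label_harmonisation AS h
      ON h.raw_label = s.raw_label AND h.source_lang = s.source_lang
""").fetchall()
```

Keying the mapping on label plus language is a guess at how a term such as "autonomia" would be kept unambiguous across markets.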
The mapping itself is built and maintained as a fixed entity ontology — curated by domain experts, not LLM-discovered. This prevents canonical drift, super-node concentration, and the kind of taxonomy entropy that destroys retrieval quality over time.
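The practical consequence of a fixed ontology can be sketched as follows: an unknown raw label never creates a new canonical property on the fly, it is set aside for a human curator. The names `CanonicalProperty`, `CURATED`, `review_queue`, and `harmonise` are hypothetical, and the review-queue mechanism is an assumption about how curation might work.

```python
from enum import Enum
from typing import Optional


class CanonicalProperty(Enum):
    """Closed vocabulary: new members are added by curators, not at runtime."""
    BATTERY_LIFE = "BATTERY_LIFE"


# Curated mapping from lower-cased raw labels to canonical properties.
CURATED = {
    "battery life": CanonicalProperty.BATTERY_LIFE,
    "akkulaufzeit": CanonicalProperty.BATTERY_LIFE,
    "autonomie batterie": CanonicalProperty.BATTERY_LIFE,
}

# Labels awaiting a curator's decision (hypothetical mechanism).
review_queue = []


def harmonise(raw_label: str) -> Optional[CanonicalProperty]:
    """Map a raw label to a canonical property, or queue it for review."""
    prop = CURATED.get(raw_label.casefold())
    if prop is None:
        review_queue.append(raw_label)  # no automatic property creation
    return prop


mapped = harmonise("Akkulaufzeit")  # CanonicalProperty.BATTERY_LIFE
unmapped = harmonise("Ladezeit")    # None; "Ladezeit" lands in review_queue
```

Because the property set is an `Enum`, it cannot grow as a side effect of extraction, which is the drift the paragraph above warns about.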
Strategic implication
For a research firm pitching multi-market clients, native extraction + harmonisation is a credible moat. Most competitor tools either (a) skip non-English markets, (b) translate first and lose signal, or (c) report per-language only and never compare.
The firms that solved harmonisation are the ones taking multi-market work others can't deliver.