
Cross-language harmonisation

How Theia maps raw extracted labels from any source language to canonical properties. 'Battery life' / 'akkulaufzeit' / 'autonomie batterie' / 'autonomia' all resolve to BATTERY_LIFE — but extraction happens in the source language first.

The problem this solves

Multi-market consumer research has a translation problem.

Most tools translate everything to English first, then extract. This is catastrophic:

  • A German reviewer doesn't say "battery life" — they say "akkulaufzeit", and the nuance is different
  • French Reddit threads use distinct vocabulary that doesn't survive Google Translate
  • Korean parenting forum slang is structurally untranslatable
  • Translation collapses 80% of the signal and amplifies the rest as noise

Theia solves this with native extraction + canonical mapping.

The two-step process

Step 1: Extract in source language.

German reviews are processed by a German-aware extraction pass. The Claude model is prompted in German, returns German labels, and German sentiment is scored against German distributions.

The same applies to French, Italian, Japanese and Korean: each language gets its own extraction pass.
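As a rough illustration of Step 1, the sketch below picks a prompt written in the review's own language before anything is sent to the model. The template wording, the `EXTRACTION_PROMPTS` dictionary and the `build_extraction_prompt` helper are assumptions made for this example, not Theia's actual prompts, and the model call itself is left out.

```python
# Hypothetical per-language prompt templates (wording is illustrative only).
EXTRACTION_PROMPTS = {
    "de": "Extrahiere die Produkteigenschaften, die in dieser Rezension "
          "erwähnt werden. Antworte auf Deutsch.\n\n{text}",
    "fr": "Extrais les caractéristiques du produit mentionnées dans cet avis. "
          "Réponds en français.\n\n{text}",
    "it": "Estrai le caratteristiche del prodotto citate in questa recensione. "
          "Rispondi in italiano.\n\n{text}",
    "en": "Extract the product properties mentioned in this review. "
          "Answer in English.\n\n{text}",
}


def build_extraction_prompt(lang: str, text: str) -> str:
    """Return an extraction prompt written in the review's own language.

    Raises instead of silently falling back to English, because an English
    fallback would reintroduce the translate-first problem.
    """
    try:
        template = EXTRACTION_PROMPTS[lang]
    except KeyError:
        raise ValueError(f"no native extraction prompt for language {lang!r}") from None
    return template.format(text=text)


prompt = build_extraction_prompt("de", "Die Akkulaufzeit ist hervorragend.")
```

The point of the design is that the review text is never translated: only the instructions change per language, and the labels come back in that language.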

Step 2: Map to canonical properties.

After extraction, raw labels are mapped to a controlled vocabulary:

Raw label            | Source      | Canonical property
battery life         | UK review   | BATTERY_LIFE
akkulaufzeit         | DE review   | BATTERY_LIFE
autonomie batterie   | FR review   | BATTERY_LIFE
autonomia            | IT review   | BATTERY_LIFE
battery              | YouTube UK  | BATTERY_LIFE

Now you can ask "how does BATTERY_LIFE sentiment compare across UK / DE / FR / IT?" — and the answer is real, not a translation artefact.
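A minimal sketch of that cross-market question, assuming extraction has already produced (market, raw label, sentiment) rows. The sentiment values are made up for illustration, the source column is simplified to a market code, and `LABEL_TO_CANONICAL` and `sentiment_by_market` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean

# Illustrative subset of the mapping table: (market, raw label) -> canonical.
LABEL_TO_CANONICAL = {
    ("UK", "battery life"): "BATTERY_LIFE",
    ("UK", "battery"): "BATTERY_LIFE",
    ("DE", "akkulaufzeit"): "BATTERY_LIFE",
    ("FR", "autonomie batterie"): "BATTERY_LIFE",
    ("IT", "autonomia"): "BATTERY_LIFE",
}

# Made-up extraction output: (market, raw label, sentiment in [-1, 1]).
snippets = [
    ("UK", "battery life", 0.5),
    ("UK", "battery", 0.25),
    ("DE", "Akkulaufzeit", -0.5),
    ("FR", "autonomie batterie", 0.75),
    ("IT", "autonomia", 0.0),
]


def sentiment_by_market(rows, canonical):
    """Average sentiment per market for one canonical property."""
    scores = defaultdict(list)
    for market, raw_label, sentiment in rows:
        # Normalise case and whitespace so "Akkulaufzeit" matches "akkulaufzeit".
        key = (market, raw_label.strip().casefold())
        if LABEL_TO_CANONICAL.get(key) == canonical:
            scores[market].append(sentiment)
    return {market: mean(values) for market, values in scores.items()}


result = sentiment_by_market(snippets, "BATTERY_LIFE")
```

Because the grouping key is the canonical property rather than the raw label, the UK rows labelled "battery life" and "battery" land in the same bucket, while each market's score still comes from text extracted in its own language.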

Why this matters

Three reasons consumer brands and research firms care about this:

01 — Multi-market launches need real perception comparison. If you're launching in 6 markets, you need to know which feature is the strongest sell in each. Translation-based tools give you 6 identical answers.

02 — Source-language vocabulary reveals positioning gaps. The German term "umweltbewusst" (environmentally aware) appears in Bose German reviews 4× more than the English equivalent. A translation-first pipeline would miss this entirely.

03 — Editorial and influencer extraction depends on it. The French YouTube creator ecosystem talks about cameras differently from the UK one. Native extraction preserves the difference; harmonisation makes it comparable.

Implementation

The canonical vocabulary is stored in a label_harmonisation table. Every snippet stored in rag_snippets carries its raw label and links to the canonical property via this mapping.
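One way this layout could look, sketched with an in-memory SQLite database. Only the table names `label_harmonisation` and `rag_snippets` come from the description above; every column name, the composite key and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE label_harmonisation (
        raw_label          TEXT NOT NULL,
        source_language    TEXT NOT NULL,
        canonical_property TEXT NOT NULL,
        PRIMARY KEY (raw_label, source_language)
    );
    CREATE TABLE rag_snippets (
        id              INTEGER PRIMARY KEY,
        snippet_text    TEXT NOT NULL,
        raw_label       TEXT NOT NULL,
        source_language TEXT NOT NULL,
        FOREIGN KEY (raw_label, source_language)
            REFERENCES label_harmonisation (raw_label, source_language)
    );
""")
conn.executemany(
    "INSERT INTO label_harmonisation VALUES (?, ?, ?)",
    [("battery life", "en", "BATTERY_LIFE"), ("akkulaufzeit", "de", "BATTERY_LIFE")],
)
conn.executemany(
    "INSERT INTO rag_snippets (snippet_text, raw_label, source_language) "
    "VALUES (?, ?, ?)",
    [
        ("Battery life is great.", "battery life", "en"),
        ("Die Akkulaufzeit ist zu kurz.", "akkulaufzeit", "de"),
    ],
)

# Each snippet keeps its raw label; the canonical property is resolved by join.
rows = conn.execute("""
    SELECT s.source_language, s.raw_label, h.canonical_property
    FROM rag_snippets AS s
    JOIN label_harmonisation AS h
      ON h.raw_label = s.raw_label
     AND h.source_language = s.source_language
    ORDER BY s.id
""").fetchall()
```

Keeping the raw label on the snippet and resolving the canonical property at query time means a curator can correct a mapping in one place without rewriting stored snippets.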

The mapping itself is built and maintained as a fixed entity ontology — curated by domain experts, not LLM-discovered. This prevents canonical drift, super-node concentration, and the kind of taxonomy entropy that destroys retrieval quality over time.
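The sketch below shows what a closed vocabulary can mean in practice: a label with no curated mapping is never turned into a new canonical property automatically. The `review_queue` hand-off to a curator, the `SOUND_QUALITY` member and all function names are assumptions for illustration; only `BATTERY_LIFE` appears in the text above.

```python
from enum import Enum
from typing import Optional


class CanonicalProperty(Enum):
    BATTERY_LIFE = "BATTERY_LIFE"
    SOUND_QUALITY = "SOUND_QUALITY"  # hypothetical second member


# Curated by hand: (language, normalised raw label) -> canonical property.
CURATED_MAPPING = {
    ("en", "battery life"): CanonicalProperty.BATTERY_LIFE,
    ("de", "akkulaufzeit"): CanonicalProperty.BATTERY_LIFE,
    ("fr", "autonomie batterie"): CanonicalProperty.BATTERY_LIFE,
    ("it", "autonomia"): CanonicalProperty.BATTERY_LIFE,
}

# Unmapped labels wait here for a human decision (assumed workflow).
review_queue = []


def harmonise(lang: str, raw_label: str) -> Optional[CanonicalProperty]:
    """Look up a raw label; never invent a canonical property on the fly."""
    key = (lang, raw_label.strip().casefold())
    canonical = CURATED_MAPPING.get(key)
    if canonical is None and key not in review_queue:
        review_queue.append(key)
    return canonical


known = harmonise("de", "Akkulaufzeit")
unknown = harmonise("de", "akkuleistung")
```

Because the set of canonical properties only grows through curation, near-duplicates cannot accumulate quietly, which is one plausible guard against the drift described above.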

Strategic implication

For a research firm pitching multi-market clients, native extraction + harmonisation is a credible moat. Most competitor tools either (a) skip non-English markets, (b) translate first and lose signal, or (c) report per-language only and never compare.

Firms that have solved harmonisation are the ones winning multi-market work that others can't deliver.