The principle
There is a tempting failure mode in AI-driven market research: let the LLM build the ontology. Ask GPT to "find the themes" and use whatever comes out as the entity schema.
This is a disaster at scale. The themes drift between runs. New batches reshuffle the taxonomy. Cross-language coherence collapses. Super-nodes (single concepts matching >50% of content) form and destroy retrieval quality.
The fix: fixed entity architecture (Adamchic, 2024). Domain experts define the ontology. LLMs only do the work LLMs are good at — extraction, sentiment, harmonisation in source language.
How Theia applies it
Three layers, each with a clear role:
Layer 1 — Fixed ontology (expert-curated)
- label_harmonisation: canonical properties
- stackline_asin_clusters: segment names
- product_taxonomy: model families and brand mappings
These are stable. They don't get rewritten by every enrichment run.
Layer 2 — Documents (linked via harmonisation)
- rag_snippets: every piece of extracted content
- Linked to Layer 1 via string match on the harmonisation table, not cosine similarity
- Cheaper and more precise than vector match when the mapping is curated
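As a minimal sketch of that lookup, assuming the harmonisation table can be read as a plain mapping from raw label to canonical property: the table contents and the helper names `normalise` and `link_snippet` are hypothetical, chosen only to illustrate the idea, and are not Theia's actual schema.

```python
from typing import Optional

# Hypothetical excerpt of a curated harmonisation table:
# raw label (as extracted, in source language) -> canonical property.
LABEL_HARMONISATION = {
    "auto focus": "autofocus",
    "autofocus": "autofocus",
    "af speed": "autofocus",
    "battery life": "battery_life",
    "akkulaufzeit": "battery_life",
}


def normalise(label: str) -> str:
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())


def link_snippet(raw_label: str) -> Optional[str]:
    """Return the canonical property for a raw label, or None if it is unmapped."""
    return LABEL_HARMONISATION.get(normalise(raw_label))
```

Because the link is a dictionary lookup, the same label resolves to the same canonical property on every run, and an unmapped label returns `None` rather than a nearest-neighbour guess.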
Layer 3 — NLP entities (regex + spaCy + small models)
- Product names via regex + fuzzy matching (product_resolver)
- Keyword type classification (generic / brand / product)
- spaCy lemmatisation in the harmonisation pipeline
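The regex + fuzzy matching step might look roughly like the sketch below. It is not the actual `product_resolver`: the catalogue in `KNOWN_MODELS`, the function name `resolve_product`, and the 0.8 cutoff are all assumptions, and the standard library's `difflib` stands in for whatever fuzzy matcher is really used.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical catalogue of canonical model names (in practice this would
# come from a curated table such as product_taxonomy).
KNOWN_MODELS = ["alpha 7 iv", "eos r6", "z6 ii"]


def resolve_product(text: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the best-matching known model mentioned in text, or None."""
    # Regex step: reduce the text to lower-case alphanumeric tokens,
    # which removes punctuation variants such as "EOS-R6".
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    # Candidate spans: every run of one to three consecutive tokens.
    candidates = sorted({
        " ".join(tokens[i:i + n])
        for n in (1, 2, 3)
        for i in range(len(tokens) - n + 1)
    })

    # Fuzzy step: keep the model whose similarity ratio is highest,
    # provided it scores above the cutoff.
    best_model, best_score = None, cutoff
    for candidate in candidates:
        for model in KNOWN_MODELS:
            score = SequenceMatcher(None, candidate, model).ratio()
            if score > best_score:
                best_model, best_score = model, score
    return best_model
```

For example, `resolve_product("Love my EOS-R6, the autofocus is great")` resolves to `"eos r6"`, while a review that mentions no known model returns `None`.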
Why "LLMs for extraction, math for connections"
LLMs are essential where semantic understanding is genuinely required:
- Feature extraction from messy review text
- Cross-language sentiment scoring
- Resolving "this thing is great" to a specific feature
But graph connections should not use LLM judgement. They use:
- Cosine similarity for product edges
- TF-IDF and HHI for keyword distinctiveness
- Leiden for clustering
- String match for harmonisation lookup
This keeps the graph reproducible and fast, with no further LLM cost after the initial extraction.
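Two of these measures can be sketched in a few lines. The HHI formulation below (sum of squared per-segment shares of a keyword's mentions) is one plausible reading of "HHI for keyword distinctiveness", not a confirmed description of Theia's scoring, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def hhi(counts_by_segment: Dict[str, int]) -> float:
    """Herfindahl-Hirschman index of a keyword's mentions across segments.

    1.0 means every mention falls in one segment (highly distinctive);
    1/n means the mentions are spread evenly over n segments.
    """
    total = sum(counts_by_segment.values())
    if total == 0:
        return 0.0
    return sum((count / total) ** 2 for count in counts_by_segment.values())


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both functions are deterministic: the same counts and vectors always produce the same edges and scores, which is the property an LLM judgement call cannot guarantee.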
Super-node prevention
A super-node is a single ontology concept that matches >50% of content. It destroys retrieval ("autofocus" matches every camera review, so it answers nothing). Theia prevents super-nodes through three mechanisms:
- UNCATEGORISED filtering — any extraction that didn't map cleanly is filtered from analytical queries
- Distinctiveness scoring — concepts must be specific to a segment to count
- Segment lookup pattern: exact cluster match, never LIKE-pattern matching
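A rough sketch of two of these checks follows, under stated assumptions: documents are represented as sets of concept labels, the `clusters` table with `asin` and `segment` columns is hypothetical, and the 50% threshold comes from the definition above. The first function flags super-nodes while ignoring UNCATEGORISED; the second shows why an exact segment match is safer than a LIKE pattern.

```python
import sqlite3
from collections import Counter
from typing import Dict, List, Set, Tuple


def find_super_nodes(doc_concepts: List[Set[str]],
                     threshold: float = 0.5) -> Dict[str, float]:
    """Return concepts that match more than `threshold` of all documents."""
    n_docs = len(doc_concepts)
    if n_docs == 0:
        return {}
    counts = Counter(
        concept
        for concepts in doc_concepts
        for concept in concepts
        if concept != "UNCATEGORISED"  # unmapped extractions are filtered out
    )
    return {c: k / n_docs for c, k in counts.items() if k / n_docs > threshold}


def segment_lookup_demo() -> Tuple[list, list]:
    """Compare an exact segment match with a LIKE pattern on toy data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clusters (asin TEXT, segment TEXT)")
    conn.executemany(
        "INSERT INTO clusters VALUES (?, ?)",
        [("A1", "mirrorless"), ("A2", "mirrorless_full_frame")],
    )
    exact = conn.execute(
        "SELECT asin FROM clusters WHERE segment = ? ORDER BY asin",
        ("mirrorless",),
    ).fetchall()
    loose = conn.execute(
        "SELECT asin FROM clusters WHERE segment LIKE ? ORDER BY asin",
        ("mirrorless%",),
    ).fetchall()
    conn.close()
    return exact, loose
```

In the toy data the exact match returns only the product in the `mirrorless` segment, while the LIKE pattern also pulls in `mirrorless_full_frame`, silently widening the segment.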
Strategic implication
For research firms and platform vendors building intelligence layers: don't let the LLM define the schema. Hire the domain expert to define it once, then use LLMs for the work that requires language understanding.
The teams that learned this lesson have stable graphs that compound in value. The teams that didn't are rebuilding their ontology every quarter.
Source
This principle is articulated cleanly in Adamchic (2024), "Fixed Entity Architecture for Production RAG". Theia's data architecture is one production-scale implementation.