What is Fixed entity architecture?

The principle that the market ontology should be defined by domain experts, not discovered by an LLM. Canonical properties, segments, and product taxonomies are curated. LLMs handle extraction; math handles connections.


The principle

There is a tempting failure mode in AI-driven market research: let the LLM build the ontology. Ask GPT to "find the themes" and use whatever comes out as the entity schema.

This is a disaster at scale. The themes drift between runs. New batches reshuffle the taxonomy. Cross-language coherence collapses. Super-nodes (single concepts matching >50% of content) form and destroy retrieval quality.

The fix: fixed entity architecture (Adamchic, 2024). Domain experts define the ontology. LLMs do only the work LLMs are good at: extraction, sentiment, and harmonisation in the source language.

How Theia applies it

Three layers, each with a clear role:

Layer 1 — Fixed ontology (expert-curated)

label_harmonisation: canonical properties
stackline_asin_clusters: segment names
product_taxonomy: model families and brand mappings

These are stable. They don't get rewritten by every enrichment run.

Layer 2 — Documents (linked via harmonisation)

rag_snippets: every piece of extracted content
Linked to Layer 1 via exact string match on the harmonisation table, not cosine similarity
Cheaper and more precise than vector matching when the mapping is curated
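
To make the Layer 2 link concrete, here is a minimal sketch of a string-match lookup against a curated mapping. The table contents, the normalise step, and the function names are assumptions made for illustration; they are not taken from Theia's actual label_harmonisation table.

```python
# Hypothetical stand-in for the curated harmonisation table:
# normalised raw label -> canonical property. Contents are illustrative.
HARMONISATION = {
    "battery life": "battery_life",
    "battery duration": "battery_life",
    "auto focus": "autofocus",
    "autofocus": "autofocus",
}

UNCATEGORISED = "UNCATEGORISED"


def normalise(label: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(label.lower().split())


def link_snippet(raw_label: str) -> str:
    """Return the canonical property for a raw label via exact string match.

    Anything absent from the curated table falls through to UNCATEGORISED
    instead of being guessed by a similarity score.
    """
    return HARMONISATION.get(normalise(raw_label), UNCATEGORISED)


print(link_snippet("  Battery   Life "))
print(link_snippet("zoom ring feel"))
```

A dictionary lookup is deterministic and constant-time, which is where the "cheaper and more precise" claim comes from: the precision lives in the curated table, not in a similarity threshold.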

Layer 3 — NLP entities (regex + spaCy + small models)

Product names via regex + fuzzy matching (product_resolver)
Keyword type classification (generic / brand / product)
spaCy lemmatisation in the harmonisation pipeline
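
As a rough illustration of the first two Layer 3 items, the sketch below resolves product mentions with a regex clean-up followed by fuzzy matching, then classifies keywords as generic, brand, or product. It uses the standard-library difflib rather than whatever product_resolver uses internally; the catalogue, the brand list, and the 0.8 cutoff are all hypothetical.

```python
import re
from difflib import get_close_matches
from typing import Optional

# Hypothetical catalogue and brand list (illustrative only).
CATALOGUE = ["alpha 7 iv", "alpha 7c ii", "eos r6 mark ii"]
BRANDS = {"sony", "canon"}


def resolve_product(mention: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a messy product mention to a catalogue entry, or None."""
    # Regex step: strip punctuation, then collapse whitespace.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", mention.lower())
    cleaned = " ".join(cleaned.split())
    # Fuzzy step: best catalogue entry above the similarity cutoff.
    matches = get_close_matches(cleaned, CATALOGUE, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def classify_keyword(keyword: str) -> str:
    """Classify a keyword as 'product', 'brand', or 'generic'."""
    if resolve_product(keyword) is not None:
        return "product"
    tokens = set(re.findall(r"[a-z0-9]+", keyword.lower()))
    if tokens & BRANDS:
        return "brand"
    return "generic"


print(resolve_product("Alpha-7 IV"))
print(classify_keyword("sony camera"))
print(classify_keyword("best mirrorless camera"))
```

Whole-string fuzzy matching like this will miss product names embedded in longer queries; a production resolver would likely extract candidate spans first.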

Why "LLMs for extraction, math for connections"

LLMs are essential where semantic understanding is genuinely required:

Feature extraction from messy review text
Cross-language sentiment scoring
Resolving "this thing is great" to a specific feature

But graph connections should not use LLM judgement. They use:

Cosine similarity for product edges
TF-IDF and HHI (Herfindahl-Hirschman Index) for keyword distinctiveness
Leiden for clustering
String match for harmonisation lookup

This keeps the graph reproducible and fast, with no further LLM cost after the initial extraction.
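
Two of these measures are short enough to sketch directly. The code below shows cosine similarity between two vectors and an HHI-style concentration score computed over a keyword's counts per segment. Treating a high HHI as "distinctive to one segment" is an assumption about how the score is read here, and the input shapes are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def hhi(counts_by_segment: Dict[str, int]) -> float:
    """Herfindahl-Hirschman Index: sum of squared shares across segments.

    1.0 means the keyword occurs in a single segment; 1/n means it is
    spread evenly over n segments.
    """
    total = sum(counts_by_segment.values())
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts_by_segment.values())


print(hhi({"compact": 10, "mirrorless": 0, "dslr": 0}))
print(hhi({"compact": 5, "mirrorless": 5}))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Both functions are pure arithmetic over stored counts and vectors, so rerunning them on the same inputs gives the same edges and scores, unlike a repeated LLM call.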

Super-node prevention

A super-node is a single ontology concept that matches >50% of content. It destroys retrieval ("autofocus" matches every camera review, so it answers nothing). Theia prevents super-nodes through three mechanisms:

UNCATEGORISED filtering — any extraction that didn't map cleanly is filtered from analytical queries
Distinctiveness scoring — concepts must be specific to a segment to count
Segment lookup pattern — exact cluster match, never LIKE-pattern matching
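
The checks above can be sketched in a few lines. The first function flags any concept whose document coverage exceeds the 50% threshold; the second applies an exact segment match and drops UNCATEGORISED rows. The data shapes and field names ("segment", "property") are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def find_super_nodes(doc_concepts: List[Set[str]],
                     threshold: float = 0.5) -> Dict[str, float]:
    """Return concepts whose share of documents exceeds the threshold."""
    n_docs = len(doc_concepts)
    if n_docs == 0:
        return {}
    counts: Counter = Counter()
    for concepts in doc_concepts:
        counts.update(set(concepts))  # count each concept once per document
    return {c: k / n_docs for c, k in counts.items() if k / n_docs > threshold}


def snippets_for_segment(snippets: Iterable[dict], segment: str) -> List[dict]:
    """Exact segment match, with unmapped extractions filtered out."""
    return [
        s for s in snippets
        if s["segment"] == segment and s["property"] != "UNCATEGORISED"
    ]


docs = [
    {"autofocus", "battery_life"},
    {"autofocus"},
    {"autofocus", "weight"},
    {"weight"},
]
print(find_super_nodes(docs))

rows = [
    {"segment": "compact", "property": "weight"},
    {"segment": "compact_premium", "property": "weight"},
    {"segment": "compact", "property": "UNCATEGORISED"},
]
print(snippets_for_segment(rows, "compact"))
```

The equality test on segment is the point of the third mechanism: a LIKE-style prefix match on "compact" would also pull in "compact_premium" and blur the segment boundary.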

Strategic implication

For research firms and platform vendors building intelligence layers: don't let the LLM define the schema. Hire the domain expert to define it once, then use LLMs for the work that requires language understanding.

The teams that learned this lesson have stable graphs that compound in value. The teams that didn't are rebuilding their ontology every quarter.

Source

This principle is articulated cleanly in Adamchic (2024), "Fixed Entity Architecture for Production RAG". Theia's data architecture is one production-scale implementation.
