
HHI-weighted distinctiveness

How Theia ranks which keywords define which market segments. Borrows the Herfindahl-Hirschman concentration index from antitrust economics and applies it to keyword × cluster traffic distributions.

The problem this solves

Once Leiden has clustered the market, every cluster needs a name — and "name" really means "which keyword best represents what this cluster is about."

Naive approaches fail:

  • Highest-traffic keyword in the cluster picks generic head terms ("camera") that appear in every cluster
  • Most-frequent keyword picks brand names ("canon"), which aren't segments
  • Manual naming doesn't scale and isn't reproducible

The keyword that should win is the one whose traffic is most concentrated in this cluster rather than spread across many clusters.

What HHI measures

The Herfindahl-Hirschman Index is a concentration measure used by competition regulators to assess market structure. For a set of shares s_1, s_2, ..., s_n summing to 1:

HHI = Σ (s_i)²   for i = 1 … n

Here s_i is the share of one keyword's traffic that falls in cluster i, and n is the number of clusters.

  • HHI = 1: complete concentration (one cluster owns 100% of the keyword's traffic)
  • HHI = 1/n: even spread across all n clusters
  • HHI closer to 1: the keyword is distinctive to a small number of clusters
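
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not Theia's implementation: the function name hhi, the dictionary input shape, and the cluster labels are assumptions made for the example.

```python
def hhi(traffic_by_cluster):
    """Herfindahl-Hirschman Index of one keyword's traffic across clusters.

    traffic_by_cluster: hypothetical mapping {cluster_id: traffic}.
    Raw traffic is normalised to shares that sum to 1 before squaring.
    """
    total = sum(traffic_by_cluster.values())
    if total <= 0:
        return 0.0  # assumption: a keyword with no traffic scores zero
    return sum((t / total) ** 2 for t in traffic_by_cluster.values())


# One cluster owns all the traffic: complete concentration.
concentrated = hhi({"a": 100, "b": 0, "c": 0, "d": 0})

# Traffic split evenly over n = 4 clusters: HHI = 1/n.
even = hhi({"a": 25, "b": 25, "c": 25, "d": 25})

print(concentrated, even)  # 1.0 0.25
```

The two calls reproduce the boundary cases in the list: 1 for a single owning cluster and 1/n for an even spread.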

How Theia uses it

For every keyword × cluster pair, we compute:

distinctiveness = HHI(keyword across clusters) × traffic in this cluster

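The first term rewards concentration. The second term rewards relevance, so a distinctive but tiny keyword does not win. Because the HHI is computed once per keyword, it is identical for every cluster that keyword appears in; the traffic term is what ties the score to a specific cluster.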

The keyword that wins is distinctive AND material.
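
A rough sketch of the scoring and naming step follows, assuming traffic arrives as a nested mapping of keyword to cluster to traffic. The function names, the input shape, and the traffic figures are all invented for illustration and are not taken from Theia or the Canon EU data.

```python
def distinctiveness_scores(traffic):
    """Score every keyword x cluster pair as HHI(keyword) * traffic in cluster.

    traffic: hypothetical mapping {keyword: {cluster: traffic}}.
    Returns {(keyword, cluster): score}.
    """
    scores = {}
    for keyword, by_cluster in traffic.items():
        total = sum(by_cluster.values())
        if total <= 0:
            continue  # assumption: skip keywords with no traffic
        # HHI is computed once per keyword from its traffic shares.
        concentration = sum((t / total) ** 2 for t in by_cluster.values())
        for cluster, t in by_cluster.items():
            scores[(keyword, cluster)] = concentration * t
    return scores


def name_clusters(scores):
    """Pick the highest-scoring keyword for each cluster."""
    best = {}
    for (keyword, cluster), score in scores.items():
        if cluster not in best or score > best[cluster][1]:
            best[cluster] = (keyword, score)
    return {cluster: keyword for cluster, (keyword, _) in best.items()}


# Made-up traffic figures, for illustration only.
traffic = {
    "camera": {"mirrorless": 1000, "dslr": 1000, "compact": 1000, "ink": 1000},
    "mirrorless camera": {"mirrorless": 900, "dslr": 100},
    "pg-540 ink": {"ink": 300},
}

scores = distinctiveness_scores(traffic)
names = name_clusters(scores)
print(names["mirrorless"], "|", names["ink"])
```

In this toy data "camera" has four times the raw traffic of any competitor in the mirrorless cluster, yet its even spread (HHI 0.25) cuts its score to 250, below the roughly 738 scored by "mirrorless camera". The head term still wins the clusters where no concentrated keyword competes, so a production pipeline may need extra filtering; whether Theia applies any is not stated here.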

Examples from Canon EU

Keyword                      HHI   Best cluster             Outcome
camera                       0.04  many                     Rejected: too spread
canon                        0.08  many                     Rejected: brand term
mirrorless camera            0.62  mirrorless cameras       Cluster name
spiegellose kamera           0.71  mirrorless cameras (DE)  Merged after cross-language step
wildlife photography camera  0.89  pro mirrorless           Sub-segment defining
pg-540 ink                   0.94  canon 540/541 ink        Cartridge family defining

Why this matters strategically

Distinctive keywords reveal what consumers think a segment IS.

If "wildlife photography camera" is highly distinctive to the pro mirrorless cluster, that's not just a naming convenience — it tells you that wildlife is the defining use case the market associates with pro mirrorless. Marketing copy, retailer category trees, and content briefs should reflect that.

We persist distinctiveness scores at two granularities:

  • distinctive_keywords_segment — which keywords define a market segment
  • distinctive_keywords_product — which keywords define a specific product

The second one is the basis for SEO and Amazon listing strategy: target the keywords your product is distinctively associated with, not the head terms where everyone fights for crumbs.