Leiden community detection

The clustering algorithm Theia uses to find market segments in keyword × product graphs. Successor to Louvain, with connectivity guarantees that Louvain lacks. Theia runs it with Surprise optimisation, which keeps small, well-separated communities from being merged.

What it is

Leiden is a community detection algorithm — given a graph (nodes connected by weighted edges), it finds clusters of nodes that are more densely connected to each other than to the rest of the graph.

It was introduced by Traag, Waltman & van Eck (2019) as a fix for Louvain's known failure mode: Louvain can produce badly connected, and even internally disconnected, communities. This happens when a node that bridged two parts of a community is moved to another community, leaving the remaining members split. Leiden adds a refinement phase that guarantees every community is internally connected.
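
As a minimal sketch of what "internally connected" means, the check below flags communities whose members cannot all reach each other through edges inside the community. The function name, the toy edge list, and the membership dict are illustrative assumptions, not part of Theia or of any Leiden library.

```python
from collections import defaultdict, deque


def disconnected_communities(edges, membership):
    """Return the labels of communities whose induced subgraph is not connected.

    edges: iterable of undirected (u, v) pairs.
    membership: dict mapping node -> community label.
    """
    # Keep only edges whose two endpoints share a community.
    adj = defaultdict(set)
    for u, v in edges:
        if membership[u] == membership[v]:
            adj[u].add(v)
            adj[v].add(u)

    members = defaultdict(set)
    for node, comm in membership.items():
        members[comm].add(node)

    bad = set()
    for comm, nodes in members.items():
        # Breadth-first search from an arbitrary member of the community.
        start = next(iter(nodes))
        seen = {start}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if seen != nodes:
            bad.add(comm)
    return bad


# Toy case: "x" used to bridge a-b and c-d, then was moved to community 1.
edges = [("a", "b"), ("b", "x"), ("x", "c"), ("c", "d"), ("x", "y"), ("y", "z")]
membership = {"a": 0, "b": 0, "c": 0, "d": 0, "x": 1, "y": 1, "z": 1}
print(disconnected_communities(edges, membership))  # {0}
```

Community 0 is the Louvain-style failure: a-b and c-d share a label but no path inside the community. A partition that satisfies Leiden's guarantee would return an empty set here.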

Why Theia uses it

The Theia market structure is a bipartite graph: keywords on one side, products (or URLs) on the other, with edges weighted by CTR-adjusted traffic.

To find market segments, we need to partition this graph into groups where:

  1. Products inside a segment compete for similar keyword pockets
  2. Keywords inside a segment route traffic to similar products
  3. The partition is stable — re-running on slightly different data shouldn't reshuffle clusters

Leiden delivers all three. Louvain fails criterion 3 often enough to be unsafe for production use.
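
One way to make criterion 3 measurable is to compare the partitions produced by two runs. How Theia measures stability is not stated here; the sketch below assumes the plain Rand index (the fraction of node pairs on which both runs agree), and the function name and sample data are hypothetical.

```python
from itertools import combinations


def pair_agreement(part_a, part_b):
    """Rand index between two partitions given as {node: cluster_label} dicts.

    A pair of nodes "agrees" when both partitions put the two nodes together,
    or both keep them apart. Only nodes present in both runs are compared.
    """
    common = sorted(set(part_a) & set(part_b))
    pairs = list(combinations(common, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (part_a[u] == part_a[v]) == (part_b[u] == part_b[v])
        for u, v in pairs
    )
    return agree / len(pairs)


run_1 = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
run_2 = {"p1": 7, "p2": 7, "p3": 3, "p4": 3}  # same grouping, different labels
run_3 = {"p1": 0, "p2": 1, "p3": 1, "p4": 1}  # p2 changed cluster

print(pair_agreement(run_1, run_2))  # 1.0
print(pair_agreement(run_1, run_3))  # 0.5
```

The score ignores label names, so it only drops when the grouping itself changes. With many clusters the plain Rand index sits close to 1 even for unrelated partitions, so a chance-corrected variant is usually the better choice in practice.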

Surprise optimisation

Leiden can optimise different quality functions. We use Surprise rather than the default Modularity because:

  • Modularity has a known resolution limit — it merges small communities even when they shouldn't be merged. In market data, this means premium photo printers get absorbed into a generic "printers" cluster.
  • Surprise is largely free of the resolution limit. It scores a partition by how statistically unlikely its share of within-community edges would be under a null model of random edges, and favours the most surprising partition.
  • Surprise is more sensitive to small, distinct clusters — which is exactly what consumer markets are made of.

The Canon EU market has 50+ distinct product segments. Modularity finds about 12. Surprise finds all 50+.
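
The resolution limit can be reproduced on a textbook toy graph: a ring of 30 five-node cliques, each joined to its neighbour by a single edge. The sketch below assumes the asymptotic form of Surprise, m · KL(q ‖ ⟨q⟩), where q is the observed share of within-community edges and ⟨q⟩ the share expected at random; whether Theia uses exactly this form is an assumption. All function names are illustrative.

```python
import math
from collections import Counter


def ring_of_cliques(n_cliques, clique_size):
    """Edge list for n_cliques complete graphs joined in a ring by single edges."""
    edges = []
    for c in range(n_cliques):
        base = c * clique_size
        nodes = range(base, base + clique_size)
        edges += [(u, v) for u in nodes for v in nodes if u < v]
        next_base = ((c + 1) % n_cliques) * clique_size
        edges.append((base, next_base))  # bridge to the next clique
    return edges


def modularity(edges, membership):
    """Newman modularity for an unweighted, undirected graph."""
    m = len(edges)
    internal, degree = Counter(), Counter()
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def asymptotic_surprise(edges, membership):
    """m * KL(q || q_expected) for an unweighted, undirected graph."""
    m = len(edges)
    n = len(membership)
    sizes = Counter(membership.values())
    q = sum(1 for u, v in edges if membership[u] == membership[v]) / m
    possible_internal = sum(s * (s - 1) // 2 for s in sizes.values())
    q_expected = possible_internal / (n * (n - 1) // 2)

    def term(a, b):
        return a * math.log(a / b) if a > 0 else 0.0

    return m * (term(q, q_expected) + term(1 - q, 1 - q_expected))


edges = ring_of_cliques(30, 5)
separate = {v: v // 5 for v in range(150)}   # one community per clique
merged = {v: v // 10 for v in range(150)}    # adjacent cliques merged in pairs

print(round(modularity(edges, separate), 3))  # 0.876
print(round(modularity(edges, merged), 3))    # 0.888
print(asymptotic_surprise(edges, separate) > asymptotic_surprise(edges, merged))  # True
```

Modularity scores the merged partition higher even though each clique is an obvious community of its own, while Surprise prefers the cliques kept apart.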

How Theia applies Leiden

The full process:

  1. Build the bipartite keyword × product graph
  2. Weight edges by CTR-adjusted traffic (so a position-1 ranking counts 4× as much as a position-4 ranking)
  3. Apply per-keyword dampening so generic head terms ("camera") don't dominate
  4. Run Leiden with Surprise as the quality function
  5. Model reassignment: iterate until every product is in the cluster whose centroid it's most similar to
  6. Cross-language merge: detect clusters that are the same segment in different languages (e.g. "drucker" + "printer") and merge
  7. Small cluster dissolution: members of clusters with fewer than 5 members are reassigned to other clusters by an LLM
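
Steps 1 to 3 might look like the sketch below. The CTR values are hypothetical (picked only so that position 1 weighs 4× position 4), and the square-root dampening is an assumed stand-in, since the dampening function is not specified here. Keyword and product names are made-up sample data.

```python
import math
from collections import defaultdict

# Hypothetical CTR curve, chosen so that position 1 weighs 4x position 4.
CTR_BY_POSITION = {1: 0.28, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}
DEFAULT_CTR = 0.01  # assumed value for positions not listed above


def build_edges(rankings):
    """Build dampened bipartite edge weights.

    rankings: iterable of (keyword, product, position, monthly_searches).
    Returns {(keyword, product): weight}.
    """
    raw = {}
    keyword_total = defaultdict(float)
    for keyword, product, position, searches in rankings:
        traffic = searches * CTR_BY_POSITION.get(position, DEFAULT_CTR)
        raw[(keyword, product)] = raw.get((keyword, product), 0.0) + traffic
        keyword_total[keyword] += traffic

    # Assumed dampening: divide by the square root of the keyword's total
    # traffic, so a keyword's total weight grows only with sqrt(traffic).
    return {
        (keyword, product): traffic / math.sqrt(keyword_total[keyword])
        for (keyword, product), traffic in raw.items()
        if traffic > 0
    }


rankings = [
    ("camera", "product-a", 1, 100_000),
    ("camera", "product-b", 4, 100_000),
    ("photo printer a3", "product-c", 1, 400),
]
edges = build_edges(rankings)

# Within one keyword the CTR ratio is preserved.
print(round(edges[("camera", "product-a")] / edges[("camera", "product-b")], 2))  # 4.0
```

Dampening leaves the ratio between positions on the same keyword untouched, but it shrinks the gap between the head term and the niche term, which is what stops "camera" from dominating the graph.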

The result is 50+ market segments in a broad category, refreshable in under 10 minutes from cached Parquet files.
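
Step 5 resembles a k-means style loop. The sketch below is an illustration under assumptions: products are represented as sparse keyword-weight vectors, a centroid is the sum of its members' vectors, and similarity is cosine. None of these choices is confirmed here, and the function names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reassign(vectors, membership, max_rounds=20):
    """Move each product to its most similar centroid until nothing moves."""
    membership = dict(membership)  # do not mutate the caller's dict
    for _ in range(max_rounds):
        # Centroids are recomputed once per round from current assignments.
        centroids = defaultdict(lambda: defaultdict(float))
        for product, cluster in membership.items():
            for keyword, weight in vectors[product].items():
                centroids[cluster][keyword] += weight

        moved = False
        for product, current in list(membership.items()):
            # On ties, prefer the current cluster to avoid needless moves.
            best = max(
                centroids,
                key=lambda c: (cosine(vectors[product], centroids[c]), c == current),
            )
            if best != current:
                membership[product] = best
                moved = True
        if not moved:
            break
    return membership


vectors = {
    "p1": {"photo printer": 5.0, "a3 printer": 3.0},
    "p2": {"photo printer": 4.0, "a3 printer": 4.0},
    "p3": {"office printer": 6.0, "laser printer": 2.0},
    "p4": {"office printer": 5.0, "laser printer": 3.0},
    "p5": {"photo printer": 3.0, "a3 printer": 1.0},  # starts in the wrong cluster
}
initial = {"p1": "photo", "p2": "photo", "p3": "office", "p4": "office", "p5": "office"}

print(reassign(vectors, initial)["p5"])  # photo
```

The `max_rounds` cap is a safety net: batch reassignment of this kind usually settles in a few rounds, but convergence is not guaranteed in general.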

Why this matters strategically

Most market research firms give you 5 segments because that's what a survey-based methodology can resolve. Leiden on search data finds the segments that actually exist in consumer search behaviour: typically 8 to 15 in a focused category, 50+ in a broad one.

The difference between "5 invented segments" and "50 discovered segments" is the difference between a positioning slide and a strategy.