Clustering is the process that turns a flat list of ranking queries into a much smaller set of meaningful topics.
A typical project pulls 3,000-15,000 unique queries from 90 days of Google Search Console. Looking at any one of them in isolation tells you little: you can't tell which queries share the same underlying intent and which are genuinely different topics. Clustering is what makes that data actionable.
How BCited does it
Every query gets embedded with OpenAI's text-embedding-3-small model into a 1536-dimension vector that captures its semantic meaning. Then a centroid-greedy algorithm running over Cloudflare Vectorize groups vectors that are close enough together (cosine similarity above 0.85) into the same cluster. Each cluster gets a human-readable name from GPT-4.1-mini.
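The grouping step can be sketched in a few lines. This is a minimal in-memory illustration, not BCited's implementation: it assumes queries are processed in list order, that each cluster's centroid is the re-normalized sum of its members, and that a plain loop stands in for the Cloudflare Vectorize lookup. The function names are hypothetical; only the 0.85 cosine threshold comes from the description above.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centroid_greedy(vectors, threshold=0.85):
    """Group vectors greedily; returns a list of clusters as lists of input indices.

    Assumption: a vector joins the most similar existing centroid if its
    cosine similarity is strictly above `threshold`, otherwise it seeds
    a new cluster.
    """
    clusters = []  # each entry: {"sum": [...], "centroid": [...], "members": [...]}
    for index, raw in enumerate(vectors):
        v = normalize(raw)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = dot(v, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"sum": list(v), "centroid": v, "members": [index]})
        else:
            best["members"].append(index)
            best["sum"] = [s + x for s, x in zip(best["sum"], v)]
            # Re-normalize the running sum to get the updated centroid direction.
            best["centroid"] = normalize(best["sum"])
    return [cluster["members"] for cluster in clusters]


# Toy 2-dimensional "embeddings"; real ones would have 1536 dimensions.
toy_vectors = [[1, 0], [0.99, 0.1], [0, 1], [0.1, 0.99], [0.7, 0.7]]
groups = centroid_greedy(toy_vectors)
```

With the toy vectors, the first two inputs share a cluster, the next two share another, and the diagonal vector seeds its own because it is not within 0.85 of either centroid. The result of a greedy pass can depend on input order, which is one reason to treat this as a sketch only.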
The result: a project with 8,000 GSC queries becomes ~40-80 topical clusters the dashboard can present, score, and brief.
Why not k-means or HDBSCAN
Both work in theory. Centroid-greedy + Vectorize wins on three things: it runs within a single Worker request's budget; it handles arbitrary cluster shapes (normally HDBSCAN's specialty); and the cluster centroid lookup at query time is O(log n), the same pattern that internal-link suggestion and brief generation use later in the pipeline.
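To make the query-time lookup concrete, here is a rough sketch of matching a new vector to its nearest cluster centroid. It is a brute-force O(n) scan for illustration only; the O(log n) behavior described above would come from a vector index, which this sketch does not model. The function names, the example cluster names, and the reuse of the 0.85 threshold at lookup time are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cluster(query_vector, centroids, threshold=0.85):
    """Return (cluster_name, similarity) for the closest centroid.

    `centroids` maps a cluster name to its centroid vector. Returns None
    when no centroid is strictly above `threshold` (an assumed cutoff).
    """
    best_name, best_sim = None, -1.0
    for name, centroid in centroids.items():
        sim = cosine_similarity(query_vector, centroid)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is None or best_sim <= threshold:
        return None
    return best_name, best_sim


# Hypothetical two-cluster project with toy 2-dimensional centroids.
example_centroids = {"pricing": [1.0, 0.0], "setup": [0.0, 1.0]}
match = nearest_cluster([0.9, 0.1], example_centroids)
no_match = nearest_cluster([1.0, 1.0], example_centroids)
```

Here `match` resolves to the "pricing" cluster, while `no_match` is `None` because the diagonal vector is not close enough to either centroid.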