Vectorize is Cloudflare's managed vector database, designed for Workers-native semantic search. Vectors are stored with metadata, indexed for fast approximate nearest-neighbor (ANN) lookup, and queryable from any Worker without a network hop outside Cloudflare's edge.
How it fits b/cited's stack:
- Each Search Console query the project pulls gets embedded via OpenAI's `text-embedding-3-small` (1536 dimensions)
- The embedding goes into Vectorize, tagged with `project_id`, `user_id`, and the source query text
- During clustering, we run `vectorize.query()` against each unclustered query to find the most similar neighbors above a 0.85 cosine similarity threshold
- Greedy assignment groups neighbors into clusters; the centroid recomputes as members join
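The greedy step in the list above can be sketched roughly as follows. This is a minimal illustration, not b/cited's actual code: the `Item`, `Neighbor`, `Cluster`, and `NeighborLookup` names are hypothetical, the lookup is synchronous for simplicity (a real `vectorize.query()` call is asynchronous), and `bruteForceLookup` is an in-memory stand-in so the sketch runs without Vectorize.

```typescript
// Hypothetical types for illustration only.
type Vec = number[];
interface Item { id: string; vector: Vec }
interface Neighbor { id: string; score: number }
interface Cluster { members: string[]; centroid: Vec }
// Stand-in for a nearest-neighbor query against the index.
type NeighborLookup = (item: Item) => Neighbor[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Fold one more vector into a running mean over `count` existing members.
function addToCentroid(centroid: Vec, count: number, v: Vec): Vec {
  return centroid.map((c, i) => (c * count + v[i]) / (count + 1));
}

// Each unassigned item seeds a cluster, then pulls in its unassigned
// neighbors that score at or above the threshold.
function greedyCluster(items: Item[], lookup: NeighborLookup, threshold: number = 0.85): Cluster[] {
  const byId: Record<string, Vec> = {};
  for (const item of items) byId[item.id] = item.vector;

  const assigned: Record<string, boolean> = {};
  const clusters: Cluster[] = [];
  for (const item of items) {
    if (assigned[item.id]) continue;
    const cluster: Cluster = { members: [item.id], centroid: item.vector.slice() };
    assigned[item.id] = true;
    for (const n of lookup(item)) {
      const nv = byId[n.id];
      if (!nv || assigned[n.id] || n.score < threshold) continue;
      cluster.centroid = addToCentroid(cluster.centroid, cluster.members.length, nv);
      cluster.members.push(n.id);
      assigned[n.id] = true;
    }
    clusters.push(cluster);
  }
  return clusters;
}

// Brute-force neighbor search, used here only in place of the ANN index.
function bruteForceLookup(items: Item[], topK: number = 10): NeighborLookup {
  return (item) =>
    items
      .filter((other) => other.id !== item.id)
      .map((other) => ({ id: other.id, score: cosine(item.vector, other.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
}
```

Because assignment is greedy, the result depends on the order in which queries are visited; a query joins the first cluster that claims it, even if a later seed would have been a closer match.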
The result: queries about "schema markup setup" and "structured data implementation" end up in the same cluster even though they share no exact words — because their embeddings live near each other in vector space.
Why it matters
Without a vector database, semantic clustering means computing pairwise similarities on every cluster pass, which is O(n²) in the query count. With Vectorize, each query lookup is roughly O(log n) via the ANN index, so the whole clustering pass scales as roughly O(n log n), close to linear in practice. A project with 5,000 queries clusters in under a minute instead of grinding for hours.
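To put rough numbers on that comparison, the sketch below counts operations for both approaches. The log2 cost per lookup is an assumed model for illustration; real ANN indexes vary, and the function names are hypothetical.

```typescript
// Similarity computations for an all-pairs pass: n choose 2.
function pairwiseComparisons(n: number): number {
  return (n * (n - 1)) / 2;
}

// Rough work estimate for n index lookups at about log2(n) steps each.
// Assumed cost model, not a measured property of Vectorize.
function annLookupSteps(n: number): number {
  return Math.round(n * (Math.log(n) / Math.LN2));
}
```

For 5,000 queries, the all-pairs pass needs 12,497,500 comparisons, while the lookup-based estimate comes out to roughly 61,000 steps, about 200 times fewer.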
The architecture choice — using a vector database vs in-memory similarity — is the difference between b/cited running on Workers (no native deps, no long-running processes) and the original Python implementation that used HDBSCAN + numpy.
What b/cited does with it
- One Vectorize index per environment (`bcited-prod-queries`), namespaced per project
- Embeddings cached forever; re-embedding is unnecessary unless the OpenAI model changes
- Clustering uses Vectorize's `topK` query parameter plus a similarity threshold applied to the returned match scores; centroids stored as their own vectors
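A minimal sketch of how the per-project namespace and the threshold might look in a Worker. The local interfaces below mirror only the subset of the Vectorize binding used here (in a real Worker the types come from `@cloudflare/workers-types`), and the id scheme, the `query` metadata key, the `topK` default, and all function names are assumptions for illustration. The threshold is applied to the `score` of each returned match.

```typescript
// Local stand-ins for the parts of the Vectorize binding this sketch touches.
interface VectorRecord {
  id: string;
  values: number[];
  namespace?: string;
  metadata?: Record<string, string>;
}
interface Match { id: string; score: number }
interface QueryOptions { topK?: number; namespace?: string }
interface IndexLike {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
  query(vector: number[], options?: QueryOptions): Promise<{ matches: Match[] }>;
}

const SIMILARITY_THRESHOLD = 0.85;

// Build the record for one Search Console query.
// The "projectId:queryId" id format is hypothetical.
function toRecord(
  projectId: string,
  userId: string,
  queryId: string,
  queryText: string,
  embedding: number[],
): VectorRecord {
  return {
    id: projectId + ":" + queryId,
    values: embedding,
    namespace: projectId, // one namespace per project
    metadata: { project_id: projectId, user_id: userId, query: queryText },
  };
}

// Keep matches at or above the threshold, excluding the query vector itself.
function neighborsAbove(
  matches: Match[],
  selfId: string,
  threshold: number = SIMILARITY_THRESHOLD,
): Match[] {
  return matches.filter((m) => m.id !== selfId && m.score >= threshold);
}

// Ask the index for the topK nearest vectors in the same namespace,
// then apply the similarity threshold to the returned scores.
async function findNeighbors(
  index: IndexLike,
  record: VectorRecord,
  topK: number = 20,
): Promise<Match[]> {
  const result = await index.query(record.values, { topK, namespace: record.namespace });
  return neighborsAbove(result.matches, record.id);
}
```

Inside a Worker, `index` would be the Vectorize binding from `env`, and a record would typically be written with `index.upsert([record])` before it is queried.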