What is Cloudflare Vectorize?


Cloudflare's managed vector database. b/cited stores query embeddings here for semantic clustering, turning 90 days of Search Console queries into a graph where similar intents sit near each other.

Vectorize is Cloudflare's managed vector database, designed for Workers-native semantic search. Vectors are stored with metadata, indexed for fast approximate nearest-neighbor (ANN) lookup, and queryable from any Worker without a network hop outside Cloudflare's edge.

How it fits b/cited's stack:

Each Search Console query the project pulls gets embedded via OpenAI's text-embedding-3-small (1536 dimensions)
The embedding goes into Vectorize, tagged with project_id, user_id, and the source query text
During clustering, we run vectorize.query() for each unclustered query to find its most similar neighbors above a 0.85 cosine similarity threshold
Greedy assignment groups neighbors into clusters; the centroid recomputes as members join
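
The clustering steps above can be sketched roughly as follows. This is a minimal illustration, not b/cited's actual code: the names Embedded, Cluster, neighborsAbove and greedyCluster are hypothetical, and the neighbor lookup is done by brute force in memory as a stand-in for the vectorize.query() call so the sketch runs on its own. The centroid is assumed to be a running mean of member embeddings.

```typescript
// Hypothetical shapes for illustration only.
type Embedded = { id: string; values: number[] };
type Cluster = { memberIds: string[]; centroid: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// In-memory stand-in for the index lookup (assumption: in production this
// would be a vectorize.query() call rather than a scan over every query).
function neighborsAbove(seed: Embedded, all: Embedded[], threshold: number): Embedded[] {
  return all
    .filter((q) => q.id !== seed.id)
    .map((q) => ({ q, score: cosine(seed.values, q.values) }))
    .filter((m) => m.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .map((m) => m.q);
}

// Greedy assignment: each unclustered query seeds a cluster and pulls in its
// still-unassigned neighbors; the centroid is updated as a running mean.
function greedyCluster(queries: Embedded[], threshold = 0.85): Cluster[] {
  const assigned = new Set<string>();
  const clusters: Cluster[] = [];
  for (const seed of queries) {
    if (assigned.has(seed.id)) continue;
    const cluster: Cluster = { memberIds: [seed.id], centroid: [...seed.values] };
    assigned.add(seed.id);
    for (const neighbor of neighborsAbove(seed, queries, threshold)) {
      if (assigned.has(neighbor.id)) continue;
      const size = cluster.memberIds.length;
      cluster.centroid = cluster.centroid.map(
        (c, i) => (c * size + neighbor.values[i]) / (size + 1),
      );
      cluster.memberIds.push(neighbor.id);
      assigned.add(neighbor.id);
    }
    clusters.push(cluster);
  }
  return clusters;
}
```

Because assignment is greedy, the result depends on the order in which queries are visited; a query joins the first cluster whose seed it is close enough to, which is simple and cheap but not guaranteed to be the best fit.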

The result: queries about "schema markup setup" and "structured data implementation" end up in the same cluster even though they share no exact words — because their embeddings live near each other in vector space.

Why it matters

Without a vector database, semantic clustering means computing pairwise similarities on every cluster pass, which is O(n²) in the query count. With Vectorize, each lookup goes through the ANN index and is sublinear in the index size (roughly O(log n) for typical ANN structures), so the whole clustering pass scales close to linearly, at about O(n log n). A project with 5,000 queries clusters in under a minute instead of grinding for hours.
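
To put rough numbers on that difference, the small sketch below counts the similarity computations a brute-force pass needs against the number of index lookups for the same query count. The function names are hypothetical, and the cost of each individual lookup is deliberately not modeled, since it depends on the index.

```typescript
// Number of unordered pairs among n queries: n * (n - 1) / 2.
function pairwiseComparisons(n: number): number {
  return (n * (n - 1)) / 2;
}

// With an index, a clustering pass issues one lookup per query.
function indexLookups(n: number): number {
  return n;
}

const queryCount = 5000;
console.log(pairwiseComparisons(queryCount)); // 12497500
console.log(indexLookups(queryCount)); // 5000
```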

The architecture choice — using a vector database vs in-memory similarity — is the difference between b/cited running on Workers (no native deps, no long-running processes) and the original Python implementation that used HDBSCAN + numpy.

What b/cited does with it

One Vectorize index per environment (bcited-prod-queries), namespaced per project
Embeddings cached forever — re-embedding is unnecessary unless the OpenAI model changes
Clustering uses Vectorize's topK parameter, with the 0.85 similarity threshold applied to the scores it returns; centroids are stored as their own vectors
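
As a hedged sketch of how these pieces might look from a Worker: the interfaces below are a locally declared subset modeled on the Vectorize binding's upsert() and query() methods (a real Worker would use the binding's own types), and toRecord, keepAbove and similarQueries are hypothetical helpers. The metadata key "query", the default topK of 20, and applying the cutoff to the returned scores are all assumptions, as is an index configured with the cosine metric so that higher scores mean closer matches.

```typescript
// Local subset of the binding's shapes, declared so the sketch is self-contained.
interface Match {
  id: string;
  score: number;
}
interface VectorRecord {
  id: string;
  values: number[];
  namespace?: string;
  metadata?: Record<string, string>;
}
interface IndexLike {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
  query(
    vector: number[],
    options: { topK: number; namespace?: string },
  ): Promise<{ matches: Match[] }>;
}

// Build the record for one query: namespaced per project, tagged with
// project_id, user_id and the source query text (key name "query" is assumed).
function toRecord(
  projectId: string,
  userId: string,
  queryId: string,
  queryText: string,
  embedding: number[],
): VectorRecord {
  return {
    id: queryId,
    values: embedding,
    namespace: projectId,
    metadata: { project_id: projectId, user_id: userId, query: queryText },
  };
}

// Keep only matches at or above the similarity threshold.
function keepAbove(matches: Match[], threshold: number): Match[] {
  return matches.filter((m) => m.score >= threshold);
}

// Ask the index for the topK nearest vectors in the project's namespace,
// then apply the similarity cutoff to what comes back.
async function similarQueries(
  index: IndexLike,
  embedding: number[],
  projectId: string,
  threshold = 0.85,
  topK = 20,
): Promise<Match[]> {
  const { matches } = await index.query(embedding, { topK, namespace: projectId });
  return keepAbove(matches, threshold);
}
```

Keeping the cutoff in a separate pure function makes it easy to test and to tune without touching the index call.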
