Canonicalizing LLM Labels with Embeddings and DSU

Added Oct 21, 2025
Article: Positive · Community: Positive/Mixed

The article shows how to turn the lexically inconsistent labels LLMs produce into a consistent set by embedding both content and labels, using vector similarity search as a cache, and clustering near-duplicate labels with DSU (disjoint set union). Benchmarks show fewer unique labels, a rising cache-hit rate, and costs that start ~15% higher but end up ~10x cheaper by 10K items, with latency also improving over time. A Golang package (consistent-classifier) and benchmark scripts are provided.

Key Points

  • LLMs are lexically inconsistent but semantically consistent; embeddings and clustering can reconcile label variability.
  • Pipeline: embed → similarity search cache → on hit return canonical root; on miss call LLM, then embed and cluster labels with DSU.
  • Results on 10K tweets: ~6,520 unique labels (LLM-only) vs ~1,381 (vectorized)—about 5x fewer via clustering.
  • Cache hit rate accelerates (modeled to ~95% asymptote), driving costs down from ~15% higher initially to ~10x cheaper by 10K items.
  • Latency is initially worse (~130% of the LLM-only baseline) due to the added embedding and search steps, but improves as LLM calls are increasingly avoided.
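The clustering step in the pipeline bullet rests on a standard union-find structure: labels judged similar enough are unioned, and the root of each set becomes the canonical label. As a minimal sketch (not the consistent-classifier package's actual API; the example labels and unions are illustrative assumptions, standing in for pairs whose embedding similarity exceeds a threshold):

```go
package main

import "fmt"

// DSU (disjoint set union) with path compression and union by size.
type DSU struct {
	parent []int
	size   []int
}

func NewDSU(n int) *DSU {
	d := &DSU{parent: make([]int, n), size: make([]int, n)}
	for i := range d.parent {
		d.parent[i] = i
		d.size[i] = 1
	}
	return d
}

// Find returns the root of x's set, compressing the path as it goes.
func (d *DSU) Find(x int) int {
	for d.parent[x] != x {
		d.parent[x] = d.parent[d.parent[x]] // path halving
		x = d.parent[x]
	}
	return x
}

// Union merges the sets containing a and b, attaching the smaller
// tree under the larger one.
func (d *DSU) Union(a, b int) {
	ra, rb := d.Find(a), d.Find(b)
	if ra == rb {
		return
	}
	if d.size[ra] < d.size[rb] {
		ra, rb = rb, ra
	}
	d.parent[rb] = ra
	d.size[ra] += d.size[rb]
}

func main() {
	// Hypothetical raw labels; in the real pipeline, pairs whose
	// embedding similarity clears a threshold would be unioned.
	labels := []string{"tech news", "technology news", "sports", "sport"}
	d := NewDSU(len(labels))
	d.Union(0, 1) // "tech news" ~ "technology news"
	d.Union(2, 3) // "sports" ~ "sport"
	// Each item's canonical label is its root's label.
	fmt.Println(labels[d.Find(1)], labels[d.Find(3)])
}
```

On a cache hit, returning `labels[d.Find(i)]` rather than the raw label is what keeps the output label set small as items stream in.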

Sentiment

The community is broadly constructive and engaged, with the article sparking meaningful technical discussion. While many commenters appreciate the engineering effort and the article's quality, the dominant theme is that the approach is over-engineered — experienced practitioners offer multiple simpler alternatives that they have successfully used in production. The author's active participation in defending the streaming classification use case is well-received, though the core criticism that the problem could be avoided entirely with a predefined schema persists.

In Agreement

  • Open-set labeling is a genuinely useful and under-discussed LLM capability — described as data mining in the truest sense
  • The embedding + vector cache approach is clever for reducing costs at scale as cache hit rates increase
  • The article is well-written with solid benchmarks and the author provides useful open-source code
  • For streaming/online classification where data arrives continuously, the DSU approach solves a real problem that batch methods cannot
  • Vector storage is cheap and the cost savings from skipped LLM calls easily justify the infrastructure

Opposed

  • A simpler approach works for most cases: embed everything first, cluster, then have the LLM label the clusters — this is order-invariant and avoids bias from initial label choices
  • The approach is over-engineered for the scale discussed; for 6k labels you don't need Pinecone and can do everything in memory with local models
  • You should define your classification schema upfront rather than letting the LLM freestyle, which is the root cause of the inconsistency problem
  • Local embedding models running on CPU are faster and free, making the cost analysis less compelling when you have your own hardware
  • Accuracy validation is conspicuously absent — the article doesn't prove the vectorized approach produces correct labels
  • Full-text search with proper tokenization achieves comparable results without the complexity of vector embeddings
  • The use case of training AI to post tweets raises ethical concerns about authenticity on social media
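The embed-first alternative in the first Opposed bullet (embed everything, cluster, then have the LLM label each cluster once) could be sketched as greedy threshold clustering over cosine similarity. The toy 2-D vectors and the 0.9 threshold below are illustrative assumptions; real embeddings would come from a model:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// clusterFirst greedily assigns each item to the first cluster whose
// representative (its first member) is within the similarity threshold,
// opening a new cluster otherwise.
func clusterFirst(embeddings [][]float64, threshold float64) []int {
	assign := make([]int, len(embeddings))
	var reps []int // index of each cluster's representative
	for i, e := range embeddings {
		assign[i] = -1
		for c, r := range reps {
			if cosine(e, embeddings[r]) >= threshold {
				assign[i] = c
				break
			}
		}
		if assign[i] == -1 {
			assign[i] = len(reps)
			reps = append(reps, i)
		}
	}
	return assign
}

func main() {
	// Toy 2-D "embeddings"; items 0-1 and 2-3 point in similar directions.
	embs := [][]float64{{1, 0}, {0.99, 0.1}, {0, 1}, {0.05, 0.99}}
	fmt.Println(clusterFirst(embs, 0.9)) // [0 0 1 1]
	// The LLM would then be asked to label one representative per cluster.
}
```

This batch variant is order-invariant across runs over the same corpus, which is the commenters' point; the article's DSU approach trades that property for the ability to handle items arriving one at a time.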