Canonicalizing LLM Labels with Embeddings and DSU

The article shows how to convert LLMs’ inconsistent labels into consistent ones by embedding content and labels, using vector search for cache hits, and clustering labels with DSU. Benchmarks demonstrate fewer unique labels, rising cache-hit rates, and costs that begin ~15% higher but become ~10x cheaper by 10K items, with latency also improving over time. A Golang package (consistent-classifier) and benchmark scripts are provided.

Key Points

LLMs are lexically inconsistent but semantically consistent; embeddings and clustering can reconcile label variability.
Pipeline: embed → similarity search cache → on hit return canonical root; on miss call LLM, then embed and cluster labels with DSU.
Results on 10K tweets: ~6,520 unique labels (LLM-only) vs ~1,381 (vectorized)—about 5x fewer via clustering.
Cache hit rate accelerates (modeled to ~95% asymptote), driving costs down from ~15% higher initially to ~10x cheaper by 10K items.
Latency is initially slower (~130%) due to embedding and search, but improves as LLM calls are increasingly avoided.

Sentiment

Overall, the Hacker News sentiment is positive and appreciative of the article's innovative solution. While constructive criticism is provided regarding potential biases and alternative approaches, the core idea is well-received, and the discussion focuses on refining and understanding the trade-offs of the proposed method. The OP actively engages, clarifying and defending their design choices.

In Agreement

The core concept of open-set labeling (or inverse classification) using LLMs is a powerful "superpower" for finding semantic clusters in unstructured data.
The dynamic nature of the proposed solution is highly beneficial and cost-effective for large, evolving datasets and categories, as it avoids expensive recalculations inherent in static classification systems.
Explicit LLM-generated tags, while perhaps not strictly necessary for simple classification, are crucial and powerful for downstream generative AI tasks like content creation, topic exploration, and prompt generation.
Similar approaches involving embedding comparison and clustering for dynamic categorization have been successfully implemented by others, reinforcing the validity of the general idea.

Opposed

The method may be sensitive to the order of data input and the initial labels generated by the LLM, potentially leading to order bias and inconsistent results if run multiple times (failing the "bootstrapping test").
The accuracy of the classifier is inherently dependent on the quality and accuracy of the retrieval step (finding relevant candidates).
A less order-biased alternative would be to vectorize all data upfront, perform clustering, and then use the LLM to label the resulting clusters.
For cases with a fixed, pre-defined set of categories, a simpler approach of directly comparing item embeddings to category embeddings might suffice or even be more reliable than LLM-generated tags, and sometimes direct LLM classification can outperform embedding-based approaches for specific fixed-label tasks.
The approach might be less reliable than using a carefully curated set of representative examples for each targeted class.