Why Embeddings Got Bigger—and Where Efficiency Pulls Them Next

Embedding sizes have grown from the 200–300 dimensions of the word2vec era to 768, 1536, and 4096+ as Transformer architectures, GPU hardware, and API-driven commoditization reshaped norms. Standardized tooling, benchmarks, and vector infrastructure broadened adoption, while inference-oriented models and new training techniques showed that many tasks can get by with fewer dimensions. The field is now balancing performance against storage and latency, suggesting a future emphasis on efficiency rather than unbounded growth in size.
Key Points
- Transformer architecture and GPU parallelism shaped embedding sizes, with BERT's 768-d choice (12 attention heads × 64 dimensions per head) becoming a de facto standard and influencing GPT-2 and CLIP.
- Standardization through HuggingFace and the rise of hosted APIs commoditized embeddings, pushing dimensions upward (e.g., OpenAI’s 1536-d) and making them broadly accessible.
- Inference-focused models like SBERT-style MiniLM variants (384-d) showed that smaller, efficient embeddings can perform well on sentence- and document-level tasks.
- Benchmarking (MTEB) and commoditized vector infrastructure (pgvector, Amazon S3 Vectors, Elasticsearch) lowered barriers to adoption and tuning, supporting a wide range of sizes (768–4096+).
- New techniques (Matryoshka representation learning, dimension truncation) suggest many tasks don't need full dimensionality, signaling a potential slowdown in size growth in favor of efficiency.
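The truncation idea in the last point can be sketched in a few lines: keep only the leading dimensions of a full embedding and re-normalize before computing cosine similarity. This is a minimal illustration with synthetic vectors, not output from a real matryoshka-trained model; the 1536 and 256 sizes are arbitrary stand-ins.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and re-normalize to unit length."""
    head = vec[:k]
    return head / np.linalg.norm(head)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
full = rng.normal(size=1536)                 # stand-in for a 1536-d embedding
query = full + 0.1 * rng.normal(size=1536)   # a slightly perturbed neighbor

sim_full = cosine(full, query)
sim_256 = cosine(truncate_embedding(full, 256), truncate_embedding(query, 256))
print(f"full-dim similarity: {sim_full:.3f}, 256-d similarity: {sim_256:.3f}")
```

With a matryoshka-trained model the leading dimensions carry most of the signal by construction, which is why truncated similarities stay close to the full-dimensional ones while storage and search cost drop roughly in proportion to k.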
Sentiment
The community is constructively skeptical. While commenters broadly accept that embeddings have grown alongside LLM scaling, they push back on the article's framing that this growth is inevitable or architecturally necessary. Several highlight that standalone embedding products are not bound to LLM architectures and that efficiency-focused models are already matching larger ones. The tone is respectful and technically informed rather than dismissive.
In Agreement
- Jevons paradox applies: when compute allows larger embeddings, they get used even with diminishing returns
- Width-first scaling means embedding dimensions naturally grow alongside model depth because narrow embeddings bottleneck deeper networks
- LLMs train their embedding layers jointly with the rest of the network, making larger embeddings available to downstream users essentially for free
- Real-world usage confirms a split: bigger embeddings for cloud API small-data tasks, smaller for cost-sensitive RAG
Opposed
- The article conflates embeddings within LLM architecture with standalone embedding products, which aren't intrinsically connected
- More dimensions don't necessarily mean better performance due to the curse of dimensionality
- Newer smaller models like EmbeddingGemma at 768 dimensions already beat larger models on benchmarks, undermining the narrative of inevitable growth
- The compressed range of cosine similarity scores in modern embeddings is misleading without understanding the shift from classification-oriented to retrieval-oriented training
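One face of the curse-of-dimensionality point above can be shown directly: for random unit vectors, pairwise cosine similarities concentrate near zero with spread on the order of 1/sqrt(d), so raw similarity scores become harder to separate as dimensions grow. This is a purely synthetic sketch (Gaussian vectors, arbitrary sample size), not a measurement of any real embedding model.

```python
import numpy as np

def pairwise_cosine_std(d: int, n: int = 200, seed: int = 0) -> float:
    """Std of off-diagonal cosine similarities among n random unit vectors in R^d."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # normalize rows to unit length
    sims = x @ x.T                                 # cosine = dot of unit vectors
    off_diag = sims[~np.eye(n, dtype=bool)]        # drop self-similarities (all 1.0)
    return float(off_diag.std())

for d in (32, 384, 1536):
    print(f"d={d}: similarity spread ~ {pairwise_cosine_std(d):.3f}")
```

The shrinking spread as d grows is one reason absolute similarity values from trained embedding models need calibration against the model's training objective rather than being read on a fixed scale.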