Why Embeddings Got Bigger—and Where Efficiency Pulls Them Next

Embedding sizes have grown from the 200–300 dimensions of the word2vec era to 768, 1536, and 4096+ as Transformer architectures, GPU hardware, and API-driven commoditization reshaped norms. Standardized tooling, benchmarks, and vector infrastructure broadened adoption, while inference-oriented models and new training techniques showed that many tasks can get by with fewer dimensions. The field is now balancing performance against storage and latency, suggesting a future emphasis on efficiency rather than unbounded growth in size.
Key Points
- Transformer architecture and GPU parallelism shaped embedding sizes, with BERT's 768-d choice (12 attention heads × 64 dimensions per head) becoming a de facto standard and influencing GPT-2 and CLIP.
- Standardization through HuggingFace and the rise of hosted APIs commoditized embeddings, pushing dimensions upward (e.g., OpenAI’s 1536-d) and making them broadly accessible.
- Inference-focused models like SBERT-style MiniLM variants (384-d) showed that smaller, efficient embeddings can perform well on sentence- and document-level tasks.
- Benchmarking (MTEB) and commoditized vector infrastructure (pgvector, Amazon S3 Vectors, Elasticsearch) lowered barriers to adoption and tuning, supporting a wide range of sizes (768–4096+).
- New techniques (Matryoshka representation learning, dimension truncation) suggest many tasks don't need full dimensionality, signaling a potential slowdown in size growth in favor of efficiency.
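The truncation idea in the last point can be sketched in a few lines: keep only the leading dimensions of a full embedding and re-normalize before computing cosine similarity. This is a minimal illustration with synthetic vectors, not output from a real matryoshka-trained model; the 1536 and 256 sizes are arbitrary stand-ins.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and re-normalize to unit length."""
    head = vec[:k]
    return head / np.linalg.norm(head)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
full = rng.normal(size=1536)                 # stand-in for a 1536-d embedding
query = full + 0.1 * rng.normal(size=1536)   # a slightly perturbed neighbor

sim_full = cosine(full, query)
sim_256 = cosine(truncate_embedding(full, 256), truncate_embedding(query, 256))
print(f"full-dim similarity: {sim_full:.3f}, 256-d similarity: {sim_256:.3f}")
```

With a matryoshka-trained model the leading dimensions carry most of the signal by construction, which is why truncated similarities stay close to the full-dimensional ones while storage and search cost drop roughly in proportion to k.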
Sentiment
The community is constructively skeptical. While commenters broadly accept that embeddings have grown alongside LLM scaling, they push back on the article's framing that this growth is inevitable or architecturally necessary. Several highlight that standalone embedding products are not bound to LLM architectures and that efficiency-focused models are already matching larger ones. The tone is respectful and technically informed rather than dismissive.
In Agreement
- Jevons paradox applies: when compute allows larger embeddings, they get used even with diminishing returns
- Width-first scaling means embedding dimensions naturally grow alongside model depth because narrow embeddings bottleneck deeper networks
- LLMs train their embedding layers jointly with the rest of the network, making larger embeddings available to downstream users essentially for free
- Real-world usage confirms a split: bigger embeddings for cloud API small-data tasks, smaller for cost-sensitive RAG
Opposed
- The article conflates embeddings within LLM architecture with standalone embedding products, which aren't intrinsically connected
- More dimensions don't necessarily mean better performance due to the curse of dimensionality
- Newer smaller models like EmbeddingGemma at 768 dimensions already beat larger models on benchmarks, undermining the narrative of inevitable growth
- The compressed range of cosine similarity scores in modern embeddings is misleading without understanding the shift from classification-oriented to retrieval-oriented training
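One face of the curse-of-dimensionality point above can be shown directly: for random unit vectors, pairwise cosine similarities concentrate near zero with spread on the order of 1/sqrt(d), so raw similarity scores become harder to separate as dimensions grow. This is a purely synthetic sketch (Gaussian vectors, arbitrary sample size), not a measurement of any real embedding model.

```python
import numpy as np

def pairwise_cosine_std(d: int, n: int = 200, seed: int = 0) -> float:
    """Std of off-diagonal cosine similarities among n random unit vectors in R^d."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # normalize rows to unit length
    sims = x @ x.T                                 # cosine = dot of unit vectors
    off_diag = sims[~np.eye(n, dtype=bool)]        # drop self-similarities (all 1.0)
    return float(off_diag.std())

for d in (32, 384, 1536):
    print(f"d={d}: similarity spread ~ {pairwise_cosine_std(d):.3f}")
```

The shrinking spread as d grows is one reason absolute similarity values from trained embedding models need calibration against the model's training objective rather than being read on a fixed scale.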