Scaling Local RAG: Lessons from Indexing 451 GB of Data

The author successfully built a local RAG system to query 451 GB of company technical documents using Ollama, LlamaIndex, and ChromaDB. By implementing aggressive file filtering, batch processing, and cloud-based document serving, they overcame significant memory and storage constraints. The project highlights that robust data preparation and high-performance hardware are essential for scaling AI tools to enterprise-level datasets.

Key Points

Aggressive data filtering and file-type exclusion are mandatory to prevent memory exhaustion and improve search relevance in large-scale RAG systems.
Scaling to hundreds of gigabytes requires moving from in-memory indexing to a dedicated vector database like ChromaDB with batch processing and checkpoints.
Hardware is a major bottleneck; embedding large datasets requires high-performance GPUs, as CPU processing is too slow for production timelines.
Decoupling the vector index from the source documents using cloud storage and SAS tokens allows for deployment on resource-constrained virtual machines.
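The filtering, batching, and checkpointing described in the points above can be sketched with the standard library alone. This is a minimal illustration, not the author's pipeline: the extension allow-list, the size cap, the checkpoint file format, and the names should_index, batched, index_with_checkpoint, and embed_and_store are all assumptions introduced here. In a real system, embed_and_store would be whatever callable embeds a batch and writes it to the vector database.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

# Hypothetical allow-list and size cap; the source does not state
# which file types or sizes were actually kept.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md"}
MAX_FILE_BYTES = 50 * 1024 * 1024


def should_index(path: Path, size_bytes: int) -> bool:
    """Keep only non-empty files with an allowed extension under the size cap."""
    return path.suffix.lower() in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_FILE_BYTES


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_with_checkpoint(
    paths: Iterable[str],
    embed_and_store: Callable[[List[str]], None],
    checkpoint_file: Path,
    batch_size: int = 100,
) -> int:
    """Process paths in batches, skipping any already recorded in the checkpoint.

    Returns the number of paths processed in this run.
    """
    done = set()
    if checkpoint_file.exists():
        done = set(json.loads(checkpoint_file.read_text()))

    pending = [p for p in paths if p not in done]
    processed = 0
    for batch in batched(pending, batch_size):
        embed_and_store(batch)  # assumed: embeds the batch and writes it to the vector store
        done.update(batch)
        # Persist progress after every batch so a crash loses at most one batch.
        checkpoint_file.write_text(json.dumps(sorted(done)))
        processed += len(batch)
    return processed
```

Writing the checkpoint after each batch trades a little extra disk I/O for the ability to resume a multi-day indexing run without starting over.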

Sentiment

Mostly positive and constructive. The community broadly validates the article's core lesson that RAG is hard and data quality is paramount, while offering experienced corrections on technical choices and pushing the conversation toward more sophisticated approaches like agentic RAG and hybrid retrieval.

In Agreement

Data ingestion and preprocessing quality is the decisive factor in RAG success, confirming the article's core thesis that simply dumping documents into a vector database produces poor results.
RAG is not obsolete despite growing context windows: enterprise data volumes far exceed any context window, and model performance degrades with excessive unfocused context.
Building a real RAG system requires serious engineering effort, including custom chunking strategies, metadata management, and domain-specific preprocessing that no off-the-shelf solution handles well.
The compute cost of indexing is trivial compared to the engineering time required to build a quality pipeline, which matches the article's experience of weeks of iteration.

Opposed

The article's misattribution of ChromaDB to Google raises credibility concerns and suggests the author may lack deep familiarity with the tools being used.
The article omits important techniques such as re-ranking, which can allow cheaper embedding models while maintaining quality, and hybrid search, which combines semantic and keyword approaches.
Some practitioners consider simple RAG as described in the article dead, arguing that agentic RAG with LLM-driven multi-round queries is fundamentally different and necessary for production use.
For many enterprise use cases, a good search and retrieval system may be more appropriate than a full RAG system, and deterministic database queries should be preferred when the data structure supports them.
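The hybrid search raised in the points above merges a semantic ranking with a keyword ranking. One common way to do that is reciprocal rank fusion (RRF), sketched below. The discussion does not say which fusion method commenters had in mind, so RRF is an assumption here, as are the function name, the document IDs, and the constant k = 60 (a conventional default, not a value from the source).

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Higher total score ranks first; ties are
    broken alphabetically by document ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a keyword search.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Because RRF uses only rank positions, it needs no score normalization between the two retrievers. Documents found by both searches, such as "a" and "b" here, rise above those found by only one.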