Large language models can generate text, but they have a fundamental problem — they hallucinate. They invent facts, cite nonexistent sources, and assert nonsense with confidence. RAG (Retrieval Augmented Generation) solves this problem elegantly: instead of relying on the model’s memory, we feed it relevant data from our own sources in real time. Here’s our experience from enterprise implementations.
Why an LLM Alone Isn’t Enough
GPT-4, Claude, Gemini — all these models have a knowledge cutoff. They know nothing about your internal documents, current pricing, or company processes. Fine-tuning is expensive, slow, and must be repeated with every data change. RAG offers an alternative: the model remains general-purpose, but with each query it receives context from your knowledge base.
In practice, this means a customer chatbot doesn’t answer “I think that…” but “according to document XY from January 2024, the following applies…” And that’s a fundamental difference for enterprise deployment, where you can’t afford hallucinations.
RAG Pipeline Architecture
A basic RAG pipeline has three phases: indexing, retrieval, and generation.
Indexing: Documents (PDF, Word, Confluence, databases) are split into chunks (typically 500–1,000 tokens), converted into embedding vectors using a model (OpenAI ada-002, Cohere embed, or open-source alternatives like nomic-embed), and stored in a vector database.
Retrieval: The user query is converted into an embedding using the same model. The vector database finds the most similar chunks (typically top-k = 5–10). Optionally, a re-ranking model is added for more precise ordering.
Generation: The retrieved chunks are inserted into the prompt as context. The LLM generates an answer based on this context with references to source documents.
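The three phases can be sketched end to end in a few lines. This is a minimal illustration, not a production implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model (ada-002, Cohere embed, nomic-embed), and the document names are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. A real pipeline would call an
    embedding model here; only the pipeline shape matters for this sketch."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# 1) Indexing: chunks stored as vectors alongside their source reference
docs = [
    ("pricing.pdf", "standard licence costs 100 EUR per seat per month"),
    ("sla.pdf", "support tickets are answered within 4 business hours"),
]
index = [(src, text, embed(text)) for src, text in docs]

def retrieve(query, k=1):
    # 2) Retrieval: embed the query with the SAME model, rank by similarity
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[2]), reverse=True)
    return ranked[:k]

def build_prompt(query):
    # 3) Generation: retrieved chunks become the LLM's context,
    #    with source references the model can cite
    hits = retrieve(query)
    context = "\n".join(f"[{src}] {text}" for src, text, _ in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("how much does a licence cost?"))
```

Note that the query is embedded with the same model as the documents; mixing embedding models between indexing and retrieval silently breaks similarity search.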
Choosing a Vector Database
The market in 2024 offers a surprising number of options. Specialized vector databases include Pinecone (managed, easy start), Weaviate (open-source, hybrid search), Qdrant (Rust-based, performance-focused), and Milvus (enterprise scale). But traditional databases are also adding vector support — pgvector for PostgreSQL, Azure AI Search, Elasticsearch kNN.
Our recommendation: if you already have PostgreSQL, start with pgvector. For larger volumes (millions of documents) and advanced filters, go with a dedicated solution. Managed services (Pinecone, Azure AI Search) are ideal if you don’t want to manage infrastructure.
Chunking — The Art of Splitting a Document
RAG quality stands or falls with chunking. Chunks that are too small lose context. Chunks that are too large dilute the relevant information with noise. In practice, a combination works: semantic chunking (splitting along semantic boundaries — headings, paragraphs) plus overlap (10–20% overlap between chunks to preserve context).
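The overlap mechanic is easy to get wrong at the document tail. A minimal fixed-size chunker with ~15% overlap might look like this (operating on an already-tokenized document; real pipelines typically split on semantic boundaries first and fall back to a window like this):

```python
def chunk_tokens(tokens, size=500, overlap=75):
    """Fixed-size chunking with overlap. Each chunk repeats the last
    `overlap` tokens of its predecessor so context survives the split."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # tail already covered; avoid a redundant trailing chunk
    return chunks

tokens = list(range(1200))  # stand-in for a tokenized document
chunks = chunk_tokens(tokens, size=500, overlap=75)
```

With 1,200 tokens this yields three chunks; consecutive chunks share exactly 75 tokens, and the final token of the document is never dropped.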
For structured documents (contracts, legislation), we use hierarchical chunking — metadata about the section, chapter, and document are added to each chunk. During retrieval, we can then filter: “search only in contracts from 2024” or “only in the Pricing section.”
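The metadata pre-filter described above is conceptually just a predicate applied before (or alongside) vector search. A sketch with an illustrative schema (the field names `doc`, `year`, `section` are examples, not a fixed standard):

```python
# Each chunk inherits metadata from its place in the document hierarchy.
chunks = [
    {"text": "...", "doc": "contract_A", "year": 2024, "section": "Pricing"},
    {"text": "...", "doc": "contract_B", "year": 2023, "section": "Pricing"},
    {"text": "...", "doc": "contract_A", "year": 2024, "section": "Liability"},
]

def filter_chunks(chunks, **criteria):
    """Metadata pre-filter; only the surviving chunks enter similarity
    search, so 'contracts from 2024 only' is enforced structurally."""
    return [c for c in chunks
            if all(c.get(k) == v for k, v in criteria.items())]

# "search only in contracts from 2024, Pricing section"
hits = filter_chunks(chunks, year=2024, section="Pricing")
```

In a real vector database the same idea is expressed as a filter clause on the query (pgvector `WHERE`, Qdrant/Weaviate payload filters) rather than a Python loop.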
Advanced Techniques: Beyond Naive RAG
Basic RAG is a good start, but enterprise deployment requires more:
- Hybrid search: Combining vector (semantic) and keyword (BM25) search. Semantics finds synonyms; keywords find exact terms.
- Query transformation: Users ask vaguely. HyDE (Hypothetical Document Embeddings) first generates a hypothetical answer and uses it for retrieval.
- Multi-step retrieval: Complex questions are decomposed into sub-questions, each evaluated independently, with results aggregated.
- Re-ranking: A cross-encoder model (Cohere Rerank, BGE Reranker) reorders results by actual relevance to the query.
- Agentic RAG: The LLM decides whether it even needs retrieval, which source to use, and whether the answer quality is sufficient.
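To make the hybrid-search idea concrete: one common way to merge a vector ranking with a BM25 ranking is Reciprocal Rank Fusion (RRF), which needs only the two ranked lists, not their raw scores. A sketch with invented document ids:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of document ids.
    k=60 is the commonly used default; it damps the dominance of
    top-ranked positions in any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc2"]   # semantic (embedding) ranking
keyword_hits = ["doc1", "doc4", "doc3"]  # keyword (BM25) ranking
fused = rrf([vector_hits, keyword_hits])
```

A document that appears high in both lists (here `doc1`) wins over one that tops only a single list, which is exactly the behavior you want from hybrid search before an optional re-ranking pass.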
Evaluation — How to Measure Quality
RAG without evaluation is like code without tests — it works until it doesn’t. We measure three dimensions: retrieval quality (did we find the right documents?), generation quality (is the answer correct and grounded?), and end-to-end (is the user satisfied?).
Tools like RAGAS, DeepEval, or custom eval sets with golden questions enable automated testing. Key metrics: faithfulness (the answer matches the context), answer relevancy (the answer addresses the question), and context precision (the retrieved documents are relevant).
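As an illustration of what such a metric computes, here is a deliberately simplified context precision over a golden question set: the fraction of retrieved chunks that are actually relevant. (RAGAS computes an LLM-judged, rank-weighted variant; this toy version assumes the golden set labels relevant chunk ids directly.)

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that appear in the golden relevant set.
    Simplified: unweighted by rank, exact-match on chunk ids."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

golden = {"q1": {"chunk_a", "chunk_b"}}          # hand-labeled golden set
retrieved = {"q1": ["chunk_a", "chunk_c", "chunk_b", "chunk_d"]}
score = context_precision(retrieved["q1"], golden["q1"])  # 2 of 4 -> 0.5
```

Even a crude metric like this, run automatically on every pipeline change, catches regressions that manual spot-checking misses.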
Security and Governance
In enterprise environments, it’s critical to address access rights. If a document has a “confidential” classification, RAG must not return its content to a user without appropriate authorization. We implement metadata-based filtering at the vector database level — each chunk carries ACL metadata, and during retrieval, filtering is applied based on user identity.
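The essential design point is that the ACL filter runs before retrieval, not after generation. A minimal sketch, assuming group-based ACL metadata on each chunk (the group names and documents are invented):

```python
chunks = [
    {"text": "public price list", "acl": {"everyone"}},
    {"text": "merger plans", "acl": {"board", "legal"}},
]

def visible_chunks(chunks, user_groups):
    """ACL filter applied BEFORE similarity search: a chunk the user may
    not see must never enter the prompt, because anything in the prompt
    can leak into the generated answer."""
    allowed = user_groups | {"everyone"}
    return [c for c in chunks if c["acl"] & allowed]

support_view = visible_chunks(chunks, {"support"})  # public chunk only
legal_view = visible_chunks(chunks, {"legal"})      # both chunks
```

Most vector databases support this natively as a payload/metadata filter on the query, so the restriction is enforced inside the search engine rather than in application code.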
RAG as the Foundation of Enterprise AI
RAG isn’t a silver bullet, but it’s the most practical way to get LLMs into production with company data. Start simple — pgvector, basic chunking, one data source. Iterate based on evaluation. And remember: 80% of RAG success lies in data quality and chunking, not in model selection.
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us