
RAG — Retrieval Augmented Generation in Practice

18. 03. 2024 · 4 min read · ai

Large language models can generate text, but they have a fundamental problem — they hallucinate. They invent facts, cite nonexistent sources, and assert nonsense with confidence. RAG (Retrieval Augmented Generation) solves this problem elegantly: instead of relying on the model’s memory, we feed it relevant data from our own sources in real time. Here’s our experience from enterprise implementations.

Why an LLM Alone Isn’t Enough

GPT-4, Claude, Gemini — all these models have a knowledge cutoff. They know nothing about your internal documents, current pricing, or company processes. Fine-tuning is expensive, slow, and must be repeated with every data change. RAG offers an alternative: the model remains general-purpose, but with each query it receives context from your knowledge base.

In practice, this means a customer chatbot doesn’t answer “I think that…” but “according to document XY from January 2024, the following applies…” And that’s a fundamental difference for enterprise deployment, where you can’t afford hallucinations.

RAG Pipeline Architecture

A basic RAG pipeline has three phases: indexing, retrieval, and generation.

Indexing: Documents (PDF, Word, Confluence, databases) are split into chunks (typically 500–1,000 tokens), converted into embedding vectors using a model (OpenAI ada-002, Cohere embed, or open-source alternatives like nomic-embed), and stored in a vector database.

Retrieval: The user query is converted into an embedding using the same model. The vector database finds the most similar chunks (typically top-k = 5–10). Optionally, a re-ranking model is added for more precise ordering.

Generation: The retrieved chunks are inserted into the prompt as context. The LLM generates an answer based on this context with references to source documents.
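The three phases above can be sketched end to end in a few dozen lines. This is a toy, assuming nothing beyond the standard library: the hashing "embedding" stands in for a real model (ada-002 and friends), the in-memory list stands in for a vector database, and the final prompt assembly stands in for an actual LLM call. Document texts and names are illustrative.

```python
import math
import zlib
from collections import Counter

DIM = 256  # toy embedding dimension; real models use e.g. 1536 (ada-002)

def embed(text: str) -> list[float]:
    """Toy bag-of-words hashing embedding, L2-normalized.
    Stands in for a real embedding model API call."""
    vec = [0.0] * DIM
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is cosine similarity
    return sum(x * y for x, y in zip(a, b))

# Indexing: chunk documents and store (chunk, embedding) pairs
docs = [
    "Pricing: the enterprise plan costs 500 EUR per month.",
    "Support: tickets are answered within 24 hours on business days.",
]
index = [(d, embed(d)) for d in docs]

# Retrieval: embed the query with the SAME model, take top-k similar chunks
def retrieve(query: str, k: int = 5) -> list[str]:
    q = embed(query)
    scored = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# Generation: the retrieved chunks become the context of the LLM prompt
context = retrieve("How much does the enterprise plan cost?", k=1)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: ..."
```

In production, `embed` becomes a call to the embedding API, `index` becomes the vector database, and `prompt` goes to the LLM together with an instruction to cite the source documents.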

Choosing a Vector Database

The market in 2024 offers a surprising number of options: specialized vector DBs like Pinecone (managed, easy start), Weaviate (open-source, hybrid search), Qdrant (Rust, performance), or Milvus (enterprise scale). But traditional databases are also adding vector support — pgvector for PostgreSQL, Azure AI Search, Elasticsearch kNN.

Our recommendation: if you already have PostgreSQL, start with pgvector. For larger volumes (millions of documents) and advanced filters, go with a dedicated solution. Managed services (Pinecone, Azure AI Search) are ideal if you don’t want to manage infrastructure.
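With pgvector, the whole setup fits in a few SQL statements. A sketch assuming pgvector 0.5+ and 1536-dimensional embeddings (ada-002); table and column names are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    metadata  jsonb,
    embedding vector(1536)
);

-- HNSW index for approximate nearest-neighbor search with cosine distance
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Top-5 chunks most similar to a query embedding passed as $1
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM chunks
ORDER BY embedding <=> $1
LIMIT 5;
```

The `<=>` operator is pgvector's cosine distance; the `jsonb` metadata column is what later enables the filtered retrieval discussed below.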

Chunking — The Art of Splitting a Document

RAG quality stands or falls on chunking. Chunks that are too small lose context. Chunks that are too large dilute the relevant information with noise. In practice, a combination works: semantic chunking (splitting along semantic boundaries — headings, paragraphs) plus overlap (10–20% overlap between chunks to preserve context).
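A minimal sketch of fixed-size chunking with overlap. For simplicity it counts characters; production code would count tokens (e.g. with a tokenizer matching the embedding model):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the last
    `overlap` characters of the previous one, preserving context across cuts."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

A 500-character text with `chunk_size=200` and `overlap=40` yields three chunks covering positions 0–200, 160–360, and 320–500. Semantic chunking replaces the fixed `step` with splits at headings and paragraph boundaries, but the overlap idea stays the same.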

For structured documents (contracts, legislation), we use hierarchical chunking — metadata about the section, chapter, and document are added to each chunk. During retrieval, we can then filter: “search only in contracts from 2024” or “only in the Pricing section.”
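Hierarchical chunking reduces, in code, to attaching metadata at indexing time and pre-filtering candidates before the similarity search. A sketch with illustrative field names and contents:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    # Hierarchical metadata attached at indexing time (names illustrative)
    metadata: dict = field(default_factory=dict)

chunks = [
    Chunk("The monthly fee is 500 EUR.",
          {"doc_type": "contract", "year": 2024, "section": "Pricing"}),
    Chunk("Liability is capped at the annual fee.",
          {"doc_type": "contract", "year": 2023, "section": "Liability"}),
]

def filtered(chunks: list[Chunk], **conditions) -> list[Chunk]:
    """Keep only chunks whose metadata matches every condition,
    e.g. filtered(chunks, doc_type='contract', year=2024)."""
    return [c for c in chunks
            if all(c.metadata.get(k) == v for k, v in conditions.items())]
```

Vector databases expose the same idea natively (Qdrant payload filters, pgvector `WHERE metadata @> ...`), so the filter runs inside the index rather than in application code.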

Advanced Techniques: Beyond Naive RAG

Basic RAG is a good start, but enterprise deployment requires more:

  • Hybrid search: Combining vector (semantic) and keyword (BM25) search. Semantics finds synonyms; keywords find exact terms.
  • Query transformation: Users ask vaguely. HyDE (Hypothetical Document Embeddings) first generates a hypothetical answer and uses it for retrieval.
  • Multi-step retrieval: Complex questions are decomposed into sub-questions, each evaluated independently, with results aggregated.
  • Re-ranking: A cross-encoder model (Cohere Rerank, BGE Reranker) reorders results by actual relevance to the query.
  • Agentic RAG: The LLM decides whether it even needs retrieval, which source to use, and whether the answer quality is sufficient.
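Hybrid search needs a way to merge the two ranked lists. Reciprocal Rank Fusion (RRF) is a common, score-free choice; a minimal sketch with illustrative document ids:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked result lists.
    Each list is ordered best-first; a document scores 1/(k + rank + 1)
    per list it appears in, and scores are summed across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits  = ["doc_a", "doc_b", "doc_c"]   # semantic search order
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # BM25 order
fused = rrf([vector_hits, keyword_hits])     # doc_b first: high in both lists
```

RRF needs no score normalization between BM25 and cosine similarity, which is why it is a popular default; the fused list can then go to a cross-encoder re-ranker for the final ordering.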

Evaluation — How to Measure Quality

RAG without evaluation is like code without tests — it works until it doesn’t. We measure three dimensions: retrieval quality (did we find the right documents?), generation quality (is the answer correct and grounded?), and end-to-end (is the user satisfied?).

Tools like RAGAS, DeepEval, or custom eval sets with golden questions enable automated testing. Key metrics: faithfulness (the answer matches the context), answer relevancy (the answer addresses the question), and context precision (the retrieved documents are relevant).
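The retrieval-side metrics are simple enough to compute by hand against a golden set. A toy context-precision sketch (the real RAGAS metric additionally uses an LLM judge to decide relevance; here the golden set plays that role):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunk ids that appear in the golden
    relevant set -- i.e. precision@k for the retriever."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the golden relevant chunks that were actually retrieved."""
    if not relevant:
        return 1.0
    return sum(1 for c in relevant if c in retrieved) / len(relevant)
```

Run these over every golden question on each pipeline change and you catch retrieval regressions before users do; faithfulness and answer relevancy need an LLM judge and are best left to a framework.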

Security and Governance

In enterprise environments, it’s critical to address access rights. If a document has a “confidential” classification, RAG must not return its content to a user without appropriate authorization. We implement metadata-based filtering at the vector database level — each chunk carries ACL metadata, and during retrieval, filtering is applied based on user identity.
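The ACL check is just another metadata filter, applied before similarity ranking so that restricted content never enters the candidate set. A sketch with illustrative field names:

```python
def acl_filter(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks whose allowed groups intersect the user's groups.
    Runs as a pre-filter, before any similarity ranking."""
    return [c for c in chunks if set(c["allowed_groups"]) & user_groups]

index = [
    {"text": "Public price list: enterprise plan 500 EUR/month.",
     "allowed_groups": ["everyone"]},
    {"text": "Confidential board memo on the upcoming acquisition.",
     "allowed_groups": ["board"]},
]

# A regular employee sees only the public chunk; a board member sees both
visible = acl_filter(index, user_groups={"everyone"})
```

Crucially, the filter must run inside the retrieval step (most vector DBs support it natively), not as post-processing on the LLM answer — once a confidential chunk reaches the prompt, it can leak into the output.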

RAG as the Foundation of Enterprise AI

RAG isn’t a silver bullet, but it’s the most practical way to get LLMs into production with company data. Start simple — pgvector, basic chunking, one data source. Iterate based on evaluation. And remember: 80% of RAG success lies in data quality and chunking, not in model selection.

Tags: rag · llm · vector db · enterprise ai

CORE SYSTEMS

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.

Need help with implementation?

Our experts can help with design, implementation, and operations. From architecture to production.

Contact us