
Chunking Strategies for RAG

10. 09. 2025 · 4 min read · intermediate

Chunking is a key technique for successful RAG (Retrieval-Augmented Generation) systems: it determines how effectively we split text into smaller pieces. The right choice of chunking strategy fundamentally affects the quality of relevant-information retrieval and of the responses subsequently generated by language models.

Chunking Strategies for RAG: The Key to Effective Retrieval

Retrieval-Augmented Generation (RAG) has become the standard for building AI applications that need to work with large volumes of data. The success of a RAG system, however, depends fundamentally on the quality of the chunking strategy – the way we divide documents into smaller parts for embedding and subsequent retrieval.

Why Is Chunking Critical?

Embedding models have input-length limits (typically 512–8192 tokens), and their performance degrades as input length grows. Poorly designed chunking can lead to:

  • Loss of context between related information
  • Inefficient retrieval of relevant passages
  • Fragmentation of semantically related blocks
  • High latency and inference costs

Basic Chunking Strategies

Fixed-Size Chunking

The simplest approach divides text into fixed-size blocks with optional overlap:

def fixed_size_chunking(text, chunk_size=500, overlap=50):
    """Split text into fixed-size blocks with optional overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting a redundant overlap-only tail
        start = end - overlap

    return chunks

# Usage
text = "Your long document..."
chunks = fixed_size_chunking(text, chunk_size=1000, overlap=100)

Advantages: simplicity and a predictable chunk size. Disadvantages: may split sentences or paragraphs in the middle, and ignores document structure.
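A common middle ground between fixed-size and semantic chunking is recursive splitting: try to split on paragraph breaks first, then sentence boundaries, then single spaces, and only hard-cut as a last resort. A minimal dependency-free sketch – the separator order and sizes are illustrative assumptions, not part of the strategies above:

```python
def recursive_split(text, max_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that still yields pieces <= max_size."""
    if len(text) <= max_size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_size:
                        # A single piece can still be too long: recurse with finer separators
                        chunks.extend(recursive_split(part, max_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator present at all: fall back to a hard cut
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]
```

Because paragraph breaks are tried first, chunks tend to end at natural boundaries, which mitigates the mid-sentence splits of plain fixed-size chunking.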

Semantic Chunking

A more advanced approach uses NLP techniques to preserve semantic integrity:

import spacy
from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticChunker:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm")  # English model
        self.embedding_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

    def chunk_by_similarity(self, text, similarity_threshold=0.7, max_chunk_size=1000):
        doc = self.nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents]

        if len(sentences) <= 1:
            return [text]

        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            # Calculate similarity with previous sentence
            similarity = np.dot(embeddings[i-1], embeddings[i]) / (
                np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])
            )

            # Check chunk size
            current_text = " ".join(current_chunk + [sentences[i]])

            if similarity > similarity_threshold and len(current_text) < max_chunk_size:
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
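The decision at the heart of `chunk_by_similarity` – append the next sentence if its embedding points in nearly the same direction as the previous one – can be illustrated in isolation. A pure-Python cosine similarity on hypothetical 3-dimensional vectors (real sentence embeddings have hundreds of dimensions; the vectors here are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical sentence embeddings (illustration only)
prev_sentence = [0.9, 0.1, 0.2]
next_sentence = [0.8, 0.2, 0.1]
unrelated     = [0.1, 0.9, 0.3]

same_topic = cosine_similarity(prev_sentence, next_sentence) > 0.7  # keep in current chunk
new_topic  = cosine_similarity(prev_sentence, unrelated) > 0.7      # start a new chunk
```

The 0.7 threshold mirrors the `similarity_threshold` default above; in practice it should be tuned per embedding model, since different models produce differently distributed similarity scores.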

Structure-Aware Chunking

For structured documents (HTML, Markdown, PDF), it’s effective to respect content hierarchy:

from bs4 import BeautifulSoup

class StructureChunker:
    def __init__(self, max_chunk_size=1000):
        self.max_chunk_size = max_chunk_size

    def chunk_html(self, html_content):
        soup = BeautifulSoup(html_content, 'html.parser')
        chunks = []

        # Split by main sections
        sections = soup.find_all(['h1', 'h2', 'h3', 'section', 'article'])

        for section in sections:
            section_text = self._extract_section_content(section)

            if len(section_text) > self.max_chunk_size:
                # Split oversized sections further
                sub_chunks = self._split_large_section(section_text, section)
                chunks.extend(sub_chunks)
            else:
                chunks.append({
                    'text': section_text,
                    'metadata': {
                        'tag': section.name,
                        'heading': section.get_text()[:100] if section.name.startswith('h') else None
                    }
                })

        return chunks

    def _split_large_section(self, text, section):
        # Fixed-size fallback that keeps the same dict shape as chunk_html
        return [{
            'text': text[i:i + self.max_chunk_size],
            'metadata': {'tag': section.name, 'heading': None}
        } for i in range(0, len(text), self.max_chunk_size)]

    def _extract_section_content(self, element):
        # Collect the text of all following siblings until the next heading
        content = []
        current = element

        while current and current.next_sibling:
            current = current.next_sibling
            if hasattr(current, 'name') and current.name and current.name.startswith('h'):
                break
            if hasattr(current, 'get_text'):
                content.append(current.get_text())

        return " ".join(content).strip()
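The same idea applies to Markdown, which the prose above mentions but the HTML example does not cover: headings delimit sections. A minimal standard-library sketch – the heading regex and the metadata shape are assumptions mirroring the HTML version:

```python
import re

def chunk_markdown(md_text, max_chunk_size=1000):
    """Split Markdown into sections at ATX headings, keeping the heading as metadata."""
    chunks = []
    # Zero-width split *before* every heading line (#, ##, ... up to ######)
    sections = re.split(r'(?m)^(?=#{1,6}\s)', md_text)

    for section in sections:
        if not section.strip():
            continue
        first_line = section.split('\n', 1)[0]
        heading = first_line if first_line.startswith('#') else None
        text = section.strip()
        # Fixed-size fallback for oversized sections
        for i in range(0, len(text), max_chunk_size):
            chunks.append({
                'text': text[i:i + max_chunk_size],
                'metadata': {'heading': heading},
            })
    return chunks
```

Keeping the heading in the metadata lets the retriever filter or re-rank by section, and it can be prepended to the chunk text at embedding time for extra context.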

Hybrid Approaches

In practice, we achieve the best results by combining multiple strategies:

class HybridChunker:
    def __init__(self):
        self.semantic_chunker = SemanticChunker()
        self.structure_chunker = StructureChunker()

    def chunk_document(self, content, doc_type='text'):
        if doc_type == 'html':
            # First structure-aware chunking
            structural_chunks = self.structure_chunker.chunk_html(content)
            final_chunks = []

            for chunk in structural_chunks:
                # Then semantic chunking for larger blocks
                if len(chunk['text']) > 1200:
                    semantic_chunks = self.semantic_chunker.chunk_by_similarity(
                        chunk['text'], max_chunk_size=1000
                    )
                    for i, sem_chunk in enumerate(semantic_chunks):
                        final_chunks.append({
                            'text': sem_chunk,
                            'metadata': {
                                **chunk['metadata'],
                                'sub_chunk': i
                            }
                        })
                else:
                    final_chunks.append(chunk)

            return final_chunks

        else:
            # For plain text use only semantic chunking
            return self.semantic_chunker.chunk_by_similarity(content)

Optimizing for Different Content Types

Different document types require specific approaches:

  • Technical documentation: Respect sections, code blocks, and hierarchy
  • Legal documents: Preserve paragraph numbering and references
  • Scientific articles: Keep together abstracts, methodologies, and conclusions
  • Chatbots: Short chunks with high overlap for precise answers
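The guidance above can be encoded as a small configuration table. The specific parameter values here are illustrative assumptions to tune against your own retrieval metrics, not fixed rules:

```python
# Illustrative starting points per content type (values are assumptions, not recommendations)
CHUNKING_PROFILES = {
    'technical_docs': {'strategy': 'structure', 'max_chunk_size': 1200, 'overlap': 0},
    'legal':          {'strategy': 'structure', 'max_chunk_size': 800,  'overlap': 100},
    'scientific':     {'strategy': 'semantic',  'max_chunk_size': 1000, 'overlap': 50},
    'chatbot_kb':     {'strategy': 'fixed',     'max_chunk_size': 300,  'overlap': 100},
}

def profile_for(doc_type):
    # Fall back to a conservative default for unknown document types
    return CHUNKING_PROFILES.get(
        doc_type,
        {'strategy': 'fixed', 'max_chunk_size': 500, 'overlap': 50},
    )
```

Centralizing these parameters makes them easy to A/B test and to re-tune after an embedding-model change, as recommended in the production tips below.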

Evaluating the Chunking Strategy

To measure the quality of a chunking strategy, we track metrics such as:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def evaluate_chunking_strategy(chunks, queries, ground_truth, embedding_model):
    # Embed chunks and queries with the same model used at retrieval time
    chunk_embeddings = embedding_model.encode([c['text'] for c in chunks])
    query_embeddings = embedding_model.encode(queries)

    metrics = {
        'avg_chunk_size': np.mean([len(c['text']) for c in chunks]),
        'chunk_size_variance': np.var([len(c['text']) for c in chunks]),
        'retrieval_accuracy': 0
    }

    # Top-1 retrieval accuracy: is the best-matching chunk in the
    # ground-truth set for each query?
    correct_retrievals = 0
    for i, query in enumerate(queries):
        similarities = cosine_similarity([query_embeddings[i]], chunk_embeddings)[0]
        top_chunk_idx = np.argmax(similarities)

        if chunks[top_chunk_idx]['id'] in ground_truth[i]:
            correct_retrievals += 1

    metrics['retrieval_accuracy'] = correct_retrievals / len(queries)
    return metrics
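Top-1 accuracy is strict; in practice a hit anywhere in the top k retrieved chunks is often what matters. A dependency-free recall@k sketch – the chunk IDs below are hypothetical:

```python
def recall_at_k(ranked_chunk_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear among the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_chunk_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical retrieval result for one query
ranked = ['c7', 'c2', 'c9', 'c1', 'c4']   # IDs ordered by similarity, best first
relevant = {'c2', 'c4', 'c8'}             # ground-truth chunks for the query
score = recall_at_k(ranked, relevant, k=5)  # 2 of 3 relevant chunks retrieved
```

Averaging recall@k over a query set gives a single number for comparing chunking strategies in an A/B test.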

Production Tips

For production use, we recommend:

  • Caching embeddings for frequently used chunks
  • Asynchronous processing for large documents
  • Monitoring metrics like chunk retrieval rate and response relevance
  • A/B testing different chunking strategies
  • Periodic re-chunking when changing embedding models

# Async chunking for production
import asyncio
from concurrent.futures import ThreadPoolExecutor

class ProductionChunker:
    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=4)
        self.chunker = HybridChunker()
        self.cache = {}  # in production, prefer a bounded cache (e.g. LRU) or an external store

    async def chunk_document_async(self, doc_id, content, doc_type='text'):
        if doc_id in self.cache:
            return self.cache[doc_id]

        # Run CPU-bound chunking off the event loop
        loop = asyncio.get_running_loop()
        chunks = await loop.run_in_executor(
            self.executor,
            self.chunker.chunk_document,
            content,
            doc_type
        )

        self.cache[doc_id] = chunks
        return chunks

Summary

A high-quality chunking strategy is the foundation of every successful RAG system. Combining semantic awareness, structural integrity, and optimization for specific use cases yields significantly better results than simple fixed-size chunking. The time invested in designing and testing the chunking pipeline pays off in more relevant responses and a better user experience. Don't forget to regularly measure and optimize your strategy against real-world data.

Tags: chunking, rag, nlp

CORE SYSTEMS Team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.