Learn effective chunking strategies including fixed-size, semantic, recursive, and sentence-window approaches for optimal RAG retrieval.
Chunking determines how documents are split for embedding. Fixed-size chunks are simple but may break semantic units. Semantic chunking splits at natural boundaries. Recursive chunking tries multiple separators hierarchically. Sentence-window chunking embeds single sentences but retrieves surrounding context. Most systems use 256–1024 tokens with 10–20% overlap.
Chunking is the most underrated lever in RAG. The chunks you index define what the retriever can return — not what's in the source document. A good question can fail to retrieve the right answer because the answer was split across two chunks, or buried in a chunk that's too large to score well against the query. Bad chunks cap the ceiling of every downstream optimization: better embeddings, better rerankers, and better prompts cannot recover information that was never in the candidate pool.
In our production RAG work, switching chunking strategy alone has moved Context Recall from ~60% to ~85% on the same corpus and embedding model. The gain is larger than most prompt-engineering changes and comes at one-time ingest cost rather than per-query latency.
Fixed-size chunking splits documents at a target token count (typically 256–512 tokens) with a configurable overlap (typically 10–20% of chunk size). It is the simplest strategy and the right default for prototyping.
Pros: trivial to implement, predictable index size, works acceptably on prose-heavy content (articles, reports, transcripts).
Cons: ignores semantic boundaries. A definition can be split from its example, a heading from its body, a table row from its header. This shows up as low Context Precision: the retriever returns the chunk containing the keyword but not the chunk with the actual answer.
For production systems beyond prototype, fixed-size is rarely the optimal choice — but it remains the right baseline to measure other strategies against.
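For reference, here is a minimal fixed-size splitter — a sketch, assuming the tiktoken tokenizer; any encoder with encode/decode works the same way:

```python
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size token windows with `overlap` tokens of overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: tokenizer choice is illustrative
    tokens = enc.encode(text)
    step = chunk_size - overlap  # each window starts `overlap` tokens before the previous one ends
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end of the document
            break
    return chunks
```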
Recursive chunking splits hierarchically: first by major separators (double newlines, section headers), then by minor separators (single newlines, sentences), only falling back to character splits if no semantic boundary fits within the size budget. LangChain's RecursiveCharacterTextSplitter is the most common implementation. This preserves paragraph and section integrity at low engineering cost.
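A minimal LangChain sketch (the file path is a placeholder; note that `chunk_size` counts characters by default, so use `RecursiveCharacterTextSplitter.from_tiktoken_encoder` if you want a token budget instead):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("report.txt") as f:  # hypothetical input document
    document_text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    # Separators are tried in order: paragraphs, then lines, then sentences,
    # then words, then raw characters as a last resort.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
```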
Semantic chunking goes further: it embeds sentences, then groups adjacent sentences whose embeddings are similar (above a cosine threshold) into the same chunk. The implementation costs one embedding pass per sentence at ingest time but produces chunks that genuinely correspond to topic units. For Q&A over unstructured prose (interview transcripts, meeting notes, research reports), semantic chunking typically improves Context Recall by 10–20% over fixed-size.
The tradeoff is variable chunk size, which complicates retrieval-time budget management (top-K may pull very different token volumes). Combine semantic chunking with a hard upper bound (e.g., max 1024 tokens per chunk, splitting on sentence if exceeded).
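A self-contained sketch of the grouping step, including the hard size cap (the model name is an assumption — any sentence encoder works — and whitespace word counts stand in for real token counts):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.75,
                    max_tokens: int = 1024) -> list[str]:
    """Group adjacent sentences into chunks while consecutive similarity stays high."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: encoder choice is illustrative
    embs = model.encode(sentences, normalize_embeddings=True)  # unit-norm, so dot == cosine
    chunks, current, current_len = [], [sentences[0]], len(sentences[0].split())
    for i in range(1, len(sentences)):
        sim = float(np.dot(embs[i - 1], embs[i]))
        n = len(sentences[i].split())  # crude token proxy
        # Start a new chunk at a topic shift or when the hard size cap would be exceeded.
        if sim >= threshold and current_len + n <= max_tokens:
            current.append(sentences[i])
            current_len += n
        else:
            chunks.append(" ".join(current))
            current, current_len = [sentences[i]], n
    chunks.append(" ".join(current))
    return chunks
```

Production variants often compare each sentence against the running centroid of the current chunk rather than only the previous sentence, which is more robust to one-off transitional sentences.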
For PDFs, HTML, Markdown, or Word documents with explicit structure, layout-aware chunking dominates pure text-based approaches. Tools like Unstructured.io, LlamaParse, and Docling extract document structure (headings, tables, lists, figures) and let you chunk along those boundaries.
Effective patterns:
- Split at heading boundaries so each chunk covers exactly one section, and prepend the heading path to the chunk text so the embedding carries document position.
- Keep tables atomic: never separate rows from their header row, and keep a table's caption with it.
- Treat lists and code blocks as indivisible units rather than splitting them mid-item.
Layout-aware chunking is what separates a RAG demo from a RAG product on enterprise documents. The engineering cost is real (parsing, structure extraction, edge cases) but the retrieval quality gain on structured content is the largest single improvement available.
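As one example, a sketch using Unstructured's title-based chunker (the filename is a placeholder and the parameter values are starting points, not tuned recommendations):

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="annual_report.pdf")  # detects headings, tables, lists
chunks = chunk_by_title(
    elements,
    max_characters=2048,             # hard upper bound per chunk
    combine_text_under_n_chars=256,  # fold tiny sections into their neighbors
)
for chunk in chunks:
    print(type(chunk).__name__, chunk.text[:80])
```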
Overlap is the simplest defense against semantic-boundary splits. Standard guidance: 10–20% overlap (e.g., 50–100 tokens for a 512-token chunk). The reason is straightforward: if an answer happens to straddle two chunks, overlap ensures it appears intact in at least one of them.
Larger overlaps (above 25%) waste index space and inflate retrieval-time token consumption without proportional quality gains. Smaller overlaps (under 5%) provide little protection. The exception is sentence-window retrieval: index single sentences, then retrieve neighboring sentences (±N) at query time. This decouples retrieval granularity from the context window passed to the LLM.
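A sketch of sentence-window retrieval with LlamaIndex (the directory path and parameter values are placeholders):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

docs = SimpleDirectoryReader("docs/").load_data()  # hypothetical corpus

parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # store the ±3 neighboring sentences as metadata on each node
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = parser.get_nodes_from_documents(docs)

index = VectorStoreIndex(nodes)  # embeds single sentences
query_engine = index.as_query_engine(
    similarity_top_k=5,
    # Replace each retrieved sentence with its surrounding window before the LLM sees it.
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
```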
Parent-child (sometimes called "small-to-big") chunking solves a fundamental tension: small chunks score better in retrieval (specific, focused), but small chunks lack the surrounding context an LLM needs to answer well.
The pattern:
1. Split the corpus into large parent chunks (e.g., 1024–2048 tokens), then split each parent into small child chunks (e.g., 128–256 tokens).
2. Embed and index only the children.
3. At query time, match against the children, but return their parent chunks (or merge sibling children back into the parent) as the context passed to the LLM.
LlamaIndex's AutoMergingRetriever implements this. It is particularly effective for FAQ-style queries over long-form content (legal contracts, technical manuals) where the answer is a specific sentence but understanding it requires the surrounding context.
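A sketch of the LlamaIndex setup (the path and chunk sizes are illustrative):

```python
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

docs = SimpleDirectoryReader("contracts/").load_data()  # hypothetical corpus

# Three-level hierarchy: 2048-token parents down to 128-token leaves.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)

storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)  # parents live in the docstore

index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)  # embed only leaves
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6),  # match on small, focused leaves...
    storage_context,                         # ...merge into parents when enough siblings hit
)
results = retriever.retrieve("What notice period does termination require?")
```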
Do not pick a chunking strategy by intuition. Build a small evaluation set (50–200 questions with known-good answer documents) and measure Context Recall (does the retrieved set contain the answer?) and Context Precision (how much of the retrieved set is relevant?) for each candidate strategy. Ragas and TruLens both implement these metrics out of the box.
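A minimal sketch using the Ragas v0.1-style API (newer releases restructure the dataset classes, but the shape of the evaluation is the same; the single row below is purely illustrative):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# One row per evaluation question: `contexts` is what your retriever returned,
# `ground_truth` is the known-good answer. This example row is fabricated for illustration.
eval_data = Dataset.from_dict({
    "question":     ["What notice period does the contract require for termination?"],
    "contexts":     [["Either party may terminate this agreement with 90 days written notice."]],
    "ground_truth": ["90 days written notice."],
})

scores = evaluate(eval_data, metrics=[context_precision, context_recall])
print(scores)
```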
In our benchmarks across enterprise document corpora, layout-aware chunking with parent-child retrieval consistently outperforms fixed-size and pure semantic chunking on both Context Recall (typically 80–90%) and Context Precision (60–75%). But the magnitude of the gap depends entirely on document structure: on flat prose corpora, semantic chunking is competitive at lower engineering cost.
For Indian enterprises building production RAG — across Bangalore, Coimbatore, and elsewhere — we typically run a 1–2 week chunking benchmark before locking in production architecture. We build a representative evaluation set from your actual user queries, implement 3–4 chunking strategies on a sample of your corpus, and measure Context Recall and Precision side by side. The output is a chunking strategy chosen by measured performance on your data, not by what worked on someone else's.
This evidence-based approach has saved clients from premature optimization (e.g., investing in layout parsing when their corpus was already flat prose) and from premature simplification (e.g., shipping fixed-size chunks on a corpus where heading-aware splits would have improved Recall by 20 points).
The order matters: start from a fixed-size baseline, move to recursive splitting, exploit explicit structure with layout-aware chunking, and layer on parent-child retrieval where answers need surrounding context; only then reach for semantic chunking. Most teams skip straight to semantic chunking because it sounds sophisticated. In our experience, structure-aware chunking on enterprise content beats semantic chunking nearly every time at lower runtime cost.
From guide to production
Our team has hands-on experience implementing these systems. Book a free architecture call to discuss your specific requirements and get a clear delivery plan.