Chunking Strategies for RAG: Fixed-Size, Semantic, and Hierarchical

Chunking is the first decision in any RAG pipeline and one of the most consequential. Before an embedding model can index your documents or a retriever can search them, every document must be split into chunks — the units of text that will be stored in your vector database and returned as context. Chunk size and strategy directly determine retrieval quality: chunks that are too small lose the surrounding context needed to interpret a sentence; chunks that are too large dilute the relevant signal with unrelated content and push retrieval precision down. Getting chunking right is often worth more than switching embedding models or vector databases, yet it receives far less attention in most RAG implementations.

This guide covers the four main chunking approaches used in production: fixed-size chunking, recursive character splitting, semantic chunking, and hierarchical (parent-child) chunking. Each solves a different failure mode, and the right choice depends on your document types, query patterns, and latency budget.

Fixed-Size Chunking

Fixed-size chunking splits documents into chunks of exactly N tokens (or characters) with an optional overlap between consecutive chunks. It is the simplest approach, the fastest to implement, and the correct baseline to establish before trying anything more complex. Overlap is important: without it, a sentence that straddles a chunk boundary is split in half, with each half appearing in a different chunk with no shared context. An overlap of 10–20% of the chunk size is standard:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,           # tokens per chunk
    chunk_overlap=64,         # overlap between chunks (~12.5%)
    length_function=count_tokens,
    separators=['\n\n', '\n', '. ', ' ', ''],  # try these in order
)

with open('document.txt') as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f'{len(chunks)} chunks, avg {sum(count_tokens(c) for c in chunks)/len(chunks):.0f} tokens')

RecursiveCharacterTextSplitter is the standard implementation: it tries to split on paragraph breaks first, then newlines, then sentence boundaries, then spaces, falling back to hard character splits only when necessary. This respects natural text boundaries better than a pure character-count split. Use token count rather than character count as your length function — the same 512-character string can be 80 tokens or 200 tokens depending on vocabulary, and embedding models have token limits, not character limits.

Fixed-size chunking works well when documents have consistent structure (all roughly the same length and topic density), when speed of indexing matters more than retrieval precision, and as a baseline to beat. Its main failure mode is semantic incoherence: a 512-token window cut at a fixed boundary will sometimes land mid-argument, mid-table, or mid-code block, producing chunks that are syntactically complete but contextually incomplete. For documents with high information density and heterogeneous structure — technical manuals, legal contracts, academic papers — the quality ceiling of fixed-size chunking is meaningfully lower than semantic approaches.

Semantic Chunking

Semantic chunking splits documents at points of high semantic discontinuity rather than at fixed intervals. The algorithm embeds consecutive sentences, computes cosine similarity between adjacent sentence embeddings, and inserts a chunk boundary wherever the similarity drops below a threshold — indicating a topic shift. This produces chunks that are semantically coherent by construction: each chunk contains a complete thought or topic segment rather than an arbitrary window of tokens:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# or use a local model:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    model_kwargs={'device': 'cpu'},
)

semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type='percentile',  # split where the distance between
    breakpoint_threshold_amount=95,          # adjacent sentences exceeds the 95th percentile
)

chunks = semantic_splitter.split_text(text)
print(f'{len(chunks)} semantic chunks')

Semantic chunking produces higher-quality chunks than fixed-size splitting for heterogeneous documents, but at the cost of embedding every sentence during indexing — roughly 3–5x the indexing cost of fixed-size chunking. The resulting chunks also have variable length, which complicates batch processing and may produce some very long chunks (for dense sections with high within-topic similarity) that approach or exceed embedding model context limits. Always check the distribution of chunk lengths after semantic chunking and add a maximum chunk length safeguard that falls back to recursive splitting for oversized chunks.
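That safeguard is simple to implement. The sketch below is a minimal, library-free version: it assumes a `count_tokens` function like the one defined earlier (approximated here by a whitespace word count) and recursively halves any oversized chunk at a word boundary; a production version would fall back to the recursive splitter instead.

```python
def enforce_max_length(chunks, max_tokens, count_tokens):
    """Recursively halve any chunk that exceeds max_tokens."""
    safe = []
    for chunk in chunks:
        if count_tokens(chunk) <= max_tokens:
            safe.append(chunk)
            continue
        # Oversized: split at the middle word boundary and re-check both halves
        words = chunk.split()
        mid = len(words) // 2
        halves = [' '.join(words[:mid]), ' '.join(words[mid:])]
        safe.extend(enforce_max_length(halves, max_tokens, count_tokens))
    return safe

# Stand-in token counter; swap in a real tokenizer in practice
count = lambda text: len(text.split())
chunks = ['short chunk', 'one two three four five six seven eight']
print([count(c) for c in enforce_max_length(chunks, 4, count)])  # [2, 4, 4]
```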

Hierarchical (Parent-Child) Chunking

Hierarchical chunking solves a fundamental tension in RAG: small chunks retrieve more precisely (high semantic specificity), but large chunks provide more context to the model (better generation quality). Parent-child chunking resolves this by indexing small child chunks for retrieval but returning their larger parent chunk as the context passed to the model. You get the retrieval precision of small chunks with the generation quality of large chunks:

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
vectorstore = Chroma(embedding_function=embeddings)
docstore = InMemoryStore()  # use Redis or a DB for production

# Child splitter: small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20, length_function=count_tokens)
# Parent splitter: larger chunks for context-rich generation
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=80, length_function=count_tokens)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

from langchain.schema import Document
docs = [Document(page_content=text, metadata={'source': 'document.txt'})]
retriever.add_documents(docs)

# At query time: retrieves child chunk by similarity, returns parent for context
results = retriever.get_relevant_documents('What is the refund policy?')
# results contain the full parent chunks, not the small child chunks

Parent-child chunking is the right choice for dense technical documents, legal text, or any corpus where the answer to a question depends on its surrounding context — a specific clause in a contract only makes sense in the context of the section it belongs to; a code snippet only makes sense with the surrounding explanation. The cost is doubled indexing (both parent and child chunks must be stored) and the added complexity of the two-level retrieval. The InMemoryStore in the example above is only suitable for development; in production, use a persistent key-value store like Redis for the docstore, with the chunk ID as the key and the parent document as the value.

Document-Aware Chunking

For structured documents — PDFs with sections and headers, Markdown files, HTML pages, code files — document-aware chunking uses the document’s own structure as natural chunk boundaries. Splitting a PDF at section headings, splitting a Markdown file at H2/H3 headers, or splitting a Python file at function or class definitions produces chunks that correspond to meaningful semantic units, with no risk of splitting mid-argument. This approach requires a parser that understands the document format:

from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split Markdown by headers — each chunk contains one section
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ('#', 'H1'), ('##', 'H2'), ('###', 'H3'),
    ],
    strip_headers=False,  # keep headers in chunk for context
)

with open('docs.md') as f:
    md_text = f.read()

md_chunks = md_splitter.split_text(md_text)
# Each chunk has metadata: {'H1': 'Introduction', 'H2': 'Installation', ...}
for chunk in md_chunks:
    print(chunk.metadata, '->', chunk.page_content[:80])

Document-aware chunking produces the highest-quality chunks for structured corpora but requires format-specific parsing logic for each document type. For a corpus with a consistent format (all Markdown, all HTML from a CMS, all Python source), it is almost always worth implementing. For a heterogeneous corpus (PDFs, Word docs, web pages, CSVs), the parsing overhead may outweigh the quality improvement over well-tuned semantic chunking.

Choosing a Chunking Strategy

Start with recursive fixed-size chunking at 512 tokens with 10% overlap as your baseline. Measure context recall on your eval set — if it is above 0.85, fixed-size chunking may be sufficient and adding complexity is not justified. If recall is lower, diagnose why: inspect the missed chunks and ask whether the failures are due to semantic incoherence (switch to semantic chunking), context dependency (switch to parent-child), or document structure misalignment (switch to document-aware). For most production RAG systems, the highest-leverage improvements come from the retrieval layer (re-ranking, hybrid search) rather than chunking strategy changes after a reasonable baseline is established — but a poor chunking strategy can set a ceiling on retrieval quality that no amount of retrieval sophistication can overcome.

Chunk size deserves empirical tuning on your specific corpus and query distribution. The optimal chunk size varies significantly by domain: for dense technical documentation, 256–512 tokens often works best; for narrative text or customer support conversations, 512–1024 tokens is typical; for legal text with long clause dependencies, 1024 tokens or parent-child with 256/1024 split is often required. Run an ablation across chunk sizes on your RAGAS eval set and pick the size with the best context recall before tuning anything else — chunk size is the highest-impact single hyperparameter in a RAG pipeline and is almost never worth leaving at a default without validation.
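The ablation itself can be a short loop. The sketch below is schematic rather than a full RAGAS harness: `build_chunks` is a naive character windower and `evaluate_recall` is a placeholder score table standing in for a real eval run on your own query set.

```python
def ablate_chunk_size(text, sizes, build_chunks, evaluate_recall):
    """Chunk the corpus at each candidate size and score each variant."""
    scores = {size: evaluate_recall(build_chunks(text, size)) for size in sizes}
    best = max(scores, key=scores.get)
    return best, scores

# Naive character windower; in practice use your real splitter
build = lambda text, size: [text[i:i + size] for i in range(0, len(text), size)]
# Placeholder scores standing in for RAGAS context recall on an eval set
fake_recall = {256: 0.71, 512: 0.88, 1024: 0.79}
evaluate = lambda chunks: fake_recall[len(chunks[0])]

best, scores = ablate_chunk_size('x' * 2048, [256, 512, 1024], build, evaluate)
print(best)  # 512
```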

Chunk Metadata and Its Impact on Retrieval

Every chunk should carry metadata that travels with it through indexing and retrieval. At minimum: the source document identifier, the page or section number, the position of the chunk within the document (first, middle, last), and the document creation or last-modified date. This metadata serves three purposes. First, it enables metadata filtering at retrieval time — if a user asks a question that is clearly scoped to a specific document or date range, filtering by metadata before similarity search dramatically improves precision without any additional re-ranking overhead. Second, it enables source attribution in the generated response — users and auditors can verify claims against the original document. Third, it enables position-aware retrieval heuristics: the first chunk of a document often contains a title and abstract that answers a different class of questions than body chunks, and some retrieval systems weight first-chunk matches differently.
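As a concrete sketch of that minimum metadata set (plain dicts here; a real pipeline would use its vector store's document schema):

```python
def attach_metadata(chunks, source, doc_date):
    """Wrap each chunk with source, index, position, and date metadata."""
    n = len(chunks)
    records = []
    for i, chunk in enumerate(chunks):
        position = 'first' if i == 0 else ('last' if i == n - 1 else 'middle')
        records.append({
            'text': chunk,
            'metadata': {
                'source': source,
                'chunk_index': i,
                'position': position,   # enables position-aware retrieval heuristics
                'doc_date': doc_date,   # enables date-range filtering
            },
        })
    return records

records = attach_metadata(['intro...', 'body...', 'conclusion...'],
                          source='document.txt', doc_date='2024-05-01')
print(records[0]['metadata']['position'])  # first
```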

For PDF documents, the most useful additional metadata is the section heading of the chunk’s parent section, extracted during parsing. A chunk from page 47 under “Section 4.2: Indemnification” is far more retrievable for indemnification queries than the same text with metadata only recording page 47. Adding section headers to chunk metadata is straightforward with unstructured.io or LlamaParse during the ingestion step, and it consistently improves precision on structured document corpora in production settings.

Handling Special Content: Tables, Code, and Lists

Standard text splitting algorithms handle prose well but degrade badly on structured content. Tables are the most common problem: splitting a table across chunk boundaries produces two chunks each containing partial rows, neither of which is retrievable or coherent. The fix is to detect tables during parsing and treat each table as an atomic unit — never split across a table boundary regardless of size. For very large tables (more than 1024 tokens), either summarise the table into a text description for indexing while storing the full table for retrieval, or split by row groups with the column headers repeated in each chunk so each chunk is self-contained.
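For the row-group variant, the key detail is repeating the header in every chunk. A minimal sketch for a Markdown table, assuming the first two lines are the header row and separator row:

```python
def split_table_by_rows(table_lines, rows_per_chunk):
    """Split a table into row groups, repeating the header row and
    separator in every chunk so each chunk is self-contained."""
    header, rows = table_lines[:2], table_lines[2:]
    return [
        '\n'.join(header + rows[i:i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]

table = [
    '| name | qty |',
    '| --- | --- |',
    '| apples | 3 |',
    '| pears | 5 |',
    '| plums | 7 |',
]
chunks = split_table_by_rows(table, 2)
print(len(chunks))  # 2
```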

Code blocks in technical documentation have the same problem: splitting a code block mid-function produces an un-parseable fragment. Treat code blocks as atomic units during chunking — split before the code block begins or after it ends, never inside it. For documentation that contains many large code examples, a practical pattern is to generate two representations of each code block: the code itself (for retrieval when the query is about implementation) and a natural language summary of what the code does (for retrieval when the query is about concepts), and index both, linking them to the same parent document for generation.
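A lightweight way to enforce that atomicity without a full Markdown parser is to partition the text on fenced code blocks first and run the prose splitter only on the non-code segments. A sketch, where `prose_split` is a naive blank-line split standing in for your real splitter (the fence string is built programmatically just to keep the example tidy):

```python
import re

FENCE = '`' * 3  # a literal triple-backtick Markdown fence

def split_keeping_code_atomic(text, splitter):
    """Split prose with `splitter`, but keep each fenced code block whole."""
    pattern = '({0}.*?{0})'.format(re.escape(FENCE))
    parts = re.split(pattern, text, flags=re.DOTALL)
    chunks = []
    for part in parts:
        if part.startswith(FENCE):
            chunks.append(part)            # code block: atomic, never split
        elif part.strip():
            chunks.extend(splitter(part))  # prose: split normally
    return chunks

prose_split = lambda t: [p.strip() for p in t.split('\n\n') if p.strip()]
text = 'Intro paragraph.\n\n{0}python\nprint("hi")\n{0}\n\nClosing paragraph.'.format(FENCE)
chunks = split_keeping_code_atomic(text, prose_split)
print(len(chunks))  # 3
```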

Evaluating Chunking Quality

Chunking quality is best evaluated through its downstream effect on RAGAS context recall — the fraction of relevant information that appears somewhere in the retrieved chunks. But you can also evaluate chunking in isolation before running full retrieval, by inspecting three properties of your chunk set: coverage (does every part of every document appear in at least one chunk?), coherence (is each chunk a semantically self-contained unit — does it make sense to a reader with no surrounding context?), and size distribution (are chunks within an acceptable range — no very short stubs below 50 tokens that contain too little signal, and no chunks exceeding the embedding model’s context limit). Automated coverage is easy to verify; coherence requires human review of a random sample of 20–30 chunks per document type. This manual review step is worth doing before deploying to production — it catches systematic chunking failures like table splits, header-only chunks, and encoding artefacts that automated metrics miss entirely.
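The size-distribution check in particular is a few lines of code. A minimal audit sketch, assuming a `count_tokens` function like the one defined earlier (whitespace word count used here as a stand-in):

```python
def audit_chunk_sizes(chunks, count_tokens, min_tokens=50, max_tokens=512):
    """Report size distribution plus indices of stubs and oversized chunks."""
    sizes = [count_tokens(c) for c in chunks]
    return {
        'min': min(sizes),
        'max': max(sizes),
        'mean': sum(sizes) / len(sizes),
        'stubs': [i for i, s in enumerate(sizes) if s < min_tokens],
        'oversized': [i for i, s in enumerate(sizes) if s > max_tokens],
    }

count = lambda text: len(text.split())
chunks = ['word ' * 10, 'word ' * 100, 'word ' * 600]
report = audit_chunk_sizes(chunks, count)
print(report['stubs'], report['oversized'])  # [0] [2]
```

Stub and oversized indices are worth logging per document type; a systematic cluster of stubs usually points at a parsing problem (header-only chunks, table fragments) rather than a tuning problem.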

One hour of manual chunk inspection before launch consistently saves days of debugging retrieval failures after it — the failure modes that automated metrics miss are almost always visible to a human reader in the first 30 chunks reviewed. Budget that time before going to production. Treat chunking as a first-class engineering decision, not a default configuration choice.
