Stop Chunking Blindly: How Flat Splits Break Your RAG Pipeline Before It Even Starts



This content originally appeared on Level Up Coding - Medium and was authored by Hammad Abbasi

What’s Missing in 90% of RAG Pipelines (And Why It Matters)

Image by Author

Retrieval-Augmented Generation (RAG) has been sold as the cure to large language model forgetfulness. Take your knowledge base, slice it into chunks, embed them, and suddenly the model can answer domain-specific questions with perfect recall.

That’s the story told in blog posts and slide decks. The reality, as anyone who has built a RAG pipeline for real documents knows, is more complicated.

Flat chunking — the act of splitting a document into arbitrary windows of 512 or 1000 tokens — works just well enough to impress in a demo. But as your corpus grows beyond toy examples, the cracks begin to show. Answers are partial. Contexts are mismatched. Hallucinations creep in.

The problem isn’t with embeddings or models. The problem is upstream. We are throwing away too much information when we flatten a rich, structured document into dumb slices.

In this article, I’ll explore how we can do better — by respecting document structure, enriching passages with metatags, and refining text before embedding. Together, these three practices can turn a brittle retrieval pipeline into one that feels precise, grounded, and almost human in how it navigates information.

But before we climb toward solutions, let’s spend time with the old default: flat chunking.

The Convenient Trap of Flat Chunking

Flat chunking is the bread and butter of today’s RAG systems. A document is chopped into slices of a fixed size, each slice becomes an embedding, and when a user asks a question, the retriever scans for the top-k closest chunks. Simple. Easy. “Good enough.”

Or so it seems.

Why did it become dominant? Because it aligns perfectly with the ecosystem. Embedding APIs encourage fixed token limits. Vector databases are designed for uniform vectors. Every RAG tutorial online repeats the same recipe, so new practitioners assume it must be best practice.

But flat chunking is a trap. Its convenience hides three chronic failures:

  1. Loss of semantic boundaries: chunks ignore headings, section titles, and natural breaks.
  2. Fragmented answers: content spanning two chunks may never be retrieved together.
  3. Noise and padding: arbitrary slices drag irrelevant text along, diluting meaning.

A Compliance Example

Suppose you are embedding a 200-page regulation. A user asks:

“What are the exceptions to quarterly filing requirements?”

Flat chunk retrieval finds the word “exceptions” in a chunk. Unfortunately, that chunk belongs to a section about subsidiary reporting, not quarterly filings. The LLM confidently hallucinates an incorrect answer.

The failure isn’t in the model’s reasoning. It’s in the retrieval layer serving it the wrong context.

The Shape of Documents: An Untapped Signal

Image by Author

Humans don’t read documents this way. We don’t ingest a book as arbitrary 500-word blocks. We rely on structure.

We skim tables of contents. We jump to sections and subsections. We know whether a passage is a definition, a procedure, or an appendix, because the layout tells us.

Structure is a built-in navigational system. Ignoring it is like ripping the street signs off a city map and then asking people to drive around.

Here are practical ways to put that structure to work:

Use Document Skeletons

PDFs, Word docs, HTML pages, and Markdown files usually carry clear markers: headings, lists, paragraphs, and captions. Instead of ignoring those and splitting by raw token count, parse them and chunk by section. Tools like pdfplumber, PyMuPDF, or HTML DOM parsers can pull out these boundaries.

This way, an entire section — for example “Section 2.3 — Payment Terms” — stays together. You don’t end up with half a clause in one chunk and half in another, which makes retrieval and answers less coherent.
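As a concrete sketch, here is how section-based chunking might look for Markdown input. This is a minimal example assuming headings mark the section boundaries; for PDF or HTML sources you would recover the same skeleton with pdfplumber, PyMuPDF, or a DOM parser.

```python
import re

def chunk_by_headings(markdown_text: str) -> list[dict]:
    """Split Markdown into one chunk per heading-delimited section,
    keeping each section title attached to its body text."""
    # Capture lines like "# Title" or "## 2.3 Payment Terms".
    heading_re = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)
    chunks, last_title, last_end = [], "Preamble", 0
    for match in heading_re.finditer(markdown_text):
        body = markdown_text[last_end:match.start()].strip()
        if body:
            chunks.append({"section": last_title, "text": body})
        last_title = match.group(2).strip()
        last_end = match.end()
    tail = markdown_text[last_end:].strip()
    if tail:
        chunks.append({"section": last_title, "text": tail})
    return chunks
```

Because the heading travels with its body, a clause never gets sliced away from the section title that gives it meaning.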

Add Guardrails

Sometimes sections are too long. A single appendix or a transcript can run into thousands of tokens. In those cases, you split on the heading first, then apply a secondary token cap — often around 1,000 tokens — to stop runaway sizes.

This hybrid approach means you keep natural boundaries intact while still controlling for cost, latency, and embedding limits.
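A minimal sketch of that guardrail, using whitespace word count as a rough token proxy; a production pipeline would count with the embedding model's own tokenizer instead.

```python
def cap_section(text: str, max_tokens: int = 1000) -> list[str]:
    """Secondary split: keep a section whole if it fits the budget,
    otherwise break it at paragraph boundaries. Note that a single
    paragraph larger than the budget still becomes one oversized
    piece; handling that needs a sentence-level fallback."""
    def n_tokens(s: str) -> int:
        # Crude proxy: whitespace-separated words stand in for tokens.
        return len(s.split())

    if n_tokens(text) <= max_tokens:
        return [text]

    pieces, current = [], []
    for para in text.split("\n\n"):
        if current and n_tokens("\n\n".join(current + [para])) > max_tokens:
            pieces.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```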

Create Boundaries When None Exist

Not all text comes with structure. Meeting notes, logs, and transcripts can be a wall of words. In these cases, you can use a small LLM to segment the text into self-contained idea units.

A simple prompt works: “Split this text into segments where each expresses one complete idea, with each segment under ~1,000 tokens.” The result is cleaner and more meaningful than a raw sliding window.

Adapt to the Domain

Different domains demand different splitting rules. A legal contract is best split by clauses and subclauses. A medical report breaks naturally into diagnosis, history, treatment, and follow-up. A technical document works well when split by function, class, or module boundaries.

Regexes, parsers, or lightweight rules can capture these signals without overcomplicating the pipeline.
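For the legal case, here is a sketch of a clause-aware splitter. The numbering pattern is an assumption about the corpus; each document family will need its own rule, and bare numbers at line starts are guarded against only loosely.

```python
import re

def split_by_clauses(contract_text: str) -> list[dict]:
    """Domain-aware splitter: break a contract on clause numbers like
    "5" or "5.2" at the start of a line, keeping the clause number
    with its text. The lookahead for a capital letter is a cheap
    guard against lines that merely begin with a number."""
    clause_re = re.compile(r"^(\d+(?:\.\d+)*)\s+(?=[A-Z])", re.MULTILINE)
    matches = list(clause_re.finditer(contract_text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(contract_text)
        chunks.append({
            "clause": m.group(1),
            "text": contract_text[m.start():end].strip(),
        })
    return chunks
```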

Keep the Labels (Important)

Once you’ve chunked, don’t throw away the structure. Store the section heading, page number, or document title as metadata alongside the embedding.

This makes retrieval far more powerful. Instead of ranking only by vector similarity, you can filter or rerank based on where a passage came from; for example, preferring “Conclusion” sections when a user asks for a summary.
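A sketch of what that buys you at query time. The chunk shape and the `prefer_sections` helper are illustrative, not a specific vector database API:

```python
def prefer_sections(chunks: list[dict], wanted: str) -> list[dict]:
    """Rerank retrieved chunks so those whose stored section label
    matches `wanted` come first, preserving retrieval order otherwise
    (sorted() is stable). Assumes each chunk kept its heading as
    metadata at index time."""
    return sorted(
        chunks,
        key=lambda c: 0 if wanted.lower() in c["section"].lower() else 1,
    )

# Chunks as they might come back from a vector search, with the
# section label stored alongside each embedding at index time.
retrieved = [
    {"section": "Methodology", "text": "We sampled 500 filings..."},
    {"section": "Conclusion",  "text": "In summary, exceptions are rare."},
    {"section": "Appendix C",  "text": "Raw tables of filing dates."},
]
```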

Beyond Structure: Metatags and Refinement

Even with structure, embeddings only see text. They don’t know that a passage comes from Appendix C versus the Executive Summary — unless you tell them. That’s where metatag enrichment enters.

Metatags are labels like:

  • Source: Finance_Regulations_2024.pdf
  • Section: §3.4 Exceptions
  • Category: Compliance > Filing
  • Date: 2024-01-10
  • Quality: OCR-cleaned

These signals let you filter at retrieval time: “only 2024 regulations,” “just appendices,” “ignore draft versions.”

Refinement adds another layer: cleaning and normalizing text before embedding. Stripping boilerplate, aligning segmentation with paragraphs, prefixing headings — these are simple but high-impact steps.

Together, refinement + structure + metatags produce embeddings that are sharper, cleaner, and contextually richer than raw slices.
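A minimal refinement pass might look like this; the boilerplate patterns are placeholders you would tune to your own corpus:

```python
import re

def refine_for_embedding(text: str, heading: str) -> str:
    """Light refinement before embedding: drop obvious boilerplate
    lines, collapse whitespace, and prefix the section heading so
    the embedding carries its context."""
    # Illustrative patterns only: page footers and legal stamps.
    boilerplate = re.compile(
        r"^(page \d+( of \d+)?|confidential.*|all rights reserved.*)$",
        re.IGNORECASE,
    )
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not boilerplate.match(line.strip())
    ]
    return f"{heading}\n" + " ".join(kept)
```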

Enter the Naive Alternative: Dump the Whole Document

At this point, some readers might interject: “Wait. Why not just skip retrieval altogether? Models are getting million-token context windows. Why not throw the whole document in and let the LLM figure it out?”

It’s a seductive idea. A few blog posts even argue that RAG will become obsolete as windows grow larger. Why bother indexing, chunking, or tagging when you can just dump everything into the prompt?

Here’s why that’s misguided.

The Illusion of Infinite Context

  1. Attention dilution
    Transformers scale poorly. The more tokens you feed, the thinner the model’s attention gets. A million tokens doesn’t mean a million tokens of focus — it means a million tokens of clutter.
  2. Latency and cost
    Feeding 500,000 tokens is not cheap. You pay for every token in compute, latency, and memory, and at production volumes the costs spiral.
  3. No internal navigation
    A model doesn’t “jump to section 3.4” the way a human does. Without explicit cues, it must scan the entire blob. That scanning is brittle, error-prone, and inconsistent.
  4. Update rigidity
    Documents evolve. If you feed the whole thing each time, you resend redundant material. Retrieval pipelines let you update and serve only what changed.
  5. Pollution risk
    Noisy sections, OCR errors, or irrelevant appendices all flood the prompt. The model now has to wade through junk.

Full Document Dump vs. Retrieval

Image by Author
Dumping is brute force. Retrieval is surgical.

Lost in the Middle: When Even Infinite Context Isn’t Enough

Even if you could feed the entire document into the model, that doesn’t guarantee the model will find the correct passage — especially if the answer lives in the middle of that input. This phenomenon is known in research as the “lost in the middle” problem.

What is “Lost in the Middle”?

A paper titled Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023) shows empirically that many LLMs struggle when the relevant information sits in the middle of a long context.

Even models with extended context support show a U-shaped performance curve: they do well when the relevant content is near the start or near the end, but performance degrades sharply when the content is in the middle.

Here’s the intuition: models tend to show primacy bias (favoring early tokens) and recency bias (favoring late tokens), and they struggle to robustly use information residing in the middle.

Key Findings from the Paper

  • They run experiments on multi-document QA and key-value retrieval tasks, controlling where the relevant passage sits (start, middle, end).
  • They show performance falls when the passage is pushed into the middle, even when the model has capacity for long context.
  • They also test different architectures and prompt formats to see how robust models are to positional shifts.

In short: the presence of all tokens is not enough — how the model treats positions matters.

Why this matters for RAG systems

The “lost in the middle” problem is dangerous for RAG because many systems do rely on giving long context or pushing many retrieved chunks. If the correct chunk ends up in a “middle-of-prompt” position, the model might ignore it — even if it was included.

This adds another layer of brittleness to the “dump entire doc” idea: even if you include the correct text, the model might underweight it. In contrast, retrieval + reranking + structure makes sure the relevant content is placed near the frontier where primacy/recency biases favor it.
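One practical mitigation is to reorder the final shortlist so the strongest chunks sit at the edges of the prompt, where primacy and recency biases help rather than hurt. A sketch (some RAG frameworks ship a similar reorderer as a built-in):

```python
def edge_reorder(ranked_chunks: list[str]) -> list[str]:
    """Counter the "lost in the middle" effect: interleave a
    relevance-ranked list so the best chunks land at the start and
    end of the prompt and the weakest land in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Even ranks go to the front, odd ranks to the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With five chunks ranked r1 through r5, the prompt order becomes r1, r3, r5, r4, r2: the two strongest chunks bracket the context.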

Reranking: The Precision Layer We Forget

We’ve already discussed why dumping whole documents into context fails and why structure plus metatags give retrieval its backbone. But there’s still one more layer that makes retrieval pipelines shine: reranking.

What is Reranking?

Reranking is the practice of taking an initial batch of candidate passages (say, top-30 chunks from your vector database) and then re-ordering them with a more precise model.

The first retrieval step is about recall: find enough potentially relevant chunks.
The reranking step is about precision: sort those chunks so the best ones float to the top.

Think of it like a funnel:

Initial retrieval (fast embeddings):
--------------------------------------------
Query → Top-30 candidates (approximate match)

Reranking (precision model):
--------------------------------------------
Top-30 → Re-score relevance → Top-5 final context
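In code, the funnel reduces to an over-fetch followed by a re-score. The word-overlap scorer below is a toy stand-in for a real cross-encoder, used only to keep the example self-contained:

```python
def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Funnel step two: re-score an over-fetched candidate list and
    keep the best few. `score` is a toy word-overlap stand-in for a
    real cross-encoder such as an MS MARCO MiniLM model."""
    def score(passage: str) -> float:
        q = set(query.lower().split())
        p = set(passage.lower().split())
        # Fraction of query words appearing verbatim in the passage.
        return len(q & p) / len(q)

    return sorted(candidates, key=score, reverse=True)[:top_k]
```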

How to Implement Reranking

There are three main approaches in practice:

1. Using LLMs as rerankers

Some teams simply ask the LLM itself to judge relevance. For each candidate passage, they prompt the LLM:

“Given the user query and this passage, rate how relevant the passage is to answering the query on a scale of 0–5.”

This works surprisingly well, but it’s slow and expensive, since each rerank involves multiple model calls. It’s suitable for high-value queries, but not scalable for millions of lookups per day.

2. Cross-Encoders (Specialized Reranking Models)

A more efficient and specialized option is to use cross-encoders — models fine-tuned for reranking tasks.

  • Unlike bi-encoders (embeddings) that embed query and passage separately, a cross-encoder takes query + passage together and produces a single relevance score.
  • This means it can capture fine-grained interactions between query terms and passage tokens.

Popular reranking models:

  • MS MARCO-trained cross-encoders (e.g., MiniLM, BERT-based rerankers)
  • ColBERT (late interaction model balancing cost vs. quality)
  • NVIDIA’s LoRA-tuned reranker models for RAG pipelines
  • Zilliz “hybrid reranker” combining BM25 sparse features with dense embeddings

These models are smaller than general-purpose LLMs, but optimized to do one job really well: reorder text candidates by relevance.

3. Hybrid Reranking Pipelines

Reranking alone gives you a big jump in quality, but you can go further. Instead of relying on a single signal, the strongest systems blend multiple retrieval signals before reranking. That way, the reranker isn’t just polishing the same rough set — it’s working with a richer pool of candidates.

Start with a richer candidate pool

Begin by pulling two views of the corpus in parallel. One view is semantic, built from embeddings, which cares about meaning even when the wording changes. The other view is lexical, built from a classical scorer like BM25, which cares about the exact words that appear. These two views rarely agree perfectly. That is the point. By retrieving from both, you raise the ceiling on what can be found.

For instance, when a user asks:

"What are the exceptions to quarterly filing?"

Two retrieval views run in parallel.

Semantic (Embeddings) finds passages that are close in meaning, even if the wording is different:

- "Companies may waive quarterly statements if revenue is below $1M annually."
- "Early termination allows skipping quarterly filings in special cases."
- "Exemptions apply to subsidiaries under consolidated reporting."

Lexical (BM25 / keyword) finds exact text matches for the query words:

- "Section 3.4: Exceptions to Quarterly Filing"
- "Quarterly filing exceptions are outlined below."
- "No exceptions to quarterly filing unless explicitly stated in section 3.4."

The system merges these candidates into a combined pool.

Now the reranker has more to work with: broad semantic hits that capture related ideas and precise lexical matches that lock onto clause numbers and key phrases.
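Merging the two pools is mostly bookkeeping: interleave the views, drop duplicates, and keep each list's internal ranking. A sketch:

```python
from itertools import zip_longest

def merge_pools(dense_hits: list[str], sparse_hits: list[str]) -> list[str]:
    """Merge semantic and lexical candidate lists into one pool for
    the reranker, interleaving the two views and dropping duplicates
    while preserving each list's own ordering."""
    merged, seen = [], set()
    for pair in zip_longest(dense_hits, sparse_hits):
        for hit in pair:
            if hit is not None and hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged
```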

Dense retrieval captures meaning

Embeddings are your net for everything the user might mean but did not quite say. A query about ending a contract early can surface passages about early termination, cancellation before expiry, or opt-out clauses. This is recall in the true sense. You are not limited to exact phrases. You are fishing in the neighborhood of the idea. The cost of that power is noise. Dense retrieval will happily bring material that sounds related but does not answer the question directly. That is why it must be paired.

Query:

"How do I terminate a contract early?"

Embedding hits (semantic neighbors):

- "Early cancellation of services is permitted under clause 5."
- "Termination before the contract expiry requires 30 days’ notice."
- "Opt-out options are available after the first year of service."

Dense search widens the net. It catches paraphrases and related phrases even when “terminate early” isn’t spelled out. That’s recall.

Sparse retrieval preserves exact signals

BM25 and other sparse methods reward literal matches. When a user asks for Section 14.2, dense retrieval might treat the number as incidental. Sparse retrieval will not. It will bring the clause that names it. The same is true for SKUs, IDs, dates, error codes, and any query where the string itself matters. Sparse retrieval is brittle for paraphrases, but it is unbeatable when the words are the intent.

Query:

"How do I terminate a contract early?"

BM25 hits (keyword matches):

- "Section 5.2 – Terminate contract early."
- "Termination clause for early exit."
- "No early termination allowed unless fee is paid."

Sparse search doesn’t care about synonyms. It finds the literal “terminate contract early.” That’s precision for exact wording — IDs, section numbers, and phrases that embeddings may gloss over.
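For intuition, here is a compact BM25 scorer; in practice you would reach for a library such as rank_bm25 or your search engine's built-in implementation:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Minimal BM25: reward exact term matches, weighted by how rare
    each term is across the corpus and normalized by document length."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    q_terms = query.lower().split()
    # Document frequency for each query term.
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        s = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += (idf * tf[t] * (k1 + 1)
                  / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl)))
        scores.append(s)
    return scores
```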

Metadata trims the candidate set

Not every clue lives in the body text. Some of the most helpful constraints are outside the passage. If a question asks for the current policy, you should not be ranking drafts from five years ago. If a question is about procedures, you should not be ranking marketing pages. This is where metadata filters do quiet but decisive work. By pruning on attributes like document type, section label, author, source system, jurisdiction, or last-modified date, you lower the noise floor before the reranker even runs.

Query:

"What are the current rules for data retention?"

Candidates before filtering:

- "Data retention policy, 2018 draft."   (outdated)
- "Finalized retention rules, updated May 2024." (relevant)
- "Test doc – retention policy sandbox." (irrelevant)
- "Data retention policy, 2018 draft."   (outdated)
- "Finalized retention rules, updated May 2024." (relevant)
- "Test doc – retention policy sandbox." (irrelevant)

After metadata filter:

- "Finalized retention rules, updated May 2024."

The reranker is the judge, not the fisherman

Once you have a merged, filtered set, you need a judge. A cross-encoder or a lightweight LLM reranker looks at the query and each candidate together and assigns a relevance score. It does not ask, “does this look similar in vector space?” It asks, “does this specific passage actually answer this specific question?” That difference is why rerankers reliably reorder middling lists into strong shortlists.

Query:

"What are the exceptions to quarterly filing?"

Candidate pool (from embeddings + BM25):

1. "Companies may waive quarterly statements if revenue is below $1M annually."
2. "Quarterly filing exceptions are outlined below." ← exact hit
3. "Section 3.4: Exceptions to Quarterly Filing." ← exact hit
4. "Exemptions apply to subsidiaries under consolidated reporting."
5. "No exceptions unless explicitly stated in section 3.4."

Reranked output (top 3):

1. "Section 3.4: Exceptions to Quarterly Filing."
2. "Quarterly filing exceptions are outlined below."
3. "Exemptions apply to subsidiaries under consolidated reporting."

The reranker reads the query against each passage and reorders based on actual relevance. It pushes the true “exceptions clause” to the top, instead of leaving it buried.

The Blend

Each stage does its job:

  • Dense -> “don’t miss the paraphrases.”
  • Sparse -> “don’t miss the literal hits.”
  • Metadata -> “keep only the right documents.”
  • Reranker -> “decide which ones actually answer the query.”

Together, they make retrieval broad and sharp — coverage plus accuracy.
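Put together, the blend is only a few stages of plumbing. Everything below is a toy stand-in (word overlap for dense similarity, raw term counts for sparse) so the shape of the flow stays visible:

```python
def answer_context(query: str, corpus: list[dict], top_k: int = 3) -> list[str]:
    """End-to-end sketch of the hybrid flow: dense and sparse
    retrieval in parallel, a metadata filter, then a rerank."""
    q = set(query.lower().split())

    def dense(c: dict) -> int:
        # Stand-in for embedding similarity.
        return len(q & set(c["text"].lower().split()))

    def sparse(c: dict) -> int:
        # Stand-in for BM25.
        return sum(c["text"].lower().count(t) for t in q)

    # Steps 1 and 2: retrieve from both views and merge, deduplicating.
    pool = list({c["text"]: c for c in
                 sorted(corpus, key=dense, reverse=True)[:5]
                 + sorted(corpus, key=sparse, reverse=True)[:5]}.values())
    # Step 3: metadata filter, e.g. drop drafts before reranking.
    pool = [c for c in pool if c.get("status") != "draft"]
    # Step 4: rerank the merged, filtered pool and keep a shortlist.
    return [c["text"] for c in sorted(pool, key=dense, reverse=True)[:top_k]]
```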

The Tradeoffs

  • Latency: Reranking adds extra milliseconds or seconds depending on candidate size.
  • Cost: Specialized models are cheaper than LLM rerankers but still cost more than embeddings.
  • Engineering complexity: You need two stages of retrieval, not one.

But the benefits outweigh these tradeoffs in domains where wrong answers are costly — compliance, healthcare, enterprise search, or customer-facing chat.

Reranking isn’t just a nice-to-have, it’s risk control. In compliance, a missed clause could expose you to fines; in customer support, the wrong troubleshooting step frustrates users and drives churn; in healthcare, a generic drug answer instead of the renal-impairment warning is a safety risk; in research or enterprise search, drowning people in “close enough” wastes time and credibility. Without reranking you’re handing customers almost-right answers — with it, you deliver the precise ones that protect trust and the business.

Building a Smarter Retrieval Flow

So what should a modern RAG pipeline look like? Not a conveyor belt of steps, but a layered flow that keeps meaning alive from start to finish.

Image by Author

It begins with refinement, where raw documents are cleaned and normalized so noise doesn’t poison retrieval later. Structure parsing keeps the natural skeleton of the text — sections, headings, tables — instead of flattening everything into token windows. Metatag enrichment adds labels like dates or authorship that let you tell lookalike passages apart.

From there, retrieval becomes a hybrid process. Embeddings capture meaning, sparse methods catch exact matches, metadata trims the noise, and a reranker decides which candidates truly answer the query. Answer generation then works on a precise, reliable shortlist — with citations attached so users can trust the output.

This is the shift: from throwing documents into a model and hoping for the best, to building a retrieval flow that acts like infrastructure. Clean inputs, structured signals, hybrid retrieval, and reranking all working together. That’s what makes RAG systems dependable in practice, not just in demos.

Conclusion: Keep It Simple, Keep It Smart

The mistakes are clear. Flat chunking throws away structure. Dumping entire documents overwhelms models and wastes compute. Big context windows don’t erase position bias — they make it worse.

The fixes are not magic. They’re practical:

  • Clean your documents before embedding.
  • Keep their natural structure.
  • Add tags that carry extra meaning.
  • Don’t shove everything in the prompt — pick what matters.
  • Use reranking to sharpen what you send.

If you build retrieval systems without these layers, you leave a lot on the table: answers that feel plausible but are ungrounded, hallucinations that arise because your model never saw the right piece, or worse, wrong information delivered to users with confidence.

If there’s one thing I hope you take away: more tokens don’t guarantee better answers. What matters is making sure the model sees the right tokens, in the right order, with supporting signals to guide it.

I write regularly about LLM internals, agent orchestration, RAG pipelines, and enterprise architecture — the systems side of AI. You’ll find all my articles here on Medium, through the newsletter, and collected at hammadabbasi.com/blogs.





