This content originally appeared on DEV Community and was authored by PSBigBig
most teams flip on a shiny reranker and the offline chart jumps. then real traffic arrives and the lift melts. if the base space is unhealthy, a reranker only hides the pain. this writeup is the minimal path to prove that, fix the base, then keep reranking as light polish.
a quick story to set context
we had a product faq bot. cross-encoder reranker looked great on 30 handpicked questions. in prod, small paraphrases flipped answers. reading traces showed citations pointed to generic intros, not the exact span. turning off rerank exposed the truth. the raw top-k almost never covered the right section. geometry was wrong. chunks were messy. we were living in No.5 and occasionally No.6 when synthesis tried to “fill in” gaps.
60 second ablation that tells you the truth
run the same question twice
1.1 retriever only
1.2 retriever then reranker
record three numbers
coverage of the target section in top-k
ΔS(question, retrieved)
citations per atomic claim
label
low coverage without rerank that “magically” improves only after rerank → No.5 Semantic ≠ Embedding
coverage ok but prose still drifts or merges extra claims → No.6 Logic Collapse
stability
ask three paraphrases. if labels or answers alternate, the chain is unstable. reranker is masking the base failure.
rules of thumb
coverage before rerank ≥ 0.70
ΔS ≤ 0.45 for stable chains
one valid citation per atomic claim
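here is a minimal sketch of that ablation loop, assuming your stack exposes retrieve, rerank, and embed functions and that candidates carry id and text fields. the names are illustrative, and delta_s is the probe from the utilities section below.

def ablate(question, target_ids, retrieve, rerank, embed, k=10):
    # 1.1 retriever only, 1.2 retriever then reranker
    base = retrieve(question, k=k)
    rr = rerank(question, base)
    denom = max(1, min(k, len(target_ids)))
    cov_base = sum(1 for c in base[:k] if c["id"] in target_ids) / denom
    cov_rr = sum(1 for c in rr[:k] if c["id"] in target_ids) / denom
    # ΔS between the question and what the base retriever actually returned
    q_vec = embed(question)
    r_vec = embed(" ".join(c["text"] for c in base[:k]))
    return {"coverage_base": cov_base, "coverage_rr": cov_rr,
            "delta_s": delta_s(q_vec, r_vec)}

run it on three paraphrases and compare against the thresholds above. if coverage_base is low but coverage_rr looks fine, you are in No.5 territory.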
what overreliance looks like in traces
- base top-k rarely contains the true span. reranker promotes “sounds right” text
- small header or boilerplate chunks dominate retrieval candidates
- cosine vs L2 setup is mixed across shards. norms inconsistent
- offline tables show nice MRR but human readers cannot match citations to spans
- with rerank off, answers alternate across runs on paraphrases
- model “repairs” missing evidence instead of pausing for it
root causes to check first
- metric and normalization mismatch between corpus and queries
- chunking to embedding contract missing. no stable snippet id, section id, offsets
- vectorstore fragmentation. near-duplicates split the same fact across ids
- reranker objective favors generic summaries over tight claim-aligned spans
- eval set is tiny and biased toward reranker behavior
minimal fix path
goal: make the base space trustworthy, then keep reranking as a gentle, auditable layer.
- align metric and normalization. keep one metric policy across build and query. for cosine-style retrieval, L2-normalize both sides and use a consistent index.
from sklearn.preprocessing import normalize
Z = normalize(Z, axis=1).astype("float32") # corpus
Q = normalize(Q, axis=1).astype("float32") # queries
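if your store is faiss, a minimal sketch of a single-metric setup, assuming the normalized Z and Q above; with L2-normalized vectors, inner product equals cosine.

import faiss

d = Z.shape[1]
index = faiss.IndexFlatIP(d)       # inner product over normalized vectors == cosine
index.add(Z)                       # corpus
scores, ids = index.search(Q, 10)  # top-10 ids per query, one metric everywhere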
- enforce the chunk → embed contract. mask boilerplate, keep window sizes consistent with your model, and emit snippet_id, section_id, offsets, tokens.
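a minimal sketch of one such record, assuming your own parser supplies section ids and character offsets; the tokenizer argument is a stand-in for whatever your embedding model uses.

def make_chunk_record(doc_id, section_id, idx, text, start, end, tokenizer):
    # stable ids and offsets so every citation can be traced back to an exact span
    return {
        "snippet_id": f"{doc_id}:{section_id}:{idx}",
        "section_id": section_id,
        "offsets": (start, end),               # character offsets in the source section
        "tokens": len(tokenizer.encode(text)), # assumption: tokenizer matches the embed model
        "text": text,
    }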
- add a coverage gate before rerank. if base coverage is below 0.70, do not rerank. return a short bridge plan that asks for a better retrieval pass or more context.
def coverage_ok(candidates, target_ids, k=10, th=0.70):
    # fraction of the target section's ids that show up in the base top-k
    hits = sum(1 for i in candidates[:k] if i in target_ids)
    denom = max(1, min(k, len(target_ids)))
    return hits / float(denom) >= th
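wired into the pipeline it could look like the fragment below; the bridge payload shape is an assumption, not a fixed schema.

if not coverage_ok(candidate_ids, target_ids):
    return {"action": "bridge",
            "reason": "base coverage below 0.70",
            "ask": "widen retrieval or fetch more context before reranking"}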
- lock cite-then-explain. fail fast when any claim lacks in-scope citations.
def per_claim_ok(payload, allowed):
    # a claim is bad if it has no citations or cites ids outside the allowed set
    bad = [i for i, c in enumerate(payload)
           if not c.get("citations") or not set(c["citations"]) <= set(allowed)]
    return {"ok": not bad, "bad_claims": bad}
- keep reranking for span alignment only. prefer claim-aligned spans over generic summaries. record rerank scores next to citations for auditing.
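a minimal sketch of that audit trail, assuming the reranker hands back (snippet_id, score) pairs.

def attach_rerank_scores(candidates, rerank_scores):
    # keep the score next to each candidate so a reader can see why a span was promoted
    by_id = dict(rerank_scores)
    return [{**c, "rerank_score": by_id.get(c["snippet_id"])} for c in candidates]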
when minimal is not enough
- rebuild the index from clean embeddings with a single metric policy
- retrain IVF or PQ codebooks after dedup and boilerplate masking
- collapse near-duplicates before indexing
- add a sparse leg and fuse simply when exact terms matter (a fusion sketch follows this list)
- if you must cross-encode, cap its influence and keep the base candidate set healthy
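for the sparse leg, reciprocal rank fusion is one simple way to fuse; a sketch, assuming both legs return ranked lists of ids.

def rrf_fuse(dense_ids, sparse_ids, k=60):
    # reciprocal rank fusion: each leg contributes 1 / (k + rank) per id
    scores = {}
    for ids in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)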
tiny utilities you can paste
base vs rerank lift
def lift_at_k(gt_ids, base_ids, rr_ids, k=10):
    base_hit = int(any(x in gt_ids for x in base_ids[:k]))
    rr_hit = int(any(x in gt_ids for x in rr_ids[:k]))
    return {"base_hit": base_hit, "rr_hit": rr_hit, "lift": rr_hit - base_hit}
neighbor overlap sanity
def overlap_at_k(a_ids, b_ids, k=20):
    a, b = set(a_ids[:k]), set(b_ids[:k])
    return len(a & b) / float(k)  # healthy spaces sit well below 0.35
minimal ΔS probe
import numpy as np

def delta_s(q, r):
    q = q / np.linalg.norm(q)
    r = r / np.linalg.norm(r)
    return float(1.0 - np.dot(q, r))
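for r you could, for example, average the embeddings of the retrieved chunks or embed the concatenated top-k text; either works as long as you keep one convention across runs.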
acceptance before you call it fixed
- base top-k covers the target section at 0.70 or higher
- ΔS at or below 0.45 across three paraphrases
- every claim has an in-scope citation id
- reranker provides positive lift without being required for correctness
tldr
rerankers are polish, not crutches. fix metric and normalization, fix chunk contracts, demand coverage and citations, then let the reranker nudge spans into place. call it No.5 when geometry is wrong, and No.6 when synthesis still drifts after coverage is healthy.
full writeup and the rest of the series live here
Problem Map article series