Embeddings bunch into a skinny cone, neighbors look the same for every query, recall craters. This is the classic space collapse that hides real semantics.
Problem types
- No.5 Semantic ≠ Embedding
- No.6 Logic Collapse and Recovery
What it looks like in practice
- Cosine similarity is high for almost everything, top-k lists barely change across different queries
- Near neighbors are dominated by boilerplate or global terms
- Recall@k on a held-out set drops after re-ingest or model swap
- IVF or HNSW shows busy lists with poor separation
60-second quick test
- Sample about 5k vectors from your store
- Compute per-dimension variance and the PCA explained variance ratio up to 50 components
- Plot PC1 and PC2, then check the cone
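A minimal sketch of that plot, assuming the 5k sample is saved as sample_embeddings.npy (the file name is a placeholder). A healthy space scatters as a roughly round blob; a collapsed one shows a narrow wedge.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.load("sample_embeddings.npy")  # ~5k sampled vectors, shape [N, d]
P = PCA(n_components=2).fit_transform(X - X.mean(axis=0, keepdims=True))

plt.scatter(P[:, 0], P[:, 1], s=2, alpha=0.3)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Cone check: a narrow wedge means collapsed geometry")
plt.savefig("cone_check.png")
```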
Rules of thumb
- Red flag if PC1 explained variance is above 0.70, or if the per-dimension variance has a Gini above 0.6
- Also bad if the median cosine to the centroid is above 0.55
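For the Gini flag, a quick check looks like the sketch below. The standard mean-absolute-difference definition of Gini, applied to per-dimension variances, is an assumption here; the article's exact metric may differ.

```python
import numpy as np

X = np.load("sample_embeddings.npy")  # file name as in the quick test above
v = np.sort(X.var(axis=0))            # per-dimension variance, ascending

# Gini via the sorted-values identity: sum((2i - n - 1) * v_i) / (n * sum(v))
n = v.size
gini = ((2 * np.arange(1, n + 1) - n - 1) * v).sum() / (n * v.sum())
print("variance Gini:", float(gini))  # red flag above 0.6
```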
Common root causes
- Model change without re-whitening
- Mixed normalization states: some vectors L2-normalized and some not (see the norm-check sketch after this list)
- Truncation or unicode normalization bugs in the text pipeline
- Over-aggressive stopword removal or duplicate boilerplate in documents
- FAISS metric set to inner product while vectors were already L2 normalized for cosine
- Shard trained on one domain, then queried with another
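Mixed normalization states in particular are cheap to detect before touching anything else. A minimal sketch, assuming an all_embeddings.npy dump; the unit-norm tolerance is an arbitrary choice.

```python
import numpy as np

X = np.load("all_embeddings.npy")
norms = np.linalg.norm(X, axis=1)
print("norm min / median / max:", norms.min(), np.median(norms), norms.max())

# A fraction strictly between 0 and 1 means shards were ingested in mixed states
frac_unit = float(np.mean(np.abs(norms - 1.0) < 1e-3))
print("fraction ~unit-norm:", frac_unit)
```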
Minimal fix
Goal is to keep geometry isotropic enough for cosine to carry meaning.
1. Mean-center, then L2-normalize all vectors again.
2. Whiten with a small rank:
   - Fit PCA on a random subset of around 50k vectors; pick r so that cumulative EVR sits between 0.90 and 0.98.
   - Transform, then L2-normalize again.
3. Rebuild the index with a metric that matches the vector state:
   - For cosine, use an L2 index with normalized vectors.
   - For inner product, use IP and avoid double normalization.
4. Trash and re-ingest any mixed-state shards. Do not patch in place.
You usually see recall recover right away after steps 1 and 3.
Harder fixes if the minimal path is not enough
- Domain de-duplication and boilerplate masking before embedding. Keep per-doc tf-idf masks or a learned salience score to damp headers, nav, and legal text
- Subspace drop. If PC1 to PCk are topic or style axes, drop the k components where EVR spikes, then renormalize (see the sketch after this list)
- Temperature up sampling for rare intents so neighbors reflect intent not frequency
- Metric sanity. Cosine with normalized vectors is often safer than IP on mixed magnitudes
- FAISS hygiene. Retrain IVF or PQ codebooks after geometry changes, avoid reusing old centroids
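A minimal sketch of the subspace drop, in the spirit of the all-but-the-top post-processing trick. The cutoff k = 3 and the file names are placeholders; inspect the EVR curve first and pick k where it spikes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

X = np.load("all_embeddings.npy")
Xc = X - X.mean(axis=0, keepdims=True)

k = 3  # placeholder: choose where the EVR curve spikes
p = PCA(n_components=k).fit(Xc)

# Project out the top-k axes: x <- x - sum_j (x . u_j) u_j
X_drop = Xc - (Xc @ p.components_.T) @ p.components_
X_drop = normalize(X_drop, norm="l2", axis=1)
np.save("embeddings_subspace_dropped.npy", X_drop)
```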
WFGY guardrails that help
- BBMC residual checks that flag geometry drift and trigger a re-whiten when the residue grows
- BBPF multi-path retrieval to avoid single-cone collapse during query expansion
- BBCR, a bridge step for when the chain stalls on near-duplicates
- BBAM attention damping that prevents one-token hijacks in long answers
Tiny scripts you can paste
Variance and cone checks
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

X = np.load("sample_embeddings.npy")  # shape [N, d]
Xn = normalize(X, norm="l2", axis=1)

# Cone check on the normalized vectors, before any centering
# (centering would erase the very cone we are trying to measure)
c = Xn.mean(axis=0, keepdims=True)
c = c / np.linalg.norm(c)  # unit centroid, so the dot product is a true cosine
cos = (Xn @ c.T).ravel()
print("median cos to centroid:", float(np.median(cos)))

# Variance check: PCA on the mean-centered, normalized vectors
Xc = Xn - Xn.mean(axis=0, keepdims=True)
p = PCA(n_components=min(50, Xc.shape[1])).fit(Xc)
evr = p.explained_variance_ratio_
print("PC1 EVR:", evr[0], "PC1..5 cum:", evr[:5].sum())
```
Whiten then renormalize
```python
import numpy as np
import joblib
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

X = np.load("all_embeddings.npy")
mu = X.mean(0, keepdims=True)
Xc = X - mu

# whiten=True rescales each kept component to unit variance, which is the
# actual whitening step; n_components=0.95 keeps ~95% cumulative EVR
p = PCA(n_components=0.95, svd_solver="full", whiten=True).fit(Xc)
Z = p.transform(Xc)
Z = normalize(Z, norm="l2", axis=1)

joblib.dump({"mu": mu, "pca": p}, "whitener.pkl")  # reuse on queries at search time
np.save("embeddings_whitened.npy", Z)
```
FAISS rebuild sketch
import faiss, numpy as np
Z = np.load("embeddings_whitened.npy").astype("float32") # L2 normalized
d = Z.shape[1]
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200
faiss.normalize_L2(Z)
index.add(Z)
faiss.write_index(index, "hnsw_cosine.faiss")
Acceptance checks before you declare it fixed
- PC1 EVR at or below 0.35 and PC1 to PC5 cumulative at or below 0.70
- Median cosine to centroid at or below 0.35 after renormalization
- Neighbor overlap rate across 20 random queries at or below 0.35 for k equal to 20
- Recall on a held-out set improves, and top-k results vary with the query
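The overlap number is easy to compute from the rebuilt index. A minimal sketch, assuming the hnsw_cosine.faiss index and whitened vectors from the scripts above; sampling stored vectors as stand-in queries and defining overlap as mean pairwise intersection over k are both assumptions.

```python
import numpy as np
import faiss
from itertools import combinations

index = faiss.read_index("hnsw_cosine.faiss")
Z = np.load("embeddings_whitened.npy").astype("float32")

rng = np.random.default_rng(0)
q = Z[rng.choice(len(Z), size=20, replace=False)].copy()
faiss.normalize_L2(q)

_, I = index.search(q, 20)  # top-20 neighbor ids per query
sets = [set(row) for row in I]
overlaps = [len(a & b) / 20 for a, b in combinations(sets, 2)]
print("mean neighbor overlap:", float(np.mean(overlaps)))  # flag above 0.35
```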
TL;DR
Cone geometry means the space collapsed. Re-center, whiten, renormalize, rebuild. Then re-check PC1 EVR and neighbor overlap. If your reasoning chain still stalls, label it No.6 and insert a bridge step with BBCR.
Series index
All articles in this Problem Map series live here → ProblemMap Articles Index