We Were Fast, But Not Fast Enough
Our platform was already "optimized" by most standards. Kubernetes. Istio service mesh. Horizontal autoscaling. Tuned JVM heap sizes. We could push well over a thousand transactions per second with respectable median latency.
But the P95 told a different story.
Every few hundred milliseconds, a request would take three, four, sometimes ten times longer than the median. On a dashboard, those spikes look like noise. In production, they cascade. A slow request holds a thread. A held thread starves the pool. A starved pool triggers upstream timeouts. Timeouts trigger retries. Retries amplify load. Suddenly your "one slow request" has become a system-wide latency event.
When you operate a platform where transactions are measured in single-digit milliseconds and your SLAs leave no room for percentile bloat, you can't dismiss tail latency as statistical variance. You have to hunt it down.
We spent weeks profiling. Query plans were clean. Connection pools were sized correctly. Garbage collection pauses were minimal. The application code wasn't the problem.
The routing layer was.
Why Random Routing Fails at Scale
Most distributed systems use some variant of random load balancing — round-robin, least-connections, power-of-two-choices, or weighted random. The underlying assumption is that all backend nodes are interchangeable. Send any request to any node, and performance should be roughly the same.
That assumption holds when your system is small or when your data access patterns are uniform. It breaks the moment either condition changes.
Here is what we observed under load:
Cache miss amplification. Each application node maintains a local cache — recent lookups, connection metadata, session state. When requests are distributed randomly, a given entity's requests scatter across every node in the cluster. No single node builds a warm cache for that entity. Every request has a meaningful probability of hitting a cold cache path, which means a round trip to the database instead of a sub-millisecond local lookup.
At a thousand TPS across 20 nodes, a cache hit ratio that drops from 95% to 70% because of random scattering adds roughly 250 database round trips per second (1,000 TPS × the 25-point drop in hit rate). Each one adds 2–5ms. That's your P95 spike.
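To make the scattering effect concrete, here is a toy Java simulation (illustrative only: the 20-node cluster, per-node LRU capacity, key count, and skew are assumptions, not our production parameters). It replays the same skewed request stream through random routing and through key-affinity routing, and compares cache hit ratios:

```java
import java.util.*;

public final class RoutingSim {
    public static void main(String[] args) {
        int nodes = 20, cacheSize = 5_000, keys = 100_000, requests = 1_000_000;
        Random rnd = new Random(42);
        List<LinkedHashSet<Integer>> randomCaches = freshCaches(nodes);
        List<LinkedHashSet<Integer>> affinityCaches = freshCaches(nodes);
        int randomHits = 0, affinityHits = 0;
        for (int i = 0; i < requests; i++) {
            // Skewed popularity: multiplying two uniform draws favors low-numbered keys.
            int key = (int) (rnd.nextDouble() * rnd.nextDouble() * keys);
            randomHits += touch(randomCaches.get(rnd.nextInt(nodes)), key, cacheSize);
            affinityHits += touch(affinityCaches.get(key % nodes), key, cacheSize);
        }
        System.out.printf("random routing hit ratio:   %.1f%%%n", 100.0 * randomHits / requests);
        System.out.printf("affinity routing hit ratio: %.1f%%%n", 100.0 * affinityHits / requests);
    }

    private static List<LinkedHashSet<Integer>> freshCaches(int n) {
        List<LinkedHashSet<Integer>> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(new LinkedHashSet<>());
        return list;
    }

    // LRU touch: re-inserting on a hit moves the key to the "recent" end;
    // on a miss, evict the oldest entry once the cache is full.
    private static int touch(LinkedHashSet<Integer> cache, int key, int max) {
        if (cache.remove(key)) { cache.add(key); return 1; }
        if (cache.size() >= max) cache.remove(cache.iterator().next());
        cache.add(key);
        return 0;
    }
}
```

With affinity routing, each node's working set shrinks to a twentieth of the key space and fits in its cache; with random routing, every node churns over the full key space.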
Connection pool thrashing. In a partitioned database (we use a distributed SQL store), each application node maintains connection pools to multiple database nodes. With random routing, every application node needs active connections to every database node. Most of those connections sit idle most of the time, but they consume memory, file descriptors, and TCP keepalive overhead. Meanwhile, the few connections that are actually hot get contended under burst traffic.
The result is a paradox: you have hundreds of open connections, but the ones you need are busy.
Statistical clustering under load. Random doesn't mean uniform, especially over short time windows. At high TPS, random routing creates momentary hot spots — two or three requests for the same partition landing on the same node within the same millisecond. These micro-bursts cause queuing at the node level, and queuing is the single fastest way to inflate tail latency.
The core insight was this: the routing layer and the data layer were making independent decisions. The router didn't know where the data lived. The database didn't know where the requests were coming from. That misalignment was where latency was hiding.

The Insight: Align Routing With Data Locality
The fix was conceptually simple: make the routing layer aware of the data topology.
Instead of distributing requests randomly across nodes, we introduced deterministic node affinity. The routing logic computes a hash of the request's primary key — the stable identifier that uniquely ties a transaction to a data partition — and uses modular arithmetic to map it to a specific application node:
```
target_node = hash(primary_key) % total_nodes
```
Every request for the same entity always lands on the same node.
This is not application-level sharding. The application remains stateless. Any node can handle any request — the routing preference is an optimization, not a hard constraint. If the target node is down, the request falls back to the next node in a consistent hash ring. But under normal operation, the routing layer ensures that requests for a given entity converge on a single node.
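As a sketch, the routing rule plus the consistent-hash fallback might look like this in Java (the class and method names, virtual-node count, and MD5-based hash are illustrative assumptions, not our production code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public final class AffinityRouter {
    private static final int VNODES = 128; // virtual nodes smooth the key distribution
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Set<String> healthy = new HashSet<>();

    public AffinityRouter(Collection<String> nodes) {
        for (String node : nodes) {
            healthy.add(node);
            for (int v = 0; v < VNODES; v++) ring.put(hash(node + "#" + v), node);
        }
    }

    /** Preferred node for a key; walks clockwise past unhealthy nodes. */
    public String route(String primaryKey) {
        long h = hash(primaryKey);
        for (String node : ring.tailMap(h).values())
            if (healthy.contains(node)) return node;
        for (String node : ring.values())          // wrap around the ring
            if (healthy.contains(node)) return node;
        throw new IllegalStateException("no healthy nodes");
    }

    public void markDown(String node) { healthy.remove(node); }
    public void markUp(String node)   { healthy.add(node); }

    // First 8 bytes of MD5 as a signed long; any well-mixed hash works here.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);
            return h;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always present in the JDK
        }
    }
}
```

The virtual nodes keep keys evenly spread across the ring, and the clockwise walk is what makes the fallback deterministic: the same failed node always spills to the same neighbors.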
Why does this work so well for transactional platforms? Because transactions are keyed on stable identifiers — account IDs, payment references, customer tokens. These keys don't change mid-flight. They're known at the ingress boundary before the request even reaches the application layer. That makes hash-based affinity both predictable and cheap to compute.
The critical distinction: we didn't shard the application — we sharded the routing. The application topology stayed flat. Deployments stayed simple. We just made the path from ingress to data more direct.
How We Built It
The implementation lives at the ingress layer — the point where external requests enter the platform. This is where routing decisions have the most leverage, because every downstream hop inherits the initial placement.
Hash computation at ingress. The ingress controller extracts the primary key from the request (typically from a header or path parameter) and computes a consistent hash. This happens before the request touches any application pod. The overhead is negligible — a single hash operation adds microseconds, not milliseconds.
If you're running Istio, you can get most of this with a native DestinationRule — no custom ingress code required:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: transaction-service
spec:
  host: transaction-service
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-routing-key
```
Ensure the routing key (account ID, payment reference, or a hash of it) is available as a header. If the key lives in the path or request body, inject it via an EnvoyFilter (Lua or Wasm) at the gateway. Under normal conditions, the same key always lands on the same pod. If the preferred pod is unavailable, Istio falls back gracefully. Consistent hashing minimizes reshuffling when pods are added or removed — only ~1/N of keys migrate.
Worker pool affinity. Behind the ingress layer, we structured the application's worker threads with node-level affinity. Each worker thread is mapped to a subset of the total node pool. When a request arrives, it is dispatched to the worker thread that "owns" the target node's partition range. This means the thread already has warm caches and active connections for that partition.
This is the part that's easy to underestimate. Thread-level affinity isn't just a performance optimization — it's a contention elimination strategy. When threads don't share partition responsibility, they don't compete for the same cache entries, the same connection pool slots, or the same mutex locks.
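A minimal sketch of that dispatch shape (an assumed structure, not our actual worker implementation): one single-threaded executor per partition slice, so all requests for a slice serialize onto a thread whose caches and connections stay warm, and no per-partition state ever needs a lock.

```java
import java.util.concurrent.*;

public final class PartitionedExecutor {
    private final ExecutorService[] workers;

    public PartitionedExecutor(int partitions) {
        workers = new ExecutorService[partitions];
        for (int i = 0; i < partitions; i++) {
            // One thread per partition slice: per-partition caches and
            // connection handles are only ever touched by this thread.
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** Every task for the same key lands on the same thread, in arrival order. */
    public Future<?> submit(String primaryKey, Runnable task) {
        int slot = Math.floorMod(primaryKey.hashCode(), workers.length);
        return workers[slot].submit(task);
    }

    public void shutdown() {
        for (ExecutorService w : workers) w.shutdown();
    }
}
```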
Integration with Kubernetes topology. We use node labels and pod anti-affinity rules to ensure that the application's logical node assignments align with physical pod placement. This prevents a scenario where two logically distinct partitions end up on the same physical node, sharing CPU and memory resources in ways that reintroduce the variance we're trying to eliminate.
Handling edge cases. Three scenarios required careful design:
Rolling deployments. When pods are replaced during a deployment, the node count temporarily changes. We use a consistent hashing ring rather than raw modular arithmetic, so that adding or removing a node only redistributes a minimal fraction of keys (roughly 1/N of the total).
Node failures. When a node becomes unavailable, its partition range is redistributed to adjacent nodes in the hash ring. Because the reassignment is deterministic, the receiving nodes immediately start building warm caches for the migrated keys. Recovery is fast because it's targeted — only the affected partition's traffic moves, not the entire cluster's.

Skewed keys. Some entities generate disproportionate traffic. We maintain a small hot-key table at the ingress layer that overrides the hash function for known high-volume keys, distributing them across multiple nodes with a secondary hash. This adds complexity, but it prevents a single hot key from destabilizing an entire node.
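A sketch of that overflow path, reusing the AffinityRouter ring from the earlier sketch (the names and the shape of the fanout table are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public final class HotKeyRouter {
    private final AffinityRouter ring;               // ring from the earlier sketch
    private final Map<String, Integer> hotKeyFanout; // hot key -> node spread

    public HotKeyRouter(AffinityRouter ring, Map<String, Integer> hotKeyFanout) {
        this.ring = ring;
        this.hotKeyFanout = hotKeyFanout;
    }

    public String route(String primaryKey) {
        Integer fanout = hotKeyFanout.get(primaryKey);
        if (fanout == null) return ring.route(primaryKey); // normal affinity path
        // Secondary hash: a small salt spreads the hot key's traffic across
        // `fanout` distinct ring positions instead of one pinned node.
        int salt = ThreadLocalRandom.current().nextInt(fanout);
        return ring.route(primaryKey + "#" + salt);
    }
}
```

Sacrificing cache locality for the handful of hottest keys is the right trade: those keys are hot enough to stay cached everywhere anyway.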
The Second Hop: Aligning With the Data Layer
Everything described so far operates at the application tier. The ingress controller routes requests to application pods. The application pods serve responses. This is the right scope for the ingress layer — it should know about pods, not about databases.
But there's a second routing decision happening downstream, and if you ignore it, you leave latency on the table.
Modern distributed SQL systems like CockroachDB don't use static partition-to-node mappings. They split data into ranges (typically ~64MB each) and assign each range a leaseholder — the single replica node that serves all consistent reads and coordinates all writes. The leaseholder is the node that matters. If your application pod's database query reaches the leaseholder directly, the read resolves locally. If it reaches a different replica, the query bounces across the cluster — an extra network hop that can easily add 2–5ms.
This is a separate problem from application-tier routing, and it requires a separate solution. Critically, it does not belong at the ingress layer.

Why the ingress layer should not own leaseholder awareness. It's tempting to extend the ingress hash to account for database topology — "just route requests to the pod closest to the leaseholder." But this violates separation of concerns in ways that hurt operationally:
The ingress controller's job is HTTP/gRPC routing to pods. It knows about hostnames, headers, paths, TLS, and load balancing policies. The moment you ask it to also track database range metadata, leaseholder placement, and split events, you've coupled two independent systems. A database rebalance event at 2 AM would silently invalidate your ingress routing assumptions. A range split would require the ingress layer to react in near-real-time to database internals it was never designed to understand.
You'd also need every ingress replica to maintain a synchronized metadata cache of range-to-leaseholder mappings — adding complexity, potential inconsistency across replicas, and a new failure mode where stale metadata degrades performance worse than no optimization at all.
The right boundary: two layers, two owners.
Layer 1: Ingress → Application pod. The ingress controller hashes the primary key and routes to a stable application pod. This gives you cache locality, connection pool reuse, and thread affinity. The ingress knows nothing about the database. This layer is simple, static, and changes only when pods scale.
Layer 2: Application pod → Database node. The application layer owns leaseholder awareness. The database client inside each pod maintains a lightweight metadata cache that tracks which database node currently holds the leaseholder for each key range. When the application issues a SQL query, it preferentially routes through the gateway node co-located with the leaseholder — using the database driver's built-in node locality preferences.
This is a natural extension of the database client's responsibility. The application already holds connection pools. It already knows the query's key. The database driver already has mechanisms for gateway node selection. Leaseholder-aware routing at this layer is configuration, not custom infrastructure.
Making leaseholder placement predictable. Distributed SQL systems let you influence leaseholder placement through zone configurations. You can set lease preferences to pin leaseholders for specific tables or index ranges to specific regions or availability zones:
```sql
ALTER TABLE transactions CONFIGURE ZONE USING
  lease_preferences = '[[+region=us-west-2a]]';
```
This makes the application's gateway routing stable — if leaseholders for partition range A are always in us-west-2a, the application pod in that zone routes queries to the local gateway node with confidence.
Without lease pinning, the database engine rebalances leaseholders autonomously based on load and replica health. That rebalancing is good for database-level self-healing, but it introduces drift between the application's routing expectations and the actual leaseholder location. Lease preferences give you a stable baseline; the database client's metadata cache handles the delta.
Handling range splits and rebalancing. Distributed SQL systems dynamically split ranges as data grows. When a range splits, a key space that was served by one leaseholder is now served by two. The application's database client detects this transparently — a query that hits a non-leaseholder returns a redirect, and the client updates its internal routing table. This is the database driver's built-in behavior, not something you build yourself.
For the rare case where stale metadata causes a query to reach the wrong node, the penalty is one extra hop — not a system failure. The client corrects on the next attempt. This self-healing property is why leaseholder awareness belongs in the database client, not in a custom metadata cache bolted onto the ingress layer.
Follower reads as a strategic bypass. Not every query needs to hit the leaseholder. For read-heavy workloads where slight staleness is acceptable (typically a few seconds behind), distributed SQL systems offer follower reads — any replica can serve the read without coordinating with the leaseholder:
```sql
-- Exact staleness: uses closest replica automatically
SELECT * FROM transactions
AS OF SYSTEM TIME follower_read_timestamp()
WHERE account_id = ?;

-- Bounded staleness: guarantees freshness within window
SELECT * FROM transactions
AS OF SYSTEM TIME with_max_staleness('5s')
WHERE account_id = ?;
```
This means your leaseholder routing optimization only needs to be precise for writes and strongly-consistent reads. For everything else, any replica works.
We split our strategy accordingly: leaseholder-aware gateway routing for writes and consistent reads, and locality-based round-robin to the nearest replica for follower reads. The exact split depends on your workload's read/write ratio and consistency requirements — there's no universal number.
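In application code, the split can be as simple as choosing the statement form per call site. A minimal JDBC sketch, assuming a CockroachDB cluster behind the standard PostgreSQL wire driver and the transactions table from the earlier examples:

```java
import java.sql.*;

final class TransactionReader {
    private final Connection conn; // pooled connection to the local gateway node

    TransactionReader(Connection conn) { this.conn = conn; }

    /** staleOk = true routes through follower reads (any nearby replica);
     *  false takes the ordinary, leaseholder-coordinated path. */
    ResultSet read(String accountId, boolean staleOk) throws SQLException {
        String sql = staleOk
            ? "SELECT * FROM transactions AS OF SYSTEM TIME follower_read_timestamp() WHERE account_id = ?"
            : "SELECT * FROM transactions WHERE account_id = ?";
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setString(1, accountId);
        return ps.executeQuery();
    }
}
```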
The takeaway: two layers of affinity, clean separation. The ingress hash gives you application-level affinity — cache locality, connection reuse, thread affinity. The database client gives you data-level affinity — minimizing the hop between your query and the node that holds the data. Each layer optimizes within its own scope. Neither needs to know about the other's internals. This separation means you can change your database topology without touching ingress, and change your pod topology without worrying about leaseholder alignment.
The Results
We rolled this out incrementally — first to a single region, then globally — measuring at every stage.

~29% P95 latency reduction. The primary driver was cache hit ratio improvement. With deterministic routing, each node's local cache became highly effective for its assigned partition range. Cache hit rates went from roughly 70% under random routing to over 94% with affinity. That eliminated hundreds of database round trips per second.
~2.1× throughput improvement. Same pod count, same infrastructure. The efficiency gains came from three sources: fewer cache misses, less connection pool contention, and reduced thread-level mutex contention. Each node was doing less redundant work, which meant more useful work per CPU cycle.
~42% faster failover recovery. This was the result we didn't fully anticipate. Under random routing, when a node failed, the system had to redistribute traffic globally and rebuild caches everywhere. Under deterministic routing, only the failed node's partition range needed redistribution. The receiving nodes started with empty caches for those keys, but they were only rebuilding a fraction of the total working set. Recovery time dropped from minutes to tens of seconds.
~$1.2M annualized infrastructure cost savings. Higher utilization per node meant we needed fewer nodes to meet the same SLA. We retired roughly 30% of our over-provisioned capacity without degrading any performance metric. In a cloud environment where every pod-hour costs money, that adds up fast.
These are production numbers under real transactional traffic, not synthetic benchmarks. One important caveat: these gains came from a workload with strong per-entity locality and cacheable data. Uniform access patterns or heavily join-based workloads will see smaller improvements. Your mileage depends on key distribution, working set size relative to per-node cache, and read/write ratio.
The Tradeoffs: What This Costs You
Deterministic routing isn't free. Here's what you give up:
Increased coupling between routing and key schema. The ingress layer now needs to know which request field contains the primary key and how to extract it. Not every request has a single, stable key at the ingress boundary — multi-entity transactions, batch operations, and fan-out queries don't map cleanly to one hash. You need to define which request paths get affinity routing and which fall back to standard load balancing.
Uneven load under key skew. If some primary keys generate significantly more traffic than others, deterministic routing concentrates that traffic on specific nodes instead of spreading it. In practice, key distributions often follow power laws — a small percentage of accounts or entities drive a large share of traffic. You need monitoring that detects per-node load imbalance and an overflow mechanism (like the hot-key table mentioned above) to redistribute outliers. Without this, you trade random variance for deterministic hot spots, which can be worse for tail latency.
More complex capacity planning. You can't simply "add a pod" and expect linear scaling. Adding a node changes the hash ring, which triggers a redistribution of partition assignments. That redistribution temporarily reduces cache hit rates across all affected nodes. Capacity changes need to be planned and executed during low-traffic windows, with a warm-up period factored in.
Harder operational reasoning. With random routing, every node looks the same in monitoring dashboards. With deterministic routing, each node has a distinct traffic profile. Your alerting, dashboards, and runbooks need to account for per-partition metrics, not just per-node averages. "Why is node X slow?" now requires tracing keys to partitions to understand whether it's a routing problem, a data skew problem, or a leaseholder drift at the database layer.
Leaseholder drift at the database tier. Even with lease preferences pinning leaseholders to predictable locations, the database can move leaseholders autonomously during rebalancing or node failures. The database client handles this transparently — but during the drift window, queries take an extra hop. If you're not monitoring gateway node hit rates in your database connection metrics, you won't notice the degradation until it shows up in application P95.
None of these tradeoffs were dealbreakers for us, but they're real. If you adopt this pattern, go in with your eyes open.
When to Use This (and When Not To)
Deterministic routing pays off when three conditions are met:
Stable request-to-data affinity. The primary key used for routing must be known at the ingress boundary and must be stable for the duration of the request's lifecycle. Transactional systems with account-based or entity-based keys are ideal candidates. Systems where the relevant data partition isn't known until deep in the processing pipeline are not.
Tail latency matters more than average latency. If your SLA is defined by P50 and you have generous timeout budgets, random routing is probably fine. If your SLA is defined by P95 or P99, and every millisecond of tail latency compounds into retry storms and timeout cascades, deterministic routing's variance reduction is worth the complexity.
Scale where cache and connection efficiency compounds. At 50 TPS, the difference between a 70% and 94% cache hit rate is a few extra database calls per second — barely noticeable. At 1500+ TPS, it's the difference between a system that hums and one that periodically falls over. The ROI of deterministic routing scales with traffic volume.
When to skip it: If your data access is truly random (analytics workloads, ad-hoc queries), if your system is small enough that every node can hold the entire working set in cache, or if your SLAs are generous enough that tail latency variance is within tolerance — don't add the complexity. Random routing is simpler to operate, simpler to reason about, and perfectly adequate for many systems.
Consider simpler alternatives first. Before committing to full deterministic routing, evaluate whether lighter-weight approaches close the gap: Istio consistent hashing alone (often the single highest-ROI change), power-of-two-choices or latency-aware balancing in Envoy, larger per-pod caches or a shared cache tier (Redis), or more aggressive use of follower reads. Deterministic routing compounds on top of these — it doesn't replace them.
Implementation Checklist
For teams ready to adopt this pattern:
- Inject a stable x-routing-key header early in the request path (gateway or EnvoyFilter).
- Apply an Istio DestinationRule with consistentHash on that header.
- Configure CockroachDB zone lease_preferences for high-traffic tables.
- Enable follower_read_timestamp() or bounded staleness for eligible read paths.
- Add monitoring: per-pod cache hit rates, per-key load distribution, leaseholder gateway hit rates.
- Test under stress: scaling events, pod failures, hot keys, rolling deployments, and mixed workloads.
Latency Isn't a Tuning Problem. It's a Routing Problem.
Most teams chase latency by optimizing what happens after the request arrives — faster queries, bigger caches, more threads, better garbage collection tuning. Those optimizations matter, but they're downstream of a more fundamental decision: where the request lands in the first place.
If the routing layer sends a request to a node that doesn't have the relevant data cached, doesn't have a warm connection to the right database partition, and is already contending with unrelated requests for different partitions — no amount of application-level tuning will recover those lost milliseconds. And if that request then bounces across the database cluster because it landed on a replica instead of the leaseholder — you've paid for the hop twice.
The fastest request is the one that never had to cross a boundary it didn't need to — at the application layer or the data layer.
Deterministic routing isn't a silver bullet. It requires operational investment, careful monitoring, leaseholder alignment, and a willingness to accept coupling between your routing and data layers. But for systems where tail latency is the difference between meeting your SLA and triggering a cascade, it's one of the highest-leverage architectural changes you can make.
Sometimes the biggest performance win isn't deeper in the stack. It's at the front door — and the door behind it.