This content originally appeared on DEV Community and was authored by Mohit Agnihotri
This is a submission for the Redis AI Challenge: Real-Time AI Innovators.
What I Built
Latency Slayer is a tiny Rust reverse-proxy that sits in front of any LLM API.
It uses embeddings + vector search in Redis 8 to detect “repeat-ish” prompts and return a cached answer instantly. New prompts are answered once by the LLM and stored with per-field TTLs, so only the response expires while metadata persists.
Why it matters: dramatically lower latency and cost, with transparent drop-in integration for any chat or RAG app.
Core tricks
- Redis Query Engine + HNSW vectors (COSINE) to find semantically similar earlier prompts (index sketch below).
- Hash field expiration (`HSETEX`/`HGETEX`) so we can expire just the “response” field without deleting the whole hash.
- Redis Streams for real-time hit-rate & latency metrics, rendered in a tiny dashboard.
Demo
Screenshots:
How I Used Redis 8
- Vector search (HNSW, COSINE) on a HASH document that stores an embedding field (FP32, 1536-d from OpenAI text-embedding-3-small).
- Per-field TTL on hashes: `HSETEX` to set the response field and its TTL in a single step; `HGETEX` to read and optionally refresh TTLs. This gives granular cache lifetimes without deleting other fields (like usage or model metadata); see the sketch after this list.
- Redis Streams: `XADD analytics:cache` per request; the dashboard subscribes and renders hit rate, token savings, and latency deltas in real time.
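A hedged sketch of the lookup and store paths, again with redis-rs. The post uses the single-round-trip `HSETEX`/`HGETEX` commands; the sketch spells the same thing out with `HSET`/`HGET` plus `HEXPIRE` so each step is visible. The `idx:cache` index, the 0.15 distance threshold, and the one-hour TTL are illustrative assumptions.

```rust
use redis::{Connection, RedisResult, Value};

const RESP_TTL_SECS: i64 = 3600; // illustrative lifetime for the cached response field
const HIT_THRESHOLD: f32 = 0.15; // illustrative max cosine distance for a "repeat-ish" prompt

/// KNN-1 lookup: returns (fingerprint, distance) of the nearest previously seen prompt,
/// or None if nothing is indexed yet or the best match is too far away.
fn nearest_prompt(con: &mut Connection, embedding: &[f32]) -> RedisResult<Option<(String, f32)>> {
    let blob: Vec<u8> = embedding.iter().flat_map(|f| f.to_le_bytes()).collect();
    // Reply shape on a hit: [1, "vec:<fingerprint>", ["score", "<distance>"]]
    let reply: Vec<Value> = redis::cmd("FT.SEARCH")
        .arg("idx:cache")
        .arg("*=>[KNN 1 @embedding $vec AS score]")
        .arg("PARAMS").arg(2).arg("vec").arg(blob)
        .arg("RETURN").arg(1).arg("score")
        .arg("DIALECT").arg(2)
        .query(con)?;
    if reply.len() < 3 {
        return Ok(None);
    }
    let key: String = redis::from_redis_value(&reply[1])?;
    let fields: Vec<String> = redis::from_redis_value(&reply[2])?;
    let dist: f32 = fields.get(1).and_then(|s| s.parse().ok()).unwrap_or(f32::MAX);
    let fingerprint = key.trim_start_matches("vec:").to_string();
    Ok((dist <= HIT_THRESHOLD).then(|| (fingerprint, dist)))
}

/// Hit path: read the cached response and push the response field's TTL out again.
/// (The post's HGETEX does the read + TTL refresh in one command.)
fn read_hit(con: &mut Connection, fingerprint: &str) -> RedisResult<Option<String>> {
    let key = format!("cache:{fingerprint}");
    let resp: Option<String> = redis::cmd("HGET").arg(&key).arg("resp").query(con)?;
    if resp.is_some() {
        redis::cmd("HEXPIRE").arg(&key).arg(RESP_TTL_SECS)
            .arg("FIELDS").arg(1).arg("resp")
            .query::<Value>(con)?;
    }
    Ok(resp)
}

/// Miss path: store the fresh answer and give only `resp` an expiry, so metadata
/// like `usage` outlives the response itself. (The post's HSETEX does this in one step.)
fn store_miss(con: &mut Connection, fingerprint: &str, prompt: &str, resp: &str, usage: &str) -> RedisResult<()> {
    let key = format!("cache:{fingerprint}");
    redis::cmd("HSET").arg(&key)
        .arg("prompt").arg(prompt)
        .arg("resp").arg(resp)
        .arg("usage").arg(usage)
        .query::<i64>(con)?;
    redis::cmd("HEXPIRE").arg(&key).arg(RESP_TTL_SECS)
        .arg("FIELDS").arg(1).arg("resp")
        .query::<Value>(con)?;
    Ok(())
}
```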
Data model (simplified)
- `cache:{fingerprint}` → Hash fields: `prompt`, `resp`, `meta`, `usage`, `created_at` (with `resp` having its own TTL)
- `vec:{fingerprint}` → Vector field + tags (`model`, `route`, `user`)
- Stream: `analytics:cache` with `{event, hit, latency_ms, tokens_saved}` (sketch below)
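A small sketch of the stream side under the same assumptions: one `XADD` per proxied request using the field shape above, and a blocking `XREAD` the dashboard could use to tail new entries (the post only says the dashboard subscribes, so the consumer shown here is one possible way to do it).

```rust
use redis::{Connection, RedisResult, Value};

/// Append one metrics event per proxied request, matching the
/// {event, hit, latency_ms, tokens_saved} shape of the analytics:cache stream.
fn record_event(con: &mut Connection, hit: bool, latency_ms: u64, tokens_saved: u64) -> RedisResult<String> {
    redis::cmd("XADD")
        .arg("analytics:cache").arg("*")
        .arg("event").arg("lookup")
        .arg("hit").arg(hit as i64)
        .arg("latency_ms").arg(latency_ms)
        .arg("tokens_saved").arg(tokens_saved)
        .query(con) // returns the auto-generated entry ID
}

/// Dashboard side: block for up to 5s waiting for entries newer than `last_id`
/// (pass "$" to start from only-new entries).
fn poll_events(con: &mut Connection, last_id: &str) -> RedisResult<Value> {
    redis::cmd("XREAD")
        .arg("BLOCK").arg(5000)
        .arg("STREAMS").arg("analytics:cache").arg(last_id)
        .query(con)
}
```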
Why Redis 8?
- New field-level expiration commands on hashes make cache lifecycle clean and safe.
- New INT8 vector support keeps memory low and search fast.
- Battle-tested Streams/PubSub give us real-time observability with a tiny footprint.
What’s next
- Prefetch: predict likely next prompts and warm them proactively.
- Hybrid filters: combine vector similarity + tags (model/route) for stricter cache hits (query sketch after this list).
- Cold-start tuning: adapt hit threshold by route and user cohort.
- Quantization: currently storing FP32 vectors for simplicity; INT8 quantization is planned to lower memory use and speed up search.
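For the hybrid-filters item above, the Query Engine already lets a KNN clause be combined with a tag prefilter in a single query; a sketch of what that could look like, with the same assumed `idx:cache` index and illustrative tag values:

```rust
use redis::{Connection, RedisResult, Value};

/// KNN restricted to entries sharing the same model and route tags, so a cached
/// answer produced for one model/route is never served for another.
/// Tag values containing punctuation (e.g. "gpt-4o") need escaping in the query syntax.
fn nearest_with_filters(
    con: &mut Connection,
    embedding: &[f32],
    model: &str,
    route: &str,
) -> RedisResult<Value> {
    let blob: Vec<u8> = embedding.iter().flat_map(|f| f.to_le_bytes()).collect();
    let query = format!("(@model:{{{model}}} @route:{{{route}}})=>[KNN 1 @embedding $vec AS score]");
    redis::cmd("FT.SEARCH")
        .arg("idx:cache")
        .arg(query)
        .arg("PARAMS").arg(2).arg("vec").arg(blob)
        .arg("RETURN").arg(1).arg("score")
        .arg("DIALECT").arg(2)
        .query(con) // same reply shape as the unfiltered lookup; parse as before
}
```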