Latency Slayer: a Redis 8 semantic cache gateway that makes LLMs feel instant

This is a submission for the Redis AI Challenge: Real-Time AI Innovators.


This content originally appeared on DEV Community and was authored by Mohit Agnihotri


What I Built

Latency Slayer is a tiny Rust reverse-proxy that sits in front of any LLM API.

It uses embeddings + vector search in Redis 8 to detect “repeat-ish” prompts and return a cached answer instantly. New prompts are answered once by the LLM and stored with per-field TTLs, so only the response expires while metadata persists.
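The hit decision described above can be sketched in plain Rust. This is only an illustration, not the gateway's actual code: the function names and the distance threshold are assumptions, and in the real system Redis performs the nearest-neighbour search. With the COSINE metric, Redis reports a distance equal to 1 minus the cosine similarity, so a "repeat-ish" prompt is one whose distance falls at or below some cutoff.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Treat the cached prompt as a hit when its cosine distance
/// (1 - similarity) is at or below `max_distance`.
/// `max_distance` is a hypothetical tuning knob, not a value from the post.
fn is_cache_hit(query: &[f32], cached: &[f32], max_distance: f32) -> bool {
    match cosine_similarity(query, cached) {
        Some(sim) => (1.0 - sim) <= max_distance,
        None => false,
    }
}
```

A tighter `max_distance` trades hit rate for safety: fewer prompts are served from cache, but the ones that are served sit closer to the original question.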

Why it matters: dramatically lower latency and cost, with transparent drop-in integration for any chat or RAG app.

Core tricks

  • Redis Query Engine + HNSW vectors (COSINE) to find semantically similar earlier prompts.
  • Hash field expiration (HSETEX / HGETEX) so we can expire just the “response” field without deleting the whole hash.
  • Redis Streams for real-time hit-rate & latency metrics, rendered in a tiny dashboard.
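To make the field-expiration trick concrete, here is a minimal sketch that only builds the command arguments for HSETEX and HGETEX; it does not talk to Redis. The helper names, the key `cache:abc`, and the TTL values are assumptions for illustration. The argument order follows the documented Redis 8 forms `HSETEX key EX seconds FIELDS numfields field value` and `HGETEX key [EX seconds] FIELDS numfields field`.

```rust
/// Arguments for: HSETEX <key> EX <ttl_secs> FIELDS 1 <field> <value>
/// Sets one hash field and its TTL in a single command.
fn hsetex_args(key: &str, field: &str, value: &str, ttl_secs: u64) -> Vec<String> {
    vec![
        "HSETEX".to_string(),
        key.to_string(),
        "EX".to_string(),
        ttl_secs.to_string(),
        "FIELDS".to_string(),
        "1".to_string(),
        field.to_string(),
        value.to_string(),
    ]
}

/// Arguments for: HGETEX <key> [EX <ttl_secs>] FIELDS 1 <field>
/// Passing Some(ttl) refreshes the field's TTL while reading it;
/// passing None reads without touching the TTL.
fn hgetex_args(key: &str, field: &str, refresh_ttl_secs: Option<u64>) -> Vec<String> {
    let mut args = vec!["HGETEX".to_string(), key.to_string()];
    if let Some(ttl) = refresh_ttl_secs {
        args.push("EX".to_string());
        args.push(ttl.to_string());
    }
    args.extend(["FIELDS".to_string(), "1".to_string(), field.to_string()]);
    args
}
```

For example, `hsetex_args("cache:abc", "resp", "hello", 3600)` expires only the `resp` field after an hour; other fields on the same hash are unaffected.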

Demo

Screenshots: Dashboard Screenshot 1 and Dashboard Screenshot 2.

How I Used Redis 8

  • Vector search (HNSW, COSINE) on a HASH document that stores an embedding field (FP32, 1536-d from OpenAI text-embedding-3-small).
  • Per-field TTL on hashes: HSETEX to set the response field and its TTL in a single step; HGETEX to read and optionally refresh TTLs. This gives granular cache lifetimes without deleting other fields (like usage or model metadata).
  • Redis Streams: XADD analytics:cache per request; the dashboard reads the stream and renders hit rate, token savings, and latency deltas in real time.
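As a rough sketch of the vector setup in the list above, the following builds an `FT.CREATE` argument list for an HNSW/COSINE index over hashes and serializes an FP32 embedding into the byte blob stored in the hash field. The index name, the `embedding` field name, and the little-endian byte order are assumptions (little-endian matches what common x86/ARM clients produce); the tag names come from the data model in this post.

```rust
/// Serialize an FP32 embedding as little-endian bytes (4 bytes per component),
/// the blob form assumed here for a FLOAT32 vector field in a hash.
fn embedding_to_bytes(embedding: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(embedding.len() * 4);
    for v in embedding {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Arguments for an FT.CREATE call that indexes hashes under `prefix`
/// with an HNSW vector field plus three TAG fields.
/// "6" is the number of vector attribute tokens that follow HNSW.
fn ft_create_args(index: &str, prefix: &str, dim: usize) -> Vec<String> {
    let dim = dim.to_string();
    let parts = [
        "FT.CREATE", index, "ON", "HASH", "PREFIX", "1", prefix,
        "SCHEMA",
        "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32", "DIM", dim.as_str(), "DISTANCE_METRIC", "COSINE",
        "model", "TAG", "route", "TAG", "user", "TAG",
    ];
    parts.iter().map(|s| s.to_string()).collect()
}
```

A 1536-dimensional FP32 embedding serializes to 6144 bytes, which is the raw per-vector payload before any index overhead.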

Data model (simplified)

  • cache:{fingerprint} → Hash fields: prompt, resp, meta, usage, created_at (with resp having its own TTL)
  • vec:{fingerprint} → Vector field + tags (model, route, user)
  • Stream: analytics:cache with {event, hit, latency_ms, tokens_saved}
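The data model above could be mirrored in Rust roughly as follows. The post does not say how the fingerprint is computed, so the FNV-1a hash over a trimmed, lowercased prompt is purely a stand-in; the `CacheEvent` struct and the `lookup` event name are likewise hypothetical. Only the key prefixes, the stream name, and the field names come from the post.

```rust
/// Stand-in fingerprint (assumption): 64-bit FNV-1a over the trimmed,
/// lowercased prompt, rendered as 16 hex digits.
fn fingerprint(prompt: &str) -> String {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a offset basis
    for byte in prompt.trim().to_lowercase().bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV-1a prime
    }
    format!("{:016x}", hash)
}

/// Key of the hash holding prompt, resp, meta, usage, created_at.
fn cache_key(fp: &str) -> String {
    format!("cache:{}", fp)
}

/// Key of the hash holding the vector field and tags.
fn vec_key(fp: &str) -> String {
    format!("vec:{}", fp)
}

/// One analytics record per request (hypothetical struct).
struct CacheEvent {
    hit: bool,
    latency_ms: u64,
    tokens_saved: u64,
}

/// Arguments for: XADD analytics:cache * event lookup hit <0|1> ...
/// "*" asks Redis to assign the entry ID.
fn xadd_args(ev: &CacheEvent) -> Vec<String> {
    vec![
        "XADD".to_string(),
        "analytics:cache".to_string(),
        "*".to_string(),
        "event".to_string(),
        "lookup".to_string(),
        "hit".to_string(),
        (if ev.hit { "1" } else { "0" }).to_string(),
        "latency_ms".to_string(),
        ev.latency_ms.to_string(),
        "tokens_saved".to_string(),
        ev.tokens_saved.to_string(),
    ]
}
```

Keeping the response hash and the vector hash under separate prefixes means the vector index only has to scan `vec:` keys, while the `resp` field can expire on its own schedule.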

Why Redis 8?

  • New field-level expiration commands on hashes make cache lifecycle clean and safe.
  • New int8 vector types can keep memory low and speed high (this build still stores FP32; see What’s next).
  • Battle-tested Streams/PubSub give us real-time observability with a tiny footprint.

What’s next

  • Prefetch: predict likely next prompts and warm them proactively.
  • Hybrid filters: combine vector similarity + tags (model/route) for stricter cache hits.
  • Cold-start tuning: adapt hit threshold by route and user cohort.
  • INT8 quantization: vectors are currently stored as FP32 for simplicity; quantizing to INT8 is planned to lower memory use and speed up search.
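For the planned INT8 step, one common approach is symmetric scalar quantization, sketched below. This is an assumption about how the quantization might be done, not the project's implementation: each component is divided by a per-vector scale (max absolute value / 127) and rounded to the nearest integer. Because cosine similarity ignores a uniform scale factor, the scale does not need to be stored for COSINE search, though rounding still introduces a small error.

```rust
/// Symmetric scalar quantization of an FP32 vector to i8.
/// Returns the quantized components and the scale used,
/// so that original ≈ quantized * scale.
fn quantize_int8(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    if max_abs == 0.0 {
        // All-zero input: nothing to scale.
        return (vec![0; v.len()], 1.0);
    }
    let scale = max_abs / 127.0;
    let q: Vec<i8> = v
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Raw payload size of one vector, ignoring index overhead.
fn raw_vector_bytes(dim: usize, bytes_per_component: usize) -> usize {
    dim * bytes_per_component
}
```

For a 1536-dimensional embedding the raw payload drops from 6144 bytes (FP32) to 1536 bytes (INT8), a 4x reduction before index overhead.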




APA

Mohit Agnihotri | Sciencx (2025-08-10T17:57:45+00:00) Latency Slayer: a Redis 8 semantic cache gateway that makes LLMs feel instant. Retrieved from https://www.scien.cx/2025/08/10/latency-slayer-a-redis-8-semantic-cache-gateway-that-makes-llms-feel-instant/
