vAttention System Design: Dynamic KV-Cache with Contiguous Virtual Memory
Post date: June 12, 2025 · Post author: Text Generation · Post categories: contiguous-virtual-memory, dynamic-memory-allocation, gpu-memory, kv-cache-management, llm-inference, system-architecture, system-design, vattention

Self-Speculative Decoding Speeds for Multi-Token LLMs
Post date: June 6, 2025 · Post author: Large Models (dot tech) · Post categories: ai-efficiency, code-generation, inference-optimization, llm-decoding-speed, llm-inference, multi-token-models, multi-token-prediction, self-speculative-decoding

AI Innovations and Insights 22: LLM Inference, SubgraphRAG, and FastRAG
Post date: January 27, 2025 · Post author: Florian June · Post categories: ai, graphrag, large-language-models, llm-inference, retrieval-augmented-gen