KVTC: Nvidia’s 20x LLM Memory Cut Without Retraining

Łukasz Grochal

Nvidia’s new technique, KV Cache Transform Coding (KVTC), compresses the key-value (KV) cache used during inference so that large language models can store conversation history in up to 20 times less memory, without touching the model weights or architecture. Instead of shrinking the model itself, it treats the KV cache like multimedia data and applies ideas similar to JPEG: dimensionality reduction with PCA, adaptive precision per feature, and entropy coding via DEFLATE, accelerated by Nvidia’s nvCOMP library.
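The three-stage pipeline (PCA, per-feature quantization, DEFLATE) can be sketched in plain NumPy. This is an illustrative toy, not Nvidia's implementation: the component count, the 8-bit quantization, and the use of `zlib` in place of nvCOMP's GPU-accelerated DEFLATE are all assumptions made for clarity.

```python
import zlib
import numpy as np

def compress_kv(cache: np.ndarray, n_components: int = 32):
    """Transform-code one KV cache slice (tokens x head_dim, float32).

    Stages mirror the JPEG-like recipe described above:
      1. PCA projection (dimensionality reduction),
      2. per-component quantization (adaptive precision stand-in),
      3. DEFLATE entropy coding (zlib stands in for nvCOMP).
    All parameter choices here are illustrative, not Nvidia's settings.
    """
    mean = cache.mean(axis=0)
    centered = cache - mean
    # PCA basis from the SVD of the centered cache; keep top components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]            # (n_components, head_dim)
    coeffs = centered @ basis.T          # (tokens, n_components)

    # Quantize each component with its own scale so low-magnitude
    # components spend fewer effective bits (a crude form of the
    # per-feature adaptive precision the article mentions).
    scale = np.abs(coeffs).max(axis=0) / 127.0 + 1e-12
    quantized = np.round(coeffs / scale).astype(np.int8)

    # Entropy-code the quantized bytes with DEFLATE.
    payload = zlib.compress(quantized.tobytes(), level=9)
    return payload, (mean, basis, scale, quantized.shape)

def decompress_kv(payload: bytes, meta) -> np.ndarray:
    """Invert the pipeline: inflate, dequantize, project back."""
    mean, basis, scale, shape = meta
    q = np.frombuffer(zlib.decompress(payload), dtype=np.int8).reshape(shape)
    return (q.astype(np.float32) * scale) @ basis + mean
```

The reconstruction is lossy (truncated components and rounding are gone for good), which is why the technique is judged by downstream accuracy loss rather than bit-exactness; the article's reported figure is typically under 1 percent.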

The compression runs between inference phases, so the hot path of token generation stays fast, enabling up to roughly 8x faster time to first token with typically under 1 percent accuracy loss, even at aggressive compression ratios beyond 20x in some tests. Because KVTC only changes how caches are stored and moved, it can be slotted into existing serving stacks such as Nvidia Dynamo’s KV Block Manager and open-source runtimes like vLLM without retraining the model, which makes it attractive both for production deployments and for open models. In practice, operators can run longer context windows and more concurrent sessions on the same GPUs, lower memory-related costs, and resume multi-turn chats without constantly recomputing history. That is particularly useful for coding assistants, enterprise chatbots, and other applications that keep long conversations alive.

Nvidia positions KVTC as a near-term, non-intrusive infrastructure feature that can be rolled out in software updates rather than a distant research prototype, and early reports indicate it already integrates with Nvidia’s broader inference stack for both proprietary and open-source models.
