Nvidia’s new technique, KV Cache Transform Coding (KVTC), compresses the key-value (KV) cache used during inference, letting large language models hold conversation history in up to 20 times less memory without touching the model weights or architecture. Instead of shrinking the model itself, KVTC treats the KV cache like multimedia data and applies ideas familiar from JPEG: dimensionality reduction with PCA, adaptive precision per feature, and entropy coding via DEFLATE, accelerated by Nvidia’s nvCOMP library.
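The JPEG-style pipeline described above, PCA followed by per-feature precision and DEFLATE, can be sketched roughly as follows. This is an illustrative reconstruction, not Nvidia's implementation: the function names, the SVD-based PCA, the int8 quantization grid, and the use of CPU-side zlib in place of GPU-accelerated nvCOMP are all assumptions.

```python
import zlib
import numpy as np

def compress_kv(kv, n_components=16, bits=8):
    """Hypothetical sketch of a transform-coding KV-cache codec:
    PCA -> per-feature quantization -> DEFLATE entropy coding."""
    # Center the cache entries (tokens x head_dim) and find a PCA basis.
    mean = kv.mean(axis=0)
    centered = kv - mean
    # SVD gives the principal directions; keep only the top components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coeffs = centered @ basis.T          # reduced representation
    # Adaptive precision: each retained feature gets its own scale so
    # every column uses the full signed integer grid.
    scale = np.abs(coeffs).max(axis=0) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    q = np.round(coeffs / scale).astype(np.int8)
    # Entropy-code the quantized coefficients with DEFLATE (zlib here;
    # the real system offloads this step to nvCOMP on the GPU).
    blob = zlib.compress(q.tobytes())
    return blob, mean, basis, scale, q.shape

def decompress_kv(blob, mean, basis, scale, shape):
    """Invert the pipeline: inflate, dequantize, project back."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return (q * scale) @ basis + mean
```

Because attention keys and values are highly correlated across tokens, most of their energy concentrates in a few principal components, which is what makes the aggressive ratios plausible; the sidecar data (mean, basis, scales) is small relative to a long context.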
The compression runs between inference phases, so the hot path of token generation stays fast; Nvidia reports up to roughly 8x faster time-to-first-token with typically under 1 percent accuracy loss, even at aggressive compression levels beyond 20x in some tests. Because KVTC only changes how caches are stored and moved, it can be slotted into existing serving stacks such as Nvidia Dynamo’s KV Block Manager and open-source runtimes like vLLM without retraining the model, which makes it attractive both for production deployments and for open models. In practice, operators can run longer context windows and more concurrent sessions on the same GPUs, lower memory-related costs, and resume multi-turn chats without repeatedly recomputing history; that is particularly useful for coding assistants, enterprise chatbots, and other applications that keep long conversations alive.
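Because the technique only touches how caches are stored and moved, integration into a serving stack can look like a park/resume hook around each conversation turn: compress the session's KV cache when the turn ends, restore it when the user comes back. The sketch below is hypothetical (the class and method names are invented, and zlib stands in for the actual codec); it shows the off-hot-path placement, not a real serving API.

```python
import pickle
import zlib

class SessionKVStore:
    """Hypothetical cache manager: compress a session's KV cache between
    inference phases and decompress it on resume, so multi-turn history
    never has to be recomputed from the raw prompt."""

    def __init__(self):
        self._store = {}

    def park(self, session_id, kv_arrays):
        # Runs after a turn completes, off the token-generation hot path.
        payload = pickle.dumps(kv_arrays)
        self._store[session_id] = zlib.compress(payload)

    def resume(self, session_id):
        # Inflate the parked cache so decoding can continue immediately.
        return pickle.loads(zlib.decompress(self._store.pop(session_id)))
```

The same pattern extends naturally to tiered storage: parked blobs can spill from GPU memory to host RAM or disk, which is where the concurrency and cost benefits come from.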
Nvidia positions KVTC as a near-term, non-intrusive infrastructure feature that can be rolled out in software updates rather than as a distant research prototype, and early reports indicate it already integrates with Nvidia’s broader inference stack for both proprietary and open-source models.









