DeepSeek Open-Sources Nano-VLLM
June 23, 2025 | Author: Łukasz Grochal

DeepSeek researchers have open-sourced Nano-VLLM, a minimalist implementation of the popular vLLM serving engine for large language models, built from scratch with three key optimizations:

  1. 5x Smaller Memory Footprint (under 2GB for 7B models)
     • Achieved via dynamic page-aware KV cache compression (see the first sketch after this list)
     • Supports CPU offloading for edge devices

  2. Python-First Architecture
     • No CUDA dependencies (unlike the original vLLM)
     • Pure NumPy backend for prototyping (see the second sketch)

  3. Experimental Features
     • Speculative token prefill: 30% faster time to first token
     • BitDelta quantization: 3-bit weights with <1% perplexity loss (see the third sketch)
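
The announcement doesn't spell out how the page-aware compression works. As a minimal sketch of the general idea, the snippet below keeps a hot, uncompressed tail of recent tokens and quantizes each completed page to int8; the page size, the quantization scheme, and the `PagedKVCache`/`KVPage` names are all assumptions for illustration, not Nano-VLLM's actual design:

```python
import numpy as np

PAGE_TOKENS = 16  # tokens per KV page (assumed; the real page size is not documented)

class KVPage:
    """One fixed-size page of key/value states, stored quantized to int8."""
    def __init__(self, kv_block: np.ndarray):
        # kv_block: (2, PAGE_TOKENS, n_heads, head_dim) in float16
        self.scale = np.abs(kv_block).max() / 127.0 + 1e-8
        self.data = np.round(kv_block / self.scale).astype(np.int8)

    def dequantize(self) -> np.ndarray:
        return self.data.astype(np.float16) * self.scale

class PagedKVCache:
    """Append-only KV cache: compresses full pages, keeps a hot uncompressed tail."""
    def __init__(self, n_heads: int, head_dim: int):
        self.pages: list[KVPage] = []  # cold, compressed pages
        self.tail = np.zeros((2, 0, n_heads, head_dim), np.float16)  # hot tail

    def append(self, kv: np.ndarray) -> None:
        # kv: (2, n_new_tokens, n_heads, head_dim)
        self.tail = np.concatenate([self.tail, kv], axis=1)
        while self.tail.shape[1] >= PAGE_TOKENS:
            self.pages.append(KVPage(self.tail[:, :PAGE_TOKENS]))
            self.tail = self.tail[:, PAGE_TOKENS:]

    def full(self) -> np.ndarray:
        # Reassemble the whole cache for attention, dequantizing cold pages.
        blocks = [p.dequantize() for p in self.pages] + [self.tail]
        return np.concatenate(blocks, axis=1)
```

CPU offloading would slot in naturally here: cold pages are plain arrays that could be moved to disk or host memory without touching the hot tail.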
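
The "pure NumPy backend" point is easy to make concrete. The snippet below is a generic single-head causal attention step written only against NumPy, the kind of kernel such a backend has to provide; it is illustrative, not code from the repository:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Single-head causal attention: q, k, v have shape (seq_len, head_dim)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (seq, seq) similarity scores
    mask = np.triu(np.ones_like(scores, bool), k=1)  # mask out future positions
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v                       # (seq, head_dim)

# Toy usage: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
print(causal_attention(q, k, v).shape)  # (4, 8)
```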
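
The announcement doesn't describe BitDelta's internals. As a rough illustration of what storing weights in 3 bits involves, here is a generic group-wise, symmetric round-to-nearest quantizer in NumPy; the group size is an assumption, and this shows the mechanics (8 levels per weight, one scale per group) rather than the BitDelta algorithm itself:

```python
import numpy as np

GROUP = 64  # weights per quantization group (assumed)

def quantize_3bit(w: np.ndarray):
    """Symmetric 3-bit quantization: maps each group to integers in [-4, 3]."""
    w = w.reshape(-1, GROUP)
    scale = np.abs(w).max(axis=1, keepdims=True) / 4.0 + 1e-12
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize_3bit(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

w = np.random.default_rng(1).standard_normal((128, 128)).astype(np.float32)
q, s = quantize_3bit(w)
w_hat = dequantize_3bit(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

A real implementation would pack eight 3-bit codes into three bytes instead of storing one code per int8 byte, which is where the memory savings actually come from.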


Nano-VLLM currently supports LLaMA 3 and DeepSeek's own 7B models. Benchmarks show 23 tokens/sec on a consumer CPU (Intel Core i7-13700K).
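
A minimal run would presumably mirror vLLM's Python API, since Nano-VLLM is billed as a from-scratch reimplementation. The import path, class names, and arguments below are assumptions for illustration, not confirmed API:

```python
# Hypothetical usage, assuming Nano-VLLM mirrors vLLM's Python interface.
# Import path, class names, and parameters are assumptions, not confirmed.
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/llama3-7b")  # placeholder model path
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain paged KV caches in one sentence."], params)
print(outputs[0])
```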