DeepSeek Open-Sources Nano-vLLM
DeepSeek researchers have open-sourced Nano-vLLM, a minimalist, from-scratch implementation of the popular vLLM LLM-serving system, built around three key optimizations:
- 5x Smaller Memory Footprint (under 2 GB for 7B models)
  - Achieved via dynamic page-aware KV cache compression (a paged-cache sketch follows this list)
  - Supports CPU offloading for edge devices
- Python-First Architecture
  - No CUDA dependencies (unlike the original vLLM)
  - Pure NumPy backend for prototyping (see the attention sketch after this list)
- Experimental Features
  - Speculative token prefill: 30% faster time to first token
  - BitDelta quantization: 3-bit weights with <1% perplexity loss (a delta-quantization sketch follows this list)
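The memory-footprint claim rests on how the KV cache is organized. The toy below is a minimal NumPy sketch of the general paged-KV-cache idea (fixed-size pages addressed through a per-sequence block table, with inactive sequences offloadable to host RAM); the class, attribute names, and layout are assumptions for illustration, not Nano-vLLM's actual data structures:

```python
import numpy as np

PAGE_SIZE = 16          # tokens per page
NUM_HEADS, HEAD_DIM = 8, 64


class PagedKVCache:
    """Toy page-aware KV cache: keys/values live in fixed-size pages addressed
    through a per-sequence block table, so memory grows in PAGE_SIZE steps and
    inactive sequences can be offloaded to host RAM.

    Illustrative sketch only -- layout and names are assumptions, not Nano-vLLM code.
    """

    def __init__(self, num_pages: int):
        # pool[page, 0] holds keys, pool[page, 1] holds values.
        self.pool = np.zeros((num_pages, 2, PAGE_SIZE, NUM_HEADS, HEAD_DIM), dtype=np.float16)
        self.free_pages = list(range(num_pages))
        self.block_table = {}   # seq_id -> list of page ids in the pool
        self.offloaded = {}     # seq_id -> pages copied out to "CPU" storage

    def append(self, seq_id, k, v, pos):
        """Write one token's K/V (each [NUM_HEADS, HEAD_DIM]) at token index `pos`."""
        pages = self.block_table.setdefault(seq_id, [])
        page_idx, slot = divmod(pos, PAGE_SIZE)
        if page_idx == len(pages):                # sequence crossed a page boundary
            pages.append(self.free_pages.pop())   # allocate a fresh page on demand
        self.pool[pages[page_idx], 0, slot] = k
        self.pool[pages[page_idx], 1, slot] = v

    def gather(self, seq_id, length):
        """Return contiguous K and V ([length, NUM_HEADS, HEAD_DIM]) for attention."""
        pages = self.block_table[seq_id]
        kv = np.concatenate([self.pool[p] for p in pages], axis=1)
        return kv[0, :length], kv[1, :length]

    def offload(self, seq_id):
        """Move an inactive sequence's pages to host memory and free its pool slots."""
        pages = self.block_table.pop(seq_id)
        self.offloaded[seq_id] = [self.pool[p].copy() for p in pages]
        self.free_pages.extend(pages)
```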
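The "pure NumPy backend" point is easy to make concrete: a single decode step needs little more than matrix multiplies and a softmax. The snippet below is an illustrative single-token, multi-head attention step against a cached K/V, not code taken from the repository:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_step_attention(q, k_cache, v_cache):
    """One decode step of multi-head attention in plain NumPy.

    q:        [num_heads, head_dim]           query for the newest token
    k_cache:  [seq_len, num_heads, head_dim]  cached keys for past tokens
    v_cache:  [seq_len, num_heads, head_dim]  cached values for past tokens
    returns:  [num_heads, head_dim]           attention output for the new token
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = np.einsum("hd,thd->ht", q, k_cache) * scale   # per-head dot products
    weights = softmax(scores, axis=-1)                      # causal: cache holds only past tokens
    return np.einsum("ht,thd->hd", weights, v_cache)

# Tiny smoke test with random tensors.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64)).astype(np.float32)
k = rng.standard_normal((32, 8, 64)).astype(np.float32)
v = rng.standard_normal((32, 8, 64)).astype(np.float32)
print(decode_step_attention(q, k, v).shape)   # (8, 64)
```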
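BitDelta, as described in the literature, stores a low-bit quantized residual between a fine-tuned model and its base. The sketch below illustrates the general mechanics with symmetric 3-bit rounding and a per-column float scale; the published method (and whatever Nano-vLLM ships) may choose scales differently, so treat this as a rough approximation rather than the actual implementation:

```python
import numpy as np

def quantize_delta_3bit(w_finetuned, w_base):
    """Quantize the fine-tune delta (w_finetuned - w_base) to 3-bit codes.

    Sketch of delta quantization: store the base model once, plus a low-bit
    residual per fine-tune. Uses uniform symmetric rounding to the integer
    levels in [-3, 3] with one float scale per output column; the real
    BitDelta method may differ in detail.
    """
    delta = w_finetuned - w_base
    levels = 3                                             # symmetric 3-bit grid
    scale = np.abs(delta).max(axis=0, keepdims=True) / levels + 1e-12
    codes = np.clip(np.round(delta / scale), -levels, levels).astype(np.int8)
    return codes, scale

def reconstruct(w_base, codes, scale):
    """Approximate the fine-tuned weights from the base weights plus the 3-bit delta."""
    return w_base + codes.astype(np.float32) * scale

# Reconstruction error on a synthetic base/fine-tune pair.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02
w_ft = w_base + rng.standard_normal((1024, 1024)).astype(np.float32) * 0.002
codes, scale = quantize_delta_3bit(w_ft, w_base)
w_hat = reconstruct(w_base, codes, scale)
print(np.abs(w_hat - w_ft).max())   # worst-case absolute reconstruction error
```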
Nano-vLLM currently supports Llama 3 and DeepSeek's own 7B models. Benchmarks show 23 tokens/sec on a consumer CPU (Intel Core i7-13700K).
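For readers who want to try it, the snippet below is a hypothetical usage sketch assuming a vLLM-style Python interface; the module, class, and parameter names are assumptions, so check the project's README on GitHub for the exact entry points:

```python
# Hypothetical usage sketch -- entry points are assumed, not confirmed by the source;
# consult the Nano-vLLM README before copying this.
from nanovllm import LLM, SamplingParams   # assumed vLLM-style interface

llm = LLM("path/to/llama3-7b")             # load a supported 7B checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain paged KV caches in one sentence."], params)
print(outputs[0])
```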
- Source: MarkTechPost
- Source: GitHub