DeepSeek Open-Sources Nano-VLLM

Łukasz Grochal

DeepSeek researchers have open-sourced Nano-VLLM, a minimalist implementation of the popular vLLM LLM-serving engine, built from scratch with three key optimizations:


  1. 5x Smaller Memory Footprint (under 2 GB for 7B models)
     • Achieved via dynamic page-aware KV cache compression
     • Supports CPU offloading for edge devices
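The article doesn't spell out how the page-aware compression works, but the general idea of splitting the KV cache into fixed-size pages and quantizing each page can be sketched in a few lines of NumPy. The page size, head dimension, and int8 scheme below are all illustrative assumptions, not Nano-VLLM's actual code:

```python
import numpy as np

PAGE_SIZE = 16   # tokens per KV page (hypothetical; the real page size isn't documented here)
HEAD_DIM = 64    # illustrative head dimension

def compress_page(page):
    """Quantize a float32 KV page to int8 with one scale per page (4x smaller)."""
    scale = max(np.abs(page).max() / 127.0, 1e-8)  # guard against all-zero pages
    return np.round(page / scale).astype(np.int8), scale

def decompress_page(qpage, scale):
    """Restore an approximate float32 page before it is used in attention."""
    return qpage.astype(np.float32) * scale

# Usage: one page holding keys for 16 tokens
page = np.random.randn(PAGE_SIZE, HEAD_DIM).astype(np.float32)
qpage, scale = compress_page(page)
print(page.nbytes, "->", qpage.nbytes)  # 4096 -> 1024
```

Dequantizing on read trades a little compute for a 4x smaller resident cache, which is the kind of trade-off behind the claimed sub-2 GB footprint for 7B models.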


  2. Python-First Architecture
     • No CUDA dependencies (unlike the original vLLM)
     • Pure NumPy backend for prototyping
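A CUDA-free, NumPy-only backend means the core attention step reduces to plain matrix math. Here is a minimal sketch of what such a backend computes (single head, causal masking; function names are mine, not Nano-VLLM's):

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask, NumPy only."""
    d = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d)
    # each position may attend only to itself and earlier positions
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Usage: 8 tokens with head dimension 64
q, k, v = (np.random.randn(8, 64) for _ in range(3))
out = causal_attention(q, k, v)
```

Because the first token can attend only to itself, `out[0]` equals `v[0]` exactly, which makes a convenient sanity check when prototyping without a GPU.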


  3. Experimental Features
     • Speculative token prefill: 30% faster time to first token
     • BitDelta quantization: 3-bit weights with <1% perplexity loss
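BitDelta's exact procedure isn't described in the article; as a rough illustration of what 3-bit weight quantization involves, here is a generic symmetric quantizer with 8 integer levels and a per-tensor scale (not necessarily how BitDelta works, and bit-packing the 3-bit values is omitted for clarity):

```python
import numpy as np

def quantize_3bit(w):
    """Map float32 weights to 8 integer levels in [-4, 3] plus one scale factor.
    A generic sketch of 3-bit quantization, not BitDelta's actual algorithm."""
    scale = max(np.abs(w).max() / 3.0, 1e-8)
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize_3bit(q, scale):
    """Reconstruct approximate float32 weights from the 3-bit levels."""
    return q.astype(np.float32) * scale

# Usage: quantize one weight matrix
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_3bit(w)
```

Packed at 3 bits per weight, storage would shrink by roughly 10x versus float32; keeping perplexity loss under 1% at that rate is what makes the claim notable.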


Currently supports Llama 3 and DeepSeek's own 7B models. Benchmarks show 23 tokens/sec on a consumer CPU (Intel i7-13700K).
