DeepSeek V4: 1M‑Token Context and Budget Frontier AI Models

Łukasz Grochal

DeepSeek V4 is a new family of large‑scale open‑weights AI models from the Chinese lab DeepSeek, released in April 2026 as a “preview” under the MIT license. It consists of two main variants: DeepSeek‑V4‑Pro (around 1.6 trillion total parameters, 49 billion active per token) and DeepSeek‑V4‑Flash (about 284 billion total parameters, 13 billion active). Both models support a 1 million‑token context window, use a Mixture‑of‑Experts (MoE) architecture, and are positioned as highly efficient, especially for long‑context workloads.
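The MoE figures quoted above imply that only a few percent of each model's parameters are active for any given token; a quick back-of-envelope check (numbers taken directly from the paragraph above):

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of an MoE model's parameters activated per token."""
    return active_params / total_params

# DeepSeek-V4-Pro: ~1.6T total, ~49B active per token
pro = active_fraction(1.6e12, 49e9)
# DeepSeek-V4-Flash: ~284B total, ~13B active per token
flash = active_fraction(284e9, 13e9)

print(f"V4-Pro:   {pro:.1%} of parameters active per token")
print(f"V4-Flash: {flash:.1%} of parameters active per token")
```

That roughly 3–5% activation rate is what makes trillion-scale totals compatible with the per-token compute of a far smaller dense model.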

New technical tricks, such as a hybrid “CSA + HCA” attention mechanism and ultra‑sparse KV‑cache handling, let DeepSeek compress computation and memory in million‑token runs, so that even the giant Pro model uses only a fraction of the FLOPs and KV‑cache of the earlier V3.2. This efficiency is why DeepSeek can price V4‑Flash at about 0.14 USD per million input tokens and 0.28 USD per million output tokens, and V4‑Pro at 1.74 / 3.48 USD per million tokens, making both significantly cheaper than comparable frontier models from OpenAI, Anthropic, and Google.
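At these list prices, the cost of a single long‑context call is easy to estimate; a minimal sketch (the request sizes are made up for illustration, the per‑million‑token prices are the ones quoted above):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     in_price: float, out_price: float) -> float:
    """Cost of one request at per-million-token list prices (USD)."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical long-context request: 800k tokens in, 20k tokens out.
flash = request_cost_usd(800_000, 20_000, 0.14, 0.28)  # V4-Flash prices
pro = request_cost_usd(800_000, 20_000, 1.74, 3.48)    # V4-Pro prices

print(f"V4-Flash: ${flash:.4f}   V4-Pro: ${pro:.4f}")
```

Even a near‑full‑context request on Flash comes in at pennies, which is the practical meaning of the price gap the article describes.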

Benchmarks from third‑party sites such as Arena.ai and Vals AI put DeepSeek‑V4‑Pro among the top open‑source models, especially in code and agent‑style tasks. In some code‑arena rankings, V4‑Pro (thinking mode) lands in the top‑3 open‑source models and ahead of several closed‑source systems, while V4‑Flash scores over 10× higher than V3.2 on Vals’ Vibe Code Benchmark. Internal and external evaluations also suggest that V4‑Pro is roughly on par with leading closed‑source models like Gemini‑3.1‑Pro and GPT‑5‑ish systems in reasoning and math, though still a bit behind the absolute cutting‑edge, which DeepSeek itself frames as a 3–6‑month gap.

On the hardware side, DeepSeek emphasizes domestic Chinese compute: V4 is said to be the first trillion‑scale model trained on Huawei Ascend NPUs and other local chips, with fine‑grained expert‑parallelism optimizations that push inference speed up by roughly 1.5–1.7× on Ascend hardware. The Pro model’s raw size (over 1.6 trillion parameters) means it is aimed at data‑center‑scale deployments, while the smaller Flash variant is more suited for local or edge‑like stacks, with some enthusiasts expecting quantized Flash builds to run on high‑end laptops (for example, 128 GB RAM systems).

Controversies and debates around V4 are relatively mild so far: the main talking points are the price–performance jump versus U.S.‑style frontier models, questions about whether the long‑context features are practically useful for most users, and some skepticism from long‑time users who feel the day‑to‑day experience of Flash is not clearly better than V3.2. DeepSeek also openly warns that Pro‑tier throughput is still limited by high‑end compute availability, with hints that prices could drop further once newer Ascend 950 super‑node hardware rolls out in the second half of 2026.

| Model / variant | Approx. size (total params) | Active params per token | Context length | Input price / 1M tokens (USD) | Output price / 1M tokens (USD) | Notable code‑bench remarks |
|---|---|---|---|---|---|---|
| DeepSeek‑V4‑Pro | ~1.6 trillion | ~49B | 1M tokens | 1.74 | 3.48 | Top‑3 open‑source in code arena (thinking mode), ahead of several closed systems |
| DeepSeek‑V4‑Flash | ~284B | ~13B | 1M tokens | 0.14 | 0.28 | Lowest‑priced small‑scale model; 10× leap vs V3.2 on Vals Vibe Code |
| Gemini‑3.1‑Pro (closed) | undisclosed | undisclosed | up to 1M | ~2.00 | ~12.00 | Often used as reference for “top‑tier” closed code‑reasoning |
| GPT‑5.4 (closed, large‑tier) | undisclosed | undisclosed | long‑context | ~2.50 | ~15.00 | Benchmark baseline for advanced reasoning and coding |
| Claude‑Sonnet‑4.6 (closed) | undisclosed | undisclosed | long‑context | ~3.00 | ~15.00 | Frequently compared to V4 in agent‑style coding runs |
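Taking the list prices from the table above (closed‑model figures are the approximate values quoted there), the same workload can be costed across all five models; a rough sketch, with a made‑up 100k‑in / 5k‑out request:

```python
# Per-million-token list prices (USD): (input, output), from the table above.
PRICES = {
    "DeepSeek-V4-Flash": (0.14, 0.28),
    "DeepSeek-V4-Pro": (1.74, 3.48),
    "Gemini-3.1-Pro": (2.00, 12.00),
    "GPT-5.4": (2.50, 15.00),
    "Claude-Sonnet-4.6": (3.00, 15.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request for the given model at its list prices."""
    in_p, out_p = PRICES[model]
    return input_tokens / 1e6 * in_p + output_tokens / 1e6 * out_p

# Same hypothetical workload for every model: 100k tokens in, 5k out.
for model in PRICES:
    print(f"{model:<20} ${cost(model, 100_000, 5_000):.4f}")
```

The ordering, not the exact dollar figures, is the point: at these prices both V4 variants undercut the closed models on an identical request.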

References
- DeepSeek API docs (api-docs.deepseek.com)