How DeepSeek Trains Powerful Models On A Budget

Łukasz Grochal

DeepSeek is pushing a new architecture idea called Manifold-Constrained Hyper-Connections that targets a very practical headache in large models: training becomes unstable and expensive once you widen the residual stream and add richer connectivity patterns. Classic residual connections are stable but rigid. Newer hyper-connection designs boost accuracy, but at the cost of instability, memory overhead and poor scaling, which quickly translates into huge GPU and power bills. DeepSeek's trick is to constrain the residual mapping to a specific mathematical manifold, using doubly stochastic matrices (nonnegative matrices in which every row and every column sums to 1), and to pair this with infrastructure-level optimizations, so that signals stay well behaved even in deep, wide models.
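To make the doubly stochastic constraint concrete, here is a minimal sketch in plain Python. It uses Sinkhorn-Knopp normalization, a standard way to push a positive matrix toward a doubly stochastic one by alternately rescaling rows and columns. This article does not say how DeepSeek enforces the constraint, so the function names `sinkhorn` and `mix_streams`, the iteration count, and the use of Sinkhorn-Knopp itself are illustrative assumptions, not DeepSeek's implementation.

```python
import math


def sinkhorn(logits, iters=50):
    """Approximately map a square matrix of real scores to a doubly
    stochastic matrix with Sinkhorn-Knopp iteration (illustrative choice;
    the method DeepSeek uses is not described in this article)."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        # Rescale each row so that it sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Rescale each column so that it sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def mix_streams(mixing, streams):
    """Mix n parallel residual streams (each a list of floats) with an
    n x n matrix: output stream i is a weighted sum of all input streams."""
    n = len(streams)
    width = len(streams[0])
    return [
        [sum(mixing[i][k] * streams[k][t] for k in range(n)) for t in range(width)]
        for i in range(n)
    ]
```

With a mixing matrix produced this way, each output stream is a weighted average of the input streams (rows sum to 1), so the largest activation cannot grow, and the total across streams is preserved (columns sum to 1). The product of two doubly stochastic matrices is again doubly stochastic, so the same guarantee holds when many such layers are stacked, which is one way to read "signals stay well behaved" above.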

In tests on models from roughly 3 to 27 billion parameters, the framework showed better scaling and efficiency, hinting that you can get more capability per watt instead of relying only on massive clusters. That fits the broader story around DeepSeek. The company already surprised the industry with its low-cost, reasoning-focused R1 family, and it keeps iterating, up to the current 3.2 line, while working on the flagship R2 model expected around Chinese New Year. Many analysts think that release could again shake up the LLM leaderboard despite US export controls on advanced chips.
