Alibaba's Tiny Qwen Beats Big OpenAI Model

Łukasz Grochal

Alibaba's Qwen team just dropped the Qwen3.5 Small series, and the 9B version is the star here. This 9-billion-parameter model beats OpenAI's much larger gpt-oss-120B (over 120B params) on key benchmarks: GPQA Diamond (81.7 vs 80.1) for grad-level reasoning, Video-MME (84.5) for video tasks, and MMMLU (81.2) for multilingual knowledge. It's built on a smart hybrid architecture, pairing Gated Delta Networks for faster attention with Mixture-of-Experts (MoE) to cut latency and boost throughput, and it's natively multimodal from the ground up. That means it handles text, images, and video without clunky add-ons, scoring 70.1 on MMMU-Pro visual reasoning, ahead of Gemini 2.5 Flash-Lite.
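To see why an MoE design keeps inference cheap despite a large total parameter count, here is a minimal sketch of top-k expert routing in plain NumPy. The expert count, dimensions, and gating scheme are illustrative only, not Qwen's actual configuration:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token vector through only the top-k of many experts.

    x:       (d,) input token representation
    gate_w:  (d, n_experts) gating weights
    experts: list of (d, d) weight matrices, one per expert

    Only k expert matmuls run per token, so active compute stays small
    even when the total parameter count across all experts is large.
    """
    logits = x @ gate_w                       # score every expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over chosen experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With 16 experts but k=2, only an eighth of the expert parameters participate in any given token's forward pass, which is the same lever a production MoE uses to trade total capacity against per-token latency.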

These models shine in real-world scenarios where big cloud models falter: edge devices, local agents, and offline apps. The 9B fits on standard laptops in 8-9GB of RAM, perfect for devs building UI navigation tools, document parsing (87.7 on OmniDocBench), or code refactoring across a 262k-token context. Smaller siblings like 0.8B and 2B target phones for quick tasks, while 4B handles lightweight agents. Community buzz on Reddit and X calls it a game-changer for "local-first" AI, running smoothly on M1 MacBooks without cloud costs.
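The 8-9GB figure follows from simple parameter-count arithmetic. A quick back-of-the-envelope check, counting weights only and ignoring KV-cache and activation overhead:

```python
def model_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory footprint: params x bits, converted to GB."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal gigabytes

for bits in (16, 8, 4):
    print(f"9B at {bits}-bit: ~{model_memory_gb(9, bits):.1f} GB")
# 16-bit weights alone need ~18 GB; 8-bit lands right around 9 GB,
# matching the article's laptop figure; 4-bit drops to ~4.5 GB.
```

So the "fits in 8-9GB" claim implies roughly one byte per parameter, i.e. an 8-bit quantized build rather than full-precision weights.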

Against the competition, Qwen3.5-9B matches or beats larger rivals on reasoning and multimodal benchmarks without the scale of gpt-oss-120B or Gemini. It doesn't top everything; some trillion-parameter giants still win on raw power. But its efficiency stands out: 13x smaller yet competitive, with lower inference requirements. OpenAI's model might edge ahead on some speed metrics, but Qwen wins on accessibility and openness (Apache 2.0 license, so commercial tweaks are fair game).

Costs? Specific training figures aren't public for this release, but the series builds on Qwen3 efficiencies, with reports of 60% lower operating costs than prior big siblings like Qwen3-Max, thanks to more experts (512 vs 128) and scaled-up RL training. Inference is cheap locally: no API fees, just hardware you own, unlike hosted services charging $0.11+ per million tokens for similar open models. Drawbacks include potential hallucination cascades in agent workflows, higher VRAM demands at peak use, and data residency questions for some enterprises. Overall, it's pushing small models toward agentic tasks like automating desktops or analyzing videos offline, balancing smarts with practicality.
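To make the "no API fees" trade-off concrete, here's a break-even sketch comparing a one-time hardware purchase against cumulative per-token API spend. The $0.11 per million tokens rate comes from the article; the hardware cost and monthly token volume are hypothetical placeholders:

```python
def breakeven_months(hardware_cost_usd: float,
                     price_per_m_tokens: float,
                     tokens_per_month_m: float) -> float:
    """Months until a one-time hardware cost equals cumulative API spend."""
    monthly_api_spend = price_per_m_tokens * tokens_per_month_m
    return hardware_cost_usd / monthly_api_spend

# Hypothetical: $1500 laptop, 500M tokens/month, $0.11 per million tokens
months = breakeven_months(1500, 0.11, 500)
print(f"break-even after ~{months:.0f} months")
```

At light usage the hosted API stays cheaper for years; the local-first case only pays off at sustained high token volumes, or when the offline and data-residency benefits matter on their own.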
