Exploring Qwen3.6: Coding Benchmarks and Speed

Łukasz Grochal
Source: Alibaba | Qwen

Qwen3.6-35B-A3B just dropped as the first open-weight release in the series, focusing on real-world stability after community feedback on earlier Qwen3.5 versions. Built with a hybrid setup of Gated DeltaNet and MoE layers (256 experts, 8 routed plus 1 shared), it handles massive contexts up to 262k tokens natively, stretchable to over a million with tricks like YaRN. Key upgrades hit agentic coding hard: it shines in benchmarks like SWE-bench Verified at 73.4% (edging out Qwen3.5-35B-A3B's 70%), Terminal-Bench 2.0 at 51.5%, and Claw-Eval avg at 68.7%, showing better repo-level reasoning and frontend tasks.
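To make the expert layout concrete, here is a minimal sketch of top-k MoE routing using the numbers above (256 routed experts, 8 activated per token, plus 1 always-on shared expert). The routing code is illustrative only, not the actual Qwen implementation:

```python
import math
import random

NUM_EXPERTS = 256   # routed experts, per the article
TOP_K = 8           # routed experts activated per token
# One shared expert is always active on top of the routed ones.

def route(token_logits):
    """Select the TOP_K experts with the highest router logits and
    return (expert_id, weight) pairs, with weights softmax-normalized
    over just the selected experts (an illustrative simplification)."""
    top = sorted(range(NUM_EXPERTS), key=lambda i: token_logits[i], reverse=True)[:TOP_K]
    m = max(token_logits[i] for i in top)          # subtract max for stability
    exp = {i: math.exp(token_logits[i] - m) for i in top}
    z = sum(exp.values())
    return [(i, exp[i] / z) for i in top]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routed = route(logits)
active_experts = len(routed) + 1  # + the shared expert
```

This 9-of-257 activation pattern is why a 35B-parameter model can run with roughly 3B active parameters per token, which is what the "A3B" suffix refers to.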

Users on forums rave about local performance, reporting around 170 tokens/sec on a 5090+4090 setup with the full 262k context at Q8 quantization, calling it snappy for everyday agent work without the usual local-model quirks. It supports thinking mode by default (toggleable via API params like enable_thinking: false), preserves historical reasoning for iterative chats, and packs vision/video understanding too, scoring high on MMMU (81.7%) and RealWorldQA (85.3%). Compared to rivals like Gemma4-31B or Claude-Sonnet-4.5, it often leads in coding-agent benchmarks (e.g., SWE-bench Pro at 49.5% vs. 35.7% for Gemma4-31B) but holds steady rather than dominating everywhere; on MMLU-Pro, for instance, the open models all cluster around 85%.
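The thinking-mode toggle mentioned above can be sketched as a request to an OpenAI-compatible chat endpoint (such as one served by vLLM or SGLang). Passing enable_thinking through chat_template_kwargs is the convention Qwen models use with vLLM; the endpoint URL and exact field names here are assumptions, not confirmed for this release:

```python
import json

# Request body for an OpenAI-compatible chat endpoint (e.g. vLLM/SGLang).
# `chat_template_kwargs.enable_thinking` is the toggle the article mentions;
# the model id and URL below are illustrative.
payload = {
    "model": "Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Refactor this function to be iterative."}],
    "chat_template_kwargs": {"enable_thinking": False},  # skip the reasoning trace
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with any HTTP client.
```

Leaving enable_thinking at its default (true) keeps the model's reasoning trace in responses, which is what lets iterative chats build on prior reasoning.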

| Benchmark | Qwen3.6-35B-A3B | Qwen3.5-35B-A3B | Gemma4-31B | Claude-Sonnet-4.5 |
|---|---|---|---|---|
| SWE-bench Verified | 73.4% | 70.0% | 68.2% | 72.1% |
| SWE-bench Pro | 49.5% | 46.8% | 35.7% | 51.2% |
| Terminal-Bench 2.0 | 51.5% | 48.3% | 47.1% | 53.4% |
| Claw-Eval (avg) | 68.7% | 65.2% | 64.9% | 70.3% |
| MMLU-Pro | 85.2% | 84.8% | 86.1% | 88.7% |
| MMMU (vision) | 81.7% | 79.4% | 78.6% | 82.9% |
| RealWorldQA | 85.3% | 83.1% | 82.4% | 86.5% |
| Video VQA | 83.7% | 81.2% | N/A | 84.1% |

Deployment is straightforward with vLLM, SGLang, or Transformers, with optimizations for multi-token prediction and tool calling via Qwen-Agent. It's not flawless: peak speed calls for 8-GPU tensor parallelism, and long contexts demand careful memory tuning to avoid OOM errors. Overall, it raises the bar for local AI among coders and agent builders, balancing power with runnability better than many mid-sized peers, though closed frontier models still edge it out on some specialized evals.
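The context-stretching arithmetic is simple: YaRN scales RoPE positions by the ratio of the target window to the native window. A quick sketch, using the article's 262k native window; the rope_scaling field names follow the usual Hugging Face config.json convention for YaRN and are an assumption, not confirmed for this release:

```python
NATIVE_CTX = 262_144    # 256k tokens, the native window from the article
TARGET_CTX = 1_048_576  # ~1M tokens, the stretched window

# YaRN rescales rotary positions by target/native, so factor = 4.0 here.
factor = TARGET_CTX / NATIVE_CTX

# rope_scaling fragment in the common Hugging Face config.json shape
# (field names are the usual YaRN ones, assumed rather than confirmed):
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CTX,
}
```

Note that stretching the window only changes position encoding; the KV-cache memory cost still grows linearly with context length, which is why the long-context memory tuning mentioned above matters.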

References
1. qwen.ai (Qwen)
2. github.com (GitHub)
3. huggingface.co (Hugging Face)
4. modelscope.cn (ModelScope)