Exploring Qwen3.6: Coding Benchmarks and Speed

Łukasz Grochal
Source: Alibaba | Qwen

Qwen3.6-35B-A3B just dropped as the first open-weight release in the series, focusing on real-world stability after community feedback on earlier Qwen3.5 versions. Built with a hybrid setup of Gated DeltaNet and MoE layers (256 experts, 8 routed plus 1 shared), it handles massive contexts up to 262k tokens natively, stretchable to over a million with tricks like YaRN. Key upgrades hit agentic coding hard: it shines in benchmarks like SWE-bench Verified at 73.4% (edging out Qwen3.5-35B-A3B's 70%), Terminal-Bench 2.0 at 51.5%, and Claw-Eval avg at 68.7%, showing better repo-level reasoning and frontend tasks.
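To make the expert layout concrete, here is a minimal sketch of top-k MoE routing using the numbers above (256 routed experts, 8 activated per token, plus 1 always-on shared expert). The routing code is illustrative only, not the actual Qwen implementation:

```python
import math
import random

NUM_EXPERTS = 256   # routed experts, per the article
TOP_K = 8           # routed experts activated per token
# One shared expert is always active on top of the routed ones.

def route(token_logits):
    """Select the TOP_K experts with the highest router logits and
    return (expert_id, weight) pairs, with weights softmax-normalized
    over just the selected experts (an illustrative simplification)."""
    top = sorted(range(NUM_EXPERTS), key=lambda i: token_logits[i], reverse=True)[:TOP_K]
    m = max(token_logits[i] for i in top)          # subtract max for stability
    exp = {i: math.exp(token_logits[i] - m) for i in top}
    z = sum(exp.values())
    return [(i, exp[i] / z) for i in top]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routed = route(logits)
active_experts = len(routed) + 1  # + the shared expert
```

This 9-of-257 activation pattern is why a 35B-parameter model can run with roughly 3B active parameters per token, which is what the "A3B" suffix refers to.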

Users on forums rave about local performance, reporting around 170 tokens/sec on a 5090+4090 setup with the full 262k context at Q8 quantization, calling it snappy for everyday agent work without the usual local-model quirks. It supports thinking mode by default (toggleable via API params like enable_thinking: false), preserves historical reasoning for iterative chats, and packs vision/video understanding too, scoring high on MMMU (81.7%) and RealWorldQA (85.3%). Compared to rivals like Gemma4-31B or Claude-Sonnet-4.5, it often leads in coding-agent benchmarks (e.g., SWE-bench Pro at 49.5% vs. 35.7% for Gemma4-31B) but holds steady rather than dominating everywhere; on MMLU-Pro, for instance, the open models all cluster around 85%.
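The thinking-mode toggle mentioned above can be sketched as a request to an OpenAI-compatible chat endpoint (such as one served by vLLM or SGLang). Passing enable_thinking through chat_template_kwargs is the convention Qwen models use with vLLM; the endpoint URL and exact field names here are assumptions, not confirmed for this release:

```python
import json

# Request body for an OpenAI-compatible chat endpoint (e.g. vLLM/SGLang).
# `chat_template_kwargs.enable_thinking` is the toggle the article mentions;
# the model id and URL below are illustrative.
payload = {
    "model": "Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Refactor this function to be iterative."}],
    "chat_template_kwargs": {"enable_thinking": False},  # skip the reasoning trace
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with any HTTP client.
```

Leaving enable_thinking at its default (true) keeps the model's reasoning trace in responses, which is what lets iterative chats build on prior reasoning.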

| Benchmark | Qwen3.6-35B-A3B | Qwen3.5-35B-A3B | Gemma4-31B | Claude-Sonnet-4.5 |
|---|---|---|---|---|
| SWE-bench Verified | 73.4% | 70.0% | 68.2% | 72.1% |
| SWE-bench Pro | 49.5% | 46.8% | 35.7% | 51.2% |
| Terminal-Bench 2.0 | 51.5% | 48.3% | 47.1% | 53.4% |
| Claw-Eval (avg) | 68.7% | 65.2% | 64.9% | 70.3% |
| MMLU-Pro | 85.2% | 84.8% | 86.1% | 88.7% |
| MMMU (vision) | 81.7% | 79.4% | 78.6% | 82.9% |
| RealWorldQA | 85.3% | 83.1% | 82.4% | 86.5% |
| Video VQA | 83.7% | 81.2% | N/A | 84.1% |

Deployment is straightforward with vLLM, SGLang, or Transformers, with optimizations for multi-token prediction and tool calling via Qwen-Agent. It's not flawless: peak speed calls for 8-GPU tensor parallelism, and long contexts demand careful memory tuning to avoid OOM errors. Overall, it raises the bar for local AI among coders and agent builders, balancing power with runnability better than many mid-sized peers, though closed frontier models still edge it out on some specialized evals.
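The context-stretching arithmetic is simple: YaRN scales RoPE positions by the ratio of the target window to the native window. A quick sketch, using the article's 262k native window; the rope_scaling field names follow the usual Hugging Face config.json convention for YaRN and are an assumption, not confirmed for this release:

```python
NATIVE_CTX = 262_144    # 256k tokens, the native window from the article
TARGET_CTX = 1_048_576  # ~1M tokens, the stretched window

# YaRN rescales rotary positions by target/native, so factor = 4.0 here.
factor = TARGET_CTX / NATIVE_CTX

# rope_scaling fragment in the common Hugging Face config.json shape
# (field names are the usual YaRN ones, assumed rather than confirmed):
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": NATIVE_CTX,
}
```

Note that stretching the window only changes position encoding; the KV-cache memory cost still grows linearly with context length, which is why the long-context memory tuning mentioned above matters.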

References
1. qwen.ai (Qwen)
2. github.com (GitHub)
3. huggingface.co (Hugging Face)
4. modelscope.cn (ModelScope)