Qwen3.5 Medium Matches Sonnet 4.5 on Local Hardware

Łukasz Grochal

Alibaba's Qwen team recently released the Qwen3.5-Medium model series, built for efficiency around Mixture-of-Experts (MoE) designs that activate only a fraction of total parameters during inference. Key variants include Qwen3.5-35B-A3B (35B total, 3B active), Qwen3.5-27B (dense), Qwen3.5-122B-A10B (122B total, 10B active), and a production-tuned Qwen3.5-Flash based on the 35B. All are fully open-source on Hugging Face, allowing free downloads, fine-tuning, and local runs, unlike closed models from Anthropic or OpenAI.
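To make the "only a fraction of parameters active" idea concrete, here is a minimal toy sketch of MoE top-k routing in NumPy. This is an illustration of the general technique, not Qwen's actual router (whose gating details are not described here); all shapes and names are invented for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy MoE layer: route one token's activation to its top-k experts.

    x:       (hidden,) activation for a single token
    gate_w:  (hidden, n_experts) router weights
    experts: list of (hidden, hidden) expert weight matrices
    """
    logits = x @ gate_w                       # router score per expert
    top = np.argsort(logits)[-top_k:]         # pick the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over the selected few
    # Only the chosen experts run; the others cost nothing this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
hidden, n_experts = 8, 16
x = rng.standard_normal(hidden)
gate_w = rng.standard_normal((hidden, n_experts))
experts = [rng.standard_normal((hidden, hidden)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, top_k=2)  # 2 of 16 experts execute
```

In a model like the 35B-A3B, the same principle means roughly 3B of 35B parameters do work per token, which is where the decode-speed and memory-bandwidth savings come from.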

What's new? The models pair hybrid Gated Delta Networks with standard attention for faster decoding and lower memory use, and add native tool calling, agentic workflows, and a default 1M-token context (up to 262k recommended in practice to avoid out-of-memory errors). Benchmarks show the 35B-A3B outperforming the prior Qwen3-235B-A22B, and even some multimodal siblings, on tests like LiveCodeBench, AIME26, and reasoning evals. It comes close to or matches Claude Sonnet 4.5 in areas like GPQA and beats it in others like MMMLU equivalents, though Sonnet 4.5 still leads on some coding benchmarks such as SWE-Bench (55% pass@5).
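The efficiency gain over the 235B-A22B predecessor can be seen with back-of-envelope arithmetic: per-token decode compute scales roughly with active parameters (about 2 FLOPs per active parameter per token, a common rule of thumb that ignores attention and KV-cache costs).

```python
# Rough decode cost: ~2 FLOPs (multiply + add) per *active* parameter per token.
def decode_flops_per_token(active_params_b):
    return 2 * active_params_b * 1e9

qwen35_a3b = decode_flops_per_token(3)    # Qwen3.5-35B-A3B: 3B active
qwen3_a22b = decode_flops_per_token(22)   # Qwen3-235B-A22B: 22B active
print(f"relative decode cost: {qwen3_a22b / qwen35_a3b:.1f}x")  # ~7.3x
```

By this crude measure the new 35B-A3B does roughly a seventh of the per-token work of its predecessor, which lines up with the throughput claims below.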

Compared to rivals, Qwen3.5-Medium sits in a "Goldilocks" zone: smaller than giants like Qwen3.5-397B-A17B (which needs 200+ GB even quantized) but smarter than many 7B/13B open models like Llama 3.1. Against Sonnet 4.5 (closed, API-only), it offers similar reasoning while being local-run friendly, with higher throughput on decent setups. Llama 4 or Mistral Large 2 may compete on speed, but Qwen shines in RL-tuned logic and long-context work without heavy RAG.

Hardware needs are modest for the medium tier: the 27B fits in ~18 GB RAM/VRAM (e.g., a single RTX 4090), the 35B-A3B in ~24 GB at Q4 quantization, and the 122B in ~70 GB. Speeds reach 25+ tok/s on a 24 GB GPU with MoE expert offload, far faster than 235B-class predecessors. Use vLLM/SGLang for production serving, or llama.cpp for local runs. Drawbacks? Quantized versions trade some accuracy, and full precision demands enterprise GPUs. Overall, it's a balanced step forward for open models, aimed at devs who want power without cloud costs.
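The memory figures above follow from simple weight arithmetic. A quick sketch of the estimate (the fixed overhead allowance is an assumption; real KV-cache usage grows with context length, so long-context runs need more):

```python
def est_vram_gb(total_params_b, bits_per_weight, overhead_gb=2.0):
    """Weight memory in GB: params * bits / 8, plus a flat allowance for
    KV cache and activations (assumed 2 GB; grows with context length)."""
    return total_params_b * bits_per_weight / 8 + overhead_gb

for name, params in [("Qwen3.5-27B", 27),
                     ("Qwen3.5-35B-A3B", 35),
                     ("Qwen3.5-122B-A10B", 122)]:
    print(f"{name}: ~{est_vram_gb(params, 4):.1f} GB at Q4")
```

This lands slightly under the article's quoted footprints, as expected: the quoted numbers leave extra headroom for KV cache at longer contexts.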
