20250728-Qwen3-235B-A22B-Thinking-2507

Original post summary

Qwen3-235B-A22B-Thinking-2507

The third Qwen model release this week, following Qwen3-235B-A22B-Instruct-2507 on Monday 21st and Qwen3-Coder-480B-A35B-Instruct on Tuesday 22nd.

Those two were both non-reasoning models - a change from the previous models in the Qwen 3 family which combined reasoning and non-reasoning in the same model, controlled by /think and /no_think tokens.
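As a reminder of how that combined mode worked, here is a minimal sketch assuming the Qwen3 chat template's documented `enable_thinking` flag and the `/no_think` message suffix; the model id and prompts are placeholders, not values from the post.

```python
# Sketch of the hybrid-model switch on the April Qwen3 release (assumption:
# the chat template honours an enable_thinking flag and a /no_think suffix,
# as described in Qwen's Qwen3 documentation).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

# Hard switch: ask the template to skip the reasoning block entirely.
prompt_off = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Soft switch: leave thinking enabled globally but suppress it for this turn.
prompt_soft = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24? /no_think"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt_off)
```

The new 2507 releases drop this toggle entirely: Instruct never thinks, Thinking always does.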

Today's model, Qwen3-235B-A22B-Thinking-2507 (also released as an FP8 variant), is their new thinking variant.

Qwen claim "state-of-the-art results among open-source thinking models" and have increased the context length to 262,144 tokens - a big jump from April's Qwen3-235B-A22B which was "32,768 natively and 131,072 tokens with YaRN".
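For reference, stretching the April release past its native 32,768 tokens meant a YaRN rope-scaling override; the snippet below is a sketch of that kind of configuration change, with field names following the common transformers/YaRN convention rather than verified against Qwen's exact instructions.

```python
# Sketch: the kind of YaRN rope-scaling override that stretched the April
# release from 32,768 to 131,072 tokens. Field names and values are
# assumptions based on the usual YaRN convention, not Qwen's exact recipe.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B")
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 32,768 * 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}
# The modified config would then be handed to whatever loads or serves the
# model. The new 2507 Thinking model advertises 262,144 tokens natively,
# so no such override should be needed there.
```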

Their own published benchmarks show comparable scores to DeepSeek-R1-0528, OpenAI's o3 and o4-mini, Gemini 2.5 Pro and Claude Opus 4 in thinking mode.

The new model is already available via OpenRouter.
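A minimal sketch of what a call against OpenRouter's OpenAI-compatible endpoint might look like; the model slug and generation parameters below are assumptions rather than values taken from the post.

```python
# Hedged sketch: querying the new thinking model through OpenRouter.
# The model slug and max_tokens budget are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b-thinking-2507",  # assumed OpenRouter slug
    messages=[
        {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"},
    ],
    max_tokens=32768,  # thinking models can burn many tokens before answering
)
print(response.choices[0].message.content)
```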

But how good is its pelican?

I tried it with "Generate an SVG of a pelican riding a bicycle" via OpenRouter, and it thought for 166 seconds - nearly three minutes! I have never seen a model think for that long. No wonder the documentation includes the following:

However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072 when possible.

Here's a copy of that thinking trace. It was really fun to scan through:

Seat at (200,200). The pelican's body will be:
- The main body: a rounded shape starting at (200,200) and going to about (250, 250) [but note: the pelican is sitting, so the body might be more upright?]
- Head: at (200, 180) [above the seat] and the beak extending forward to (280, 180) or so.

We'll design the pelican as:
- Head: a circle at (180, 170) with radius 15.
- Beak: a long triangle from (180,170) to (250,170) and then down to (250,180) and back? Actually, the beak is a long flat-bottomed triangle.
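Rendered literally, the head-and-beak plan in that trace corresponds to something like the fragment below; this is only an illustration of the coordinates the model was juggling, not the SVG it actually produced.

```python
# Illustration only: the head/beak/body plan quoted from the thinking trace,
# turned into an SVG fragment so the coordinates are easier to picture.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 400 300">
  <circle cx="180" cy="170" r="15" fill="white" stroke="black"/>          <!-- head -->
  <polygon points="180,170 250,170 250,180" fill="orange"/>               <!-- beak -->
  <ellipse cx="225" cy="225" rx="25" ry="25" fill="white" stroke="black"/> <!-- body -->
</svg>"""

with open("pelican_plan.svg", "w") as f:
    f.write(svg)
```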

The finished pelican? Not so great! I like the beak though:

Description by Claude Sonnet 4: Minimalist flat illustration featuring a white bird character with orange beak, a purple rectangular tablet or device, gray cloud-like shapes, two black "T" letters, colorful geometric elements including orange and teal triangular shapes, scattered orange and green dots across a light background, and a thin black line at the bottom

Via @Alibaba_Qwen

Tags: ai, generative-ai, llms, qwen, pelican-riding-a-bicycle, llm-reasoning, llm-release

[Original post link](https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/#atom-everything)

Further inferences and speculation

- **Separation of reasoning modes in the Qwen3 series**: Earlier Qwen3 models combined reasoning and non-reasoning behaviour in a single model, toggled with the `/think` and `/no_think` tokens; the newly released Qwen3-235B-A22B-Thinking-2507 and Qwen3-Coder-480B-A35B-Instruct split the two apart, possibly to optimize performance or reduce cost.
- **Potential problems with very long reasoning times**: In the SVG test the model thought for 166 seconds (nearly three minutes), far beyond typical LLM response times. The official documentation recommends a context length above 131,072 tokens to leave room for reasoning, which hints that long reasoning traces can significantly affect efficiency; real deployments will need to trade latency against quality.
- **Implicit optimization in the FP8 variant**: The model also ships as an FP8 (8-bit floating point) version, which usually reduces compute and memory requirements at some possible cost in precision. This suggests Qwen is prioritizing cost-effective hardware adaptation and commercial deployment for resource-constrained settings; a hedged serving sketch follows this list.
- **"Selective benchmarking" in the comparisons**: Qwen claims state-of-the-art results among open-source reasoning models, but only compares against specific versions such as DeepSeek-R1, OpenAI o3/o4-mini, Gemini 2.5 Pro and Claude Opus 4, without mentioning stronger closed-source models (such as GPT-5 or Claude 5); this may reflect deliberate choices in how the results are presented.
- **Rapid availability on OpenRouter**: The model appeared on OpenRouter immediately after release, suggesting the Qwen team has close cooperation or a pre-integration pipeline with third-party platforms, perhaps using such channels to collect early user feedback or trial commercialization.
- **The gap between reasoning and output quality**: Although the model produced detailed step-by-step thinking for the SVG task (coordinate calculations, shape decomposition), the final pelican-riding-a-bicycle image was poor (described by Claude as "minimalist"), exposing a mismatch between reasoning and generation ability; beware the trap of overthinking a low-quality output.
- **The cost of the context-length jump**: Context length jumps from 32,768/131,072 tokens straight to 262,144 tokens, but no figures are given for hardware requirements or added latency, implying a heavy dependence on high-end hardware; ordinary developers should evaluate deployment costs carefully.
- **The hidden significance of the "pelican riding a bicycle" test**: The author keeps reusing this task to test models, possibly because it combines spatial reasoning, creative generation and fine-grained control, making it an informal industry "stress test" used to expose a model's limits.
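As a rough illustration of the FP8 point above, here is a minimal sketch of how the FP8 checkpoint might be served with vLLM; the repository name `Qwen/Qwen3-235B-A22B-Thinking-2507-FP8`, the GPU count and the context length are assumptions, not verified deployment guidance.

```python
# Hedged sketch: loading the assumed FP8 checkpoint with vLLM's offline API.
# Model id, tensor_parallel_size and max_model_len are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507-FP8",  # assumed HF repo name
    tensor_parallel_size=8,   # a 235B MoE still needs a multi-GPU node
    max_model_len=262144,     # the advertised native context length
)

params = SamplingParams(temperature=0.6, max_tokens=32768)
outputs = llm.generate(["Generate an SVG of a pelican riding a bicycle"], params)
print(outputs[0].outputs[0].text[:500])
```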