20250728-Qwen3-235B-A22B-Thinking-2507

Original post summary

Qwen3-235B-A22B-Thinking-2507

The third Qwen model release this week, following Qwen3-235B-A22B-Instruct-2507 on Monday 21st and Qwen3-Coder-480B-A35B-Instruct on Tuesday 22nd.

Those two were both non-reasoning models - a change from the previous models in the Qwen 3 family which combined reasoning and non-reasoning in the same model, controlled by /think and /no_think tokens.
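As a reminder of how that combined mode worked, here is a minimal sketch assuming the Qwen3 chat template's documented `enable_thinking` flag and the `/no_think` message suffix; the model id and prompts are placeholders, not values from the post.

```python
# Sketch of the hybrid-model switch on the April Qwen3 release (assumption:
# the chat template honours an enable_thinking flag and a /no_think suffix,
# as described in Qwen's Qwen3 documentation).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

# Hard switch: ask the template to skip the reasoning block entirely.
prompt_off = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Soft switch: leave thinking enabled globally but suppress it for this turn.
prompt_soft = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24? /no_think"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt_off)
```

The new 2507 releases drop this toggle entirely: Instruct never thinks, Thinking always does.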

Today's model, Qwen3-235B-A22B-Thinking-2507 (also released as an FP8 variant), is their new thinking variant.

Qwen claim "state-of-the-art results among open-source thinking models" and have increased the context length to 262,144 tokens - a big jump from April's Qwen3-235B-A22B which was "32,768 natively and 131,072 tokens with YaRN".
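For reference, stretching the April release past its native 32,768 tokens meant a YaRN rope-scaling override; the snippet below is a sketch of that kind of configuration change, with field names following the common transformers/YaRN convention rather than verified against Qwen's exact instructions.

```python
# Sketch: the kind of YaRN rope-scaling override that stretched the April
# release from 32,768 to 131,072 tokens. Field names and values are
# assumptions based on the usual YaRN convention, not Qwen's exact recipe.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B")
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 32,768 * 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}
# The modified config would then be handed to whatever loads or serves the
# model. The new 2507 Thinking model advertises 262,144 tokens natively,
# so no such override should be needed there.
```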

Their own published benchmarks show comparable scores to DeepSeek-R1-0528, OpenAI's o3 and o4-mini, Gemini 2.5 Pro and Claude Opus 4 in thinking mode.

The new model is already available via OpenRouter.
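A minimal sketch of what a call against OpenRouter's OpenAI-compatible endpoint might look like; the model slug and generation parameters below are assumptions rather than values taken from the post.

```python
# Hedged sketch: querying the new thinking model through OpenRouter.
# The model slug and max_tokens budget are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="qwen/qwen3-235b-a22b-thinking-2507",  # assumed OpenRouter slug
    messages=[
        {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"},
    ],
    max_tokens=32768,  # thinking models can burn many tokens before answering
)
print(response.choices[0].message.content)
```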

But how good is its pelican?

I tried it with "Generate an SVG of a pelican riding a bicycle" via OpenRouter, and it thought for 166 seconds - nearly three minutes! I have never seen a model think for that long. No wonder the documentation includes the following:

However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072 when possible.

Here's a copy of that thinking trace. It was really fun to scan through:

Seat at (200,200). The pelican's body will be:
- The main body: a rounded shape starting at (200,200) and going to about (250, 250) [but note: the pelican is sitting, so the body might be more upright?]
- Head: at (200, 180) [above the seat] and the beak extending forward to (280, 180) or so.

We'll design the pelican as:
- Head: a circle at (180, 170) with radius 15.
- Beak: a long triangle from (180,170) to (250,170) and then down to (250,180) and back? Actually, the beak is a long flat-bottomed triangle.
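Rendered literally, the head-and-beak plan in that trace corresponds to something like the fragment below; this is only an illustration of the coordinates the model was juggling, not the SVG it actually produced.

```python
# Illustration only: the head/beak/body plan quoted from the thinking trace,
# turned into an SVG fragment so the coordinates are easier to picture.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 400 300">
  <circle cx="180" cy="170" r="15" fill="white" stroke="black"/>          <!-- head -->
  <polygon points="180,170 250,170 250,180" fill="orange"/>               <!-- beak -->
  <ellipse cx="225" cy="225" rx="25" ry="25" fill="white" stroke="black"/> <!-- body -->
</svg>"""

with open("pelican_plan.svg", "w") as f:
    f.write(svg)
```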

The finished pelican? Not so great! I like the beak though:

Description by Claude Sonnet 4: Minimalist flat illustration featuring a white bird character with orange beak, a purple rectangular tablet or device, gray cloud-like shapes, two black "T" letters, colorful geometric elements including orange and teal triangular shapes, scattered orange and green dots across a light background, and a thin black line at the bottom

Via @Alibaba_Qwen

Tags: ai, generative-ai, llms, qwen, pelican-riding-a-bicycle, llm-reasoning, llm-release

[Original post link](https://simonwillison.net/2025/Jul/25/qwen3-235b-a22b-thinking-2507/#atom-everything)

Further inferences and speculation

- **Separation of reasoning modes in the Qwen3 series**: Earlier Qwen3 models combined reasoning and non-reasoning behaviour in a single model, toggled with the `/think` and `/no_think` tokens; the newly released Qwen3-235B-A22B-Thinking-2507 and Qwen3-Coder-480B-A35B-Instruct split the two apart, possibly to optimize performance or reduce cost.
- **Potential problems with very long reasoning times**: In the SVG test the model thought for 166 seconds (nearly three minutes), far beyond typical LLM response times. The official documentation recommends a context length above 131,072 tokens to leave room for reasoning, which hints that long reasoning traces can significantly affect efficiency; real deployments will need to trade latency against quality.
- **Implicit optimization in the FP8 variant**: The model also ships as an FP8 (8-bit floating point) version, which usually reduces compute and memory requirements at some possible cost in precision. This suggests Qwen is prioritizing cost-effective hardware adaptation and commercial deployment for resource-constrained settings; a hedged serving sketch follows this list.
- **"Selective benchmarking" in the comparisons**: Qwen claims state-of-the-art results among open-source reasoning models, but only compares against specific versions such as DeepSeek-R1, OpenAI o3/o4-mini, Gemini 2.5 Pro and Claude Opus 4, without mentioning stronger closed-source models (such as GPT-5 or Claude 5); this may reflect deliberate choices in how the results are presented.
- **Rapid availability on OpenRouter**: The model appeared on OpenRouter immediately after release, suggesting the Qwen team has close cooperation or a pre-integration pipeline with third-party platforms, perhaps using such channels to collect early user feedback or trial commercialization.
- **The gap between reasoning and output quality**: Although the model produced detailed step-by-step thinking for the SVG task (coordinate calculations, shape decomposition), the final pelican-riding-a-bicycle image was poor (described by Claude as "minimalist"), exposing a mismatch between reasoning and generation ability; beware the trap of overthinking a low-quality output.
- **The cost of the context-length jump**: Context length jumps from 32,768/131,072 tokens straight to 262,144 tokens, but no figures are given for hardware requirements or added latency, implying a heavy dependence on high-end hardware; ordinary developers should evaluate deployment costs carefully.
- **The hidden significance of the "pelican riding a bicycle" test**: The author keeps reusing this task to test models, possibly because it combines spatial reasoning, creative generation and fine-grained control, making it an informal industry "stress test" used to expose a model's limits.
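As a rough illustration of the FP8 point above, here is a minimal sketch of how the FP8 checkpoint might be served with vLLM; the repository name `Qwen/Qwen3-235B-A22B-Thinking-2507-FP8`, the GPU count and the context length are assumptions, not verified deployment guidance.

```python
# Hedged sketch: loading the assumed FP8 checkpoint with vLLM's offline API.
# Model id, tensor_parallel_size and max_model_len are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Thinking-2507-FP8",  # assumed HF repo name
    tensor_parallel_size=8,   # a 235B MoE still needs a multi-GPU node
    max_model_len=262144,     # the advertised native context length
)

params = SamplingParams(temperature=0.6, max_tokens=32768)
outputs = llm.generate(["Generate an SVG of a pelican riding a bicycle"], params)
print(outputs[0].outputs[0].text[:500])
```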