20250731-GLM-4.5_Reasoning,_Coding,_and_Agentic_Abilities

Original article summary

GLM-4.5: Reasoning, Coding, and Agentic Abilities

Another day, another significant new open weight model release from a Chinese frontier AI lab.

This time it's Z.ai - who rebranded (at least in English) from Zhipu AI a few months ago. They just dropped GLM-4.5-Base, GLM-4.5 and GLM-4.5 Air on Hugging Face, all under an MIT license.

These are MoE hybrid reasoning models with thinking and non-thinking modes, similar to Qwen 3. GLM-4.5 has 355 billion total parameters with 32 billion active; GLM-4.5-Air has 106 billion total parameters with 12 billion active.
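As a rough sketch of what that total/active split means in practice: memory footprint scales with total parameters (every expert must be resident), while per-token compute scales with the active subset. The bytes-per-parameter figures below are my assumptions for common quantization levels, not anything Z.ai published:

```python
# Back-of-envelope MoE sizing. Memory scales with *total* parameters
# (all experts loaded), per-token compute with *active* ones.
MODELS = {
    "GLM-4.5":     {"total_b": 355, "active_b": 32},
    "GLM-4.5-Air": {"total_b": 106, "active_b": 12},
}

# Assumed weight sizes for common quantization levels.
BYTES_PER_PARAM = {"fp16": 2.0, "8-bit": 1.0, "4-bit": 0.5}

for name, p in MODELS.items():
    sizes = ", ".join(
        f"{quant}: ~{p['total_b'] * bpp:.0f} GB"
        for quant, bpp in BYTES_PER_PARAM.items()
    )
    print(f"{name} ({p['active_b']}B active per token) -> {sizes}")
```

The 4-bit estimate for Air (~53 GB of weights) lines up with the 128GB Mac runs mentioned later in this post.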

They started using MIT a few months ago for their GLM-4-0414 models - their older releases used a janky non-open-source custom license.

Z.ai's own benchmarking (across 12 common benchmarks) ranked their GLM-4.5 3rd behind o3 and Grok-4 and just ahead of Claude Opus 4. They ranked GLM-4.5 Air 6th place just ahead of Claude 4 Sonnet. I haven't seen any independent benchmarks yet.

The other models they included in their own benchmarks were o4-mini (high), Gemini 2.5 Pro, Qwen3-235B-Thinking-2507, DeepSeek-R1-0528, Kimi K2, GPT-4.1, and DeepSeek-V3-0324. Notably absent: any of Meta's Llama models, or any of Mistral's. Did they deliberately compare themselves only to open weight models from other Chinese AI labs?

Both models have a 128,000 context length and are trained for tool calling, which honestly feels like table stakes for any model released in 2025 at this point.
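For concreteness, here's a minimal sketch of what exercising that tool calling looks like through an OpenAI-compatible chat completions API. The base URL, model id, and the get_weather function are all placeholders I made up; check Z.ai's API documentation for the real endpoint and model names:

```python
from openai import OpenAI

# Placeholder endpoint and key - substitute Z.ai's real API details.
client = OpenAI(base_url="https://example.invalid/v1", api_key="YOUR_KEY")

# A made-up tool definition, in the standard JSON-schema format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.5",  # assumed model id
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)
# If the model decides a tool is needed, the call arrives as structured
# JSON rather than prose - that's what "trained for tool calling" buys.
print(response.choices[0].message.tool_calls)
```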

It's interesting to see them use Claude Code to run their own coding benchmarks:

To assess GLM-4.5's agentic coding capabilities, we utilized Claude Code to evaluate performance against Claude-4-Sonnet, Kimi K2, and Qwen3-Coder across 52 coding tasks spanning frontend development, tool development, data analysis, testing, and algorithm implementation. [...] The empirical results demonstrate that GLM-4.5 achieves a 53.9% win rate against Kimi K2 and exhibits dominant performance over Qwen3-Coder with an 80.8% success rate. While GLM-4.5 shows competitive performance, further optimization opportunities remain when compared to Claude-4-Sonnet.

They published the dataset for that benchmark as zai-org/CC-Bench-trajectories on Hugging Face. I think they're using the word "trajectory" for what I would call a chat transcript.
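If you want to inspect those trajectories yourself, a snapshot download of the repo is the safest route; I haven't checked whether it also loads cleanly via datasets.load_dataset. A sketch:

```python
from huggingface_hub import snapshot_download

# Pull the raw benchmark files locally; repo_type="dataset" because
# zai-org/CC-Bench-trajectories lives on the Hub's datasets side.
local_dir = snapshot_download(
    repo_id="zai-org/CC-Bench-trajectories",
    repo_type="dataset",
)
print(local_dir)  # browse the files to see the transcript format
```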

Unlike DeepSeek-V3 and Kimi K2, we reduce the width (hidden dimension and number of routed experts) of the model while increasing the height (number of layers), as we found that deeper models exhibit better reasoning capacity.
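To make that width-versus-depth trade concrete, here's a toy parameter count. It uses the standard dense-transformer approximation of ~12·d² parameters per layer (attention plus a 4× MLP), which ignores MoE routing entirely, and the two configurations are invented for illustration, not GLM-4.5's actual shapes:

```python
# Rough per-layer cost for a dense transformer block:
# attention ~4*d^2, feed-forward with 4x expansion ~8*d^2 => ~12*d^2.
def approx_params(d_model: int, n_layers: int) -> float:
    return 12 * d_model**2 * n_layers

# Two hypothetical configs with a similar parameter budget:
wide_shallow = approx_params(d_model=8192, n_layers=60)   # ~48.3B
deep_narrow  = approx_params(d_model=6144, n_layers=107)  # ~48.5B

print(f"wide/shallow: {wide_shallow / 1e9:.1f}B params")
print(f"deep/narrow:  {deep_narrow / 1e9:.1f}B params")
# Same budget, very different depth - Z.ai's claim is that the
# deeper variant reasons better.
```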

They pre-trained on 15 trillion tokens, then an additional 7 trillion for code and reasoning:

Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.

They also open sourced their post-training reinforcement learning harness, which they've called slime. That's available at THUDM/slime on GitHub - THUDM is the Knowledge Engineering Group @ Tsinghua University, the university from which Zhipu AI spun out as an independent company.

This time I ran my pelican benchmark using the chat.z.ai chat interface, which offers free access (no account required) to both GLM 4.5 and GLM 4.5 Air. I had reasoning enabled for both.

Here's what I got for "Generate an SVG of a pelican riding a bicycle" on GLM 4.5. I like how the pelican has its wings on the handlebars:

Description by Claude Sonnet 4: This is a whimsical illustration of a white duck or goose riding a red bicycle. The bird has an orange beak and is positioned on the bike seat, with its orange webbed feet gripping what appears to be chopsticks or utensils near the handlebars. The bicycle has a simple red frame with two wheels, and there are motion lines behind it suggesting movement. The background is a soft blue-gray color, giving the image a clean, minimalist cartoon style. The overall design has a playful, humorous quality to it.

And GLM 4.5 Air:

Description by Claude Sonnet 4: This image shows a cute, minimalist illustration of a snowman riding a bicycle. The snowman has a simple design with a round white body, small black dot for an eye, and an orange rectangular nose (likely representing a carrot). The snowman appears to be in motion on a black bicycle with two wheels, with small orange arrows near the pedals suggesting movement. There are curved lines on either side of the image indicating motion or wind. The overall style is clean and whimsical, using a limited color palette of white, black, orange, and gray against a light background.

Ivan Fioravanti shared a video of the mlx-community/GLM-4.5-Air-4bit quantized model running on a M4 Mac with 128GB of RAM, and it looks like a very strong contender for a local model that can write useful code. The cheapest 128GB Mac Studio costs around $3,500 right now, so genuinely great open weight coding models are creeping closer to being affordable on consumer machines.

Update: Ivan released a 3 bit quantized version of GLM-4.5 Air which runs using 48GB of RAM on my laptop. I tried it and was really impressed, see My 2.5 year old laptop can write Space Invaders in JavaScript now.
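For anyone wanting to try the same thing, here's a minimal mlx-lm sketch for the 4-bit Air quantization (Apple Silicon only; the weights alone are roughly 53 GB, so you need the unified memory to match). The prompt is just an example:

```python
# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-4.5-Air-4bit")

# Wrap the request in the model's chat template before generating.
messages = [{"role": "user", "content":
             "Write Space Invaders in JavaScript as a single HTML file."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=2048))
```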

Tags: ai, generative-ai, local-llms, llms, mlx, pelican-riding-a-bicycle, llm-reasoning, llm-release

[Original article](https://simonwillison.net/2025/Jul/28/glm-45/#atom-everything)

Further speculation

- **Architecture design insight**: The GLM-4.5 team found through experiment that reducing model width (hidden dimension and number of routed experts) while increasing depth (number of layers) significantly improves reasoning capacity. That diverges from the DeepSeek-V3 and Kimi K2 designs and reads like a hard-won practical conclusion.
- **Training data strategy**: Pre-training runs in two stages: 15 trillion tokens of general corpus first, then 7 trillion tokens of code and reasoning data. This staged, targeted training may be the core lever behind the performance gains, but the corpus composition and cleaning methods are undisclosed.
- **Selective benchmark comparisons**: The open weight models in the official comparison all come from Chinese labs (Qwen3, DeepSeek, etc.), with Meta's Llama and Mistral's models conspicuously avoided, which may hint at a deliberate evaluation slant or at positioning against domestic competitors.
- **Claude Code as harness**: Choosing Claude Code as the evaluation harness rather than an open benchmark such as HumanEval may be because it handles complex agentic tasks better, but there is no word on whether Claude's framework confers some specific advantage or whether any partnership is involved.
- **Open-sourced RL tooling**: The released slime reinforcement learning framework (from the Tsinghua team) was presumably the core internal tuning tool; design details such as reward-function setup and data-filtering rules may encode undocumented, industry-grade tricks.
- **Signals behind the rebrand**: Zhipu AI adopting the English name Z.ai, combined with the switch to an MIT license (replacing the earlier non-open-source terms), may reflect an internationalization push or fundraising/compliance needs, though no specific motive is stated.
- **Latent value of the "trajectory" dataset**: The published CC-Bench-trajectories dataset is essentially a set of chat transcripts, likely containing real interaction data such as task decomposition and error recovery. Raw data of this kind usually has to be paid for and is a scarce resource for studying agent behavior.
- **MoE details not fully transparent**: Total parameters (e.g. 355B) and active parameters (32B) are published, but the expert-assignment strategy and sparsity-control methods are not, and those details are critical for reproduction.
- **Competitive signaling**: Leaving Meta's and Mistral's models out of the comparison may stem from commercial considerations or from avoiding unflattering gaps; this kind of selective disclosure is common in the industry but rarely acknowledged outright.