20250731 - My 2.5 year old laptop can write Space Invaders in JavaScript

Summary of the original article

I wrote about the new GLM-4.5 model family yesterday - new open weight (MIT licensed) models from Z.ai in China which, according to their own benchmarks, score highly in coding even against models such as Claude Sonnet 4.

The models are pretty big - the smaller GLM-4.5 Air model is still 106 billion total parameters, which is 205.78GB on Hugging Face.

Ivan Fioravanti built this 44GB 3bit quantized version for MLX, specifically sized so people with 64GB machines could have a chance of running it. I tried it out... and it works extremely well.
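
A rough back-of-the-envelope check makes those two sizes line up: 106 billion parameters at 16 bits each is about 212GB, and at 3 bits it is about 40GB before quantization overhead, which is in the right neighbourhood of the 44GB download. A minimal sketch of that arithmetic (my own estimate, not from the original post):

# Rough size estimates for a 106B-parameter model at different precisions.
# Real checkpoints also store quantization scales and keep some tensors at
# higher precision, so actual file sizes land a bit above these figures.
params = 106e9

for bits in (16, 3):
    size_gb = params * bits / 8 / 1e9  # bits -> bytes -> decimal GB
    print(f"{bits}-bit: ~{size_gb:.0f} GB")

# 16-bit: ~212 GB  (the full-precision repo is listed at 205.78GB)
# 3-bit:  ~40 GB   (the MLX build comes to ~44GB with overhead)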

I fed it the following prompt:

Write an HTML and JavaScript page implementing space invaders

And it churned away for a while and produced the following:

Clearly this isn't a particularly novel example, but I still think it's noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this - especially code that worked first time with no further edits needed.

How I ran the model

I had to run it using the current main branch of the mlx-lm library (to ensure I had this commit adding glm4_moe support). I ran that using uv like this:

uv run \
  --with 'https://github.com/ml-explore/mlx-lm/archive/489e63376b963ac02b3b7223f778dbecc164716b.zip' \
  python

Then in that Python interpreter I used the standard recipe for running MLX models:

from mlx_lm import load, generate
model, tokenizer = load("mlx-community/GLM-4.5-Air-3bit")

That downloaded 44GB of model weights to my ~/.cache/huggingface/hub/models--mlx-community--GLM-4.5-Air-3bit folder.
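
That default cache location matters if your main drive is tight on space (the bf16 repo alone is over 200GB). One option, not something the original post does, is to point the Hugging Face cache at another disk via the HF_HOME environment variable before triggering any downloads - a minimal sketch, with a hypothetical external-drive path:

import os

# Redirect the Hugging Face cache (downloads land under $HF_HOME/hub).
# Must be set before huggingface_hub starts the download.
os.environ["HF_HOME"] = "/Volumes/External/hf-cache"  # hypothetical path

from mlx_lm import load
model, tokenizer = load("mlx-community/GLM-4.5-Air-3bit")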

Then:

prompt = "Write an HTML and JavaScript page implementing space invaders"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True
)
response = generate(
    model, tokenizer,
    prompt=prompt,
    verbose=True,
    max_tokens=8192
)

The response started like this:

<think>The user wants me to create a Space Invaders game using HTML, CSS, and JavaScript. I need to create a complete, functional game with the following features:

  1. Player spaceship that can move left/right and shoot
  2. Enemy invaders that move in formation and shoot back
  3. Score tracking
  4. Lives/health system
  5. Game over conditions [...]

Followed by the HTML and this debugging output:

Prompt: 14 tokens, 14.095 tokens-per-sec
Generation: 4193 tokens, 25.564 tokens-per-sec
Peak memory: 47.687 GB
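
For a sense of how long "churned away for a while" actually was: at roughly 25.5 tokens per second, the 4,193-token response took somewhere under three minutes (my own arithmetic from the reported numbers):

tokens, tokens_per_sec = 4193, 25.564
seconds = tokens / tokens_per_sec
print(f"~{seconds:.0f} seconds (~{seconds / 60:.1f} minutes)")
# ~164 seconds (~2.7 minutes)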

You can see the full transcript here, or view the source on GitHub, or try it out in your browser.
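
If you want the same single-file game out of a local run, the response variable above is a plain string containing the <think>...</think> reasoning followed by the page. Here's a hedged sketch (my own, not from the post) for stripping the reasoning and writing whatever remains to a file you can open in a browser:

import re
from pathlib import Path

# Drop the <think>...</think> reasoning block and any Markdown code fence
# the model may have wrapped around the page, then save the rest.
html = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
html = re.sub(r"^```(?:html)?\s*|\s*```$", "", html).strip()
Path("space_invaders.html").write_text(html)
print(f"Wrote {len(html)} characters to space_invaders.html")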

A pelican for good measure

I ran my pelican benchmark against the full sized models yesterday, but I couldn't resist trying it against this smaller 3bit model. Here's what I got for "Generate an SVG of a pelican riding a bicycle":

(Image description: blue background, the pelican looks a bit like a cloud, and the orange bicycle is recognizable as a bicycle if not quite the right geometry.)

Here's the transcript for that.

In both cases the model used around 48GB of RAM at peak, leaving me with just 16GB for everything else - I had to quit quite a few apps in order to get the model to run but the speed was pretty good once it got going.

Local coding models are really good now

It's interesting how almost every model released in 2025 has specifically targeted coding. That focus has clearly been paying off: these coding models are getting really good now.

Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I'm seeing from GLM 4.5 Air - and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.

Tags: python, ai, generative-ai, local-llms, llms, ai-assisted-programming, uv, mlx, pelican-riding-a-bicycle

Original article link

Further details and speculation

  • The quantized GLM-4.5 build performs remarkably well: the 3-bit GLM-4.5 Air model (44GB) runs smoothly on a 64GB M2 MacBook Pro and produced code that ran with no edits, which suggests quantization has made local deployment of open-weight models substantially more practical.
  • A hidden strength (and risk) of the MLX ecosystem: supporting GLM-4.5 required the main branch of mlx-lm (a specific commit), which points to how quickly MLX adapts to new models, but also to the instability of pinning an unreleased commit rather than a stable release.
  • Practical impact of the Hugging Face cache: weights download to ~/.cache/huggingface/hub/ by default, so anyone short on disk space (the 205GB original model, or even the 44GB quantized build) needs to plan storage ahead or point the cache somewhere else.
  • The competitive angle of open releases: Z.ai (a Chinese company) released GLM-4.5 under the MIT license while benchmarking it against closed commercial models such as Claude Sonnet 4 on coding tasks, which may reflect a strategy of winning developer mindshare through open weights.
  • The hardware bar for running large models locally: even quantized, the 44GB model still needs a 64GB-RAM machine, and in practice you should watch for the slowdown caused by memory swapping (a common pitfall the post doesn't mention; see the sketch after this list).
  • A dependency-management trick for mlx-lm: using uv to install a specific GitHub commit directly (rather than the PyPI release) shows how cutting-edge stacks often have to bypass the standard workflow to get the newest features.
  • Implicit prompt-engineering demands: the example produces a complete project from one plain natural-language instruction, but the post doesn't say whether parameters such as temperature were tuned; in practice it may take several attempts to reproduce the "worked first time" result.
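
On the memory-headroom point above, a hedged pre-flight sketch (not from the original post; it uses the third-party psutil package) for checking free RAM before loading a model that peaks around 48GB:

import psutil

required_gb = 48  # approximate peak memory reported in the post
available_gb = psutil.virtual_memory().available / 1e9
if available_gb < required_gb:
    print(f"Only {available_gb:.1f} GB free - quit some apps before loading the model.")
else:
    print(f"{available_gb:.1f} GB free - should be enough headroom.")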