We Asked 10 LLMs to Build the Same Game — Here's How Each One Did
We gave 10 different LLMs (local and cloud) the exact same prompt to build an AI dodge game in Phaser.js. We tracked tokens, code quality, bugs, and speed. The results surprised us.
What happens when 10 different language models get the same prompt to build a browser game? We ran the experiment.
The task: build a Phaser.js 3 game where the player is a blue circle dodging red AI enemies. Identical prompt, same game engine, zero human edits. We fed it to cloud APIs (DeepSeek V4, Mistral, OpenRouter) and local models (GPT-OSS-20B, Llama 3.1, Gemma 4, Qwen 3.5) and compared the results.
Playable Demos
Two versions are embedded below with their build stats. The remaining builds are linked after them.
| Metric | Value |
|---|---|
| Model | DeepSeek V4 Flash (API) |
| Provider | DeepSeek API |
| Model ID | deepseek-v4-flash |
| Temperature | 0.7 |
| Max Tokens | 8,192 |
| Tokens Used | 8,399 (207 in / 8,192 out) |
| Build Time | 68s |
| File Size | 14.8 KB |
| Cost | $0.0012 |
| Notes | Largest original output (15,487 chars, 429 lines) |
| Source | github.com/driphtyio/ai-dodge |
| Metric | Value |
|---|---|
| Model | openai/gpt-oss-20b (local) |
| Provider | Local — Apple Mac Mini M4 (16GB) |
| Model ID | openai/gpt-oss-20b |
| Temperature | 0.7 |
| Max Tokens | 8,192 |
| Tokens Used | 1,823 (269 in / 1,570 out) |
| Build Time | 23s |
| File Size | 14.8 KB |
| Cost | $0 (local) |
| Notes | Original output had player speed bug + double R key |
| Source | github.com/driphtyio/ai-dodge |
All other model builds are playable here:
- /games/llm-built/mistral-small — Mistral Small (cleanest output)
- /games/llm-built/owl-alpha — owl-alpha (largest free output)
- /games/llm-built/nemotron-3-ultra — Nemotron 3 Ultra
- /games/llm-built/local-llama-3.1-8b-local — Llama 3.1 8B
- /games/llm-built/local-qwen-3.5-9b — Qwen 3.5 9B
- /games/llm-built/local-gemma-4-12b-qat — Gemma-4-12b-qat
- /games/llm-built/local-gemma-4-12b-coder-fable5-composer2.5 — Gemma-4-12b-coder-fable
All games: Move with WASD/arrows, survive as long as you can, press R to restart.
The Lineup
We tested 10 models across cloud APIs and local inference:
| Tier | Models |
|---|---|
| Cloud APIs | DeepSeek V4 Flash, DeepSeek V4 Pro, Mistral Small, OpenRouter (owl-alpha, Nemotron 3 Ultra) |
| Local — Apple Mac Mini M4 (16GB) | openai/gpt-oss-20b, Llama 3.1 8B, Gemma-4-12b-qat, Qwen 3.5 9B, Gemma-4-12b-coder-fable |
| Blocked by content policy | Groq (70B + 8B), Cerebras (120B) — ironic given they’re great at playing games via API |
Key Findings
- DeepSeek V4 Flash (API, 0.7 temp) produced the largest output by far — 15,487 chars, 429 lines — but cost $0.0012 and took 68 seconds
- Mistral Small was the cleanest: zero bugs, 12 seconds, 6,869 chars, $0 — the smallest cloud model outperformed everyone on first-attempt quality
- openai/gpt-oss-20b (local 20B) built a playable game in 23s at $0 but had two bugs: player speed coupled to enemy speed, and double-bound R key
- DeepSeek V4 Pro (the “better” model) underperformed Flash — it spent its 8,192 output tokens on reasoning text before the code, producing only 50 lines of actual game (1,424 chars)
- Local models (Llama 3.1 8B, Gemma 4, Qwen 3.5 9B) all produced functional games but with tradeoffs: Gemma-4-12b-qat took 281 seconds (slowest), while Gemma-4-12b-coder-fable produced only 3,451 chars (the shortest output from a local model; only DeepSeek V4 Pro’s 1,424 chars was shorter)
- Groq and Cerebras both refused to generate game code entirely — error code 1010 (content policy) — yet their APIs work perfectly when used to play the game via the bot-brain system
- Total cost for all 10 API calls combined: $0.0032 — less than a penny for 10 working games
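The DeepSeek V4 Pro result suggests checking how much of a reply is actual game code rather than reasoning text. Here is a minimal sketch of one way to measure that. `extractHtml` and `codeShare` are hypothetical helpers, not necessarily the tooling behind the numbers above, and they assume the game arrives as a single HTML document somewhere inside the reply.

```javascript
// Hypothetical helper: pull the HTML document out of a raw model reply.
function extractHtml(raw) {
  // Start at the doctype if present, otherwise at the opening <html> tag.
  const start = raw.search(/<!doctype html|<html/i);
  if (start === -1) return "";

  // Find the end of the last closing </html> tag, if any.
  let end = -1;
  for (const match of raw.matchAll(/<\/html>/gi)) {
    end = match.index + match[0].length;
  }

  // A reply cut off by the token limit may have no closing tag; keep what exists.
  return end > start ? raw.slice(start, end) : raw.slice(start);
}

// Share of the reply that is game code, from 0 to 1.
function codeShare(raw) {
  return raw.length === 0 ? 0 : extractHtml(raw).length / raw.length;
}

const sample =
  "Let me think about the design first...\n" +
  "<!DOCTYPE html><html><body></body></html>\n" +
  "Done.";

console.log(extractHtml(sample));
console.log(codeShare(sample).toFixed(2));
```

A low share on a reply that hit the max_tokens ceiling is a hint that the budget went to reasoning rather than code.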
Results Table
| Model | Status | Time | Tokens | Lines | Chars | Cost |
|---|---|---|---|---|---|---|
| DeepSeek V4 Flash (API, 0.7) | ✅ PASS | 68s | 8,399 | 429 | 15,487 | $0.0012 |
| DeepSeek V4 Flash (0.4 temp) | ✅ PASS | 14s | 2,077 | 256 | 7,376 | $0.0005 |
| Mistral Small | ✅ PASS | 12s | 1,754 | 233 | 6,869 | $0 |
| owl-alpha (OpenRouter) | ✅ PASS | 53s | 2,024 | 241 | 8,272 | $0 |
| openai/gpt-oss-20b (local) | ✅ PASS | 23s | 1,823 | 216 | 4,849 | $0 |
| Llama 3.1 8B (local) | ✅ PASS | 55s | 1,220 | 153 | 5,183 | $0 |
| Gemma-4-12b-qat (local) | ✅ PASS | 281s | 3,563 | 182 | 5,410 | $0 |
| Qwen 3.5 9B (local) | ✅ PASS | 190s | 3,210 | 177 | 4,804 | $0 |
| Nemotron 3 Ultra (OpenRouter) | ✅ PASS | 86s | 2,015 | 132 | 5,323 | $0 |
| DeepSeek V4 Pro (API) | ✅ PASS | 101s | 8,399 | 50 | 1,424 | $0.0012 |
| Gemma-4-12b-coder-fable (local) | ✅ PASS | 112s | 1,463 | 107 | 3,451 | $0 |
| Lfm2.5-8b (local) | ⚠️ NO LOGIC | 128s | 8,369 | 551 | 25,966 | $0 |
| Vibethinker-3b (local) | ❌ OVERFLOW | 204s | 8,383 | 1 | 0 | — |
| Groq 70B / 8B | ❌ POLICY | 1s | — | — | — | — |
| Cerebras 120B | ❌ POLICY | 0.3s | — | — | — | — |
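One rough way to compare rows is to normalize output size by build time. The sketch below does that for a handful of rows copied from the table. `charsPerSecond` is a name made up for this example, and the metric is only a proxy: it says nothing about code quality, and for models that spend tokens on reasoning it undercounts the work actually done.

```javascript
// Rows copied from the results table: [model, seconds, chars].
const rows = [
  ["DeepSeek V4 Flash (API, 0.7)", 68, 15487],
  ["DeepSeek V4 Flash (0.4 temp)", 14, 7376],
  ["Mistral Small", 12, 6869],
  ["openai/gpt-oss-20b (local)", 23, 4849],
  ["Gemma-4-12b-qat (local)", 281, 5410],
];

// Characters of final output per second of wall-clock build time.
function charsPerSecond(seconds, chars) {
  return chars / seconds;
}

// Sort from highest to lowest throughput.
const ranked = rows
  .map(([model, seconds, chars]) => ({
    model,
    rate: charsPerSecond(seconds, chars),
  }))
  .sort((a, b) => b.rate - a.rate);

for (const { model, rate } of ranked) {
  console.log(`${model}: ${rate.toFixed(0)} chars/s`);
}
```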
Total Cost
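$0.0032: that’s the combined bill for all 10 successful calls. The local models cost $0 because they ran on our own machine, and the Mistral Small and OpenRouter runs also came in at $0, so DeepSeek was the only provider that charged anything. The whole benchmark cost less than a penny, which makes AI game prototyping effectively free.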
Bug Breakdown by Model
| Model | Bugs | Notes |
|---|---|---|
| DeepSeek V4 Flash (0.4 temp) | Score timer stacked on each restart; no edge wrapping; no speed scaling | Fixed in second prompt. At 0.7 temp / 8,192 tokens, produced 15,487 chars (429 lines) |
| DeepSeek V4 Pro | Spent 8,192 output tokens on reasoning text before code | Only 1,424 chars (50 lines) of actual game — Flash was more efficient for code gen |
| openai/gpt-oss-20b (local) | Player speed scaled with score; double R key listener; add.circle() shapes | Fast local model (23s). Shape-based collision less precise |
| Mistral Small | None | Cleanest output — zero bugs, 12s, all features working |
| owl-alpha (OpenRouter) | None reported | Largest free output (8,272 chars). Strong free-tier contender |
| Llama 3.1 8B (local) | Sparse game logic (5,183 chars) | Functional but minimal. Most reliable local fallback |
| Gemma-4-12b-coder-fable (local) | Shortest local output (3,451 chars, 107 lines) | Playable but missing some features |
| Gemma-4-12b-qat (local) | Took 281 seconds | Slowest model by far. Output adequate but not proportional to wait time |
| Qwen 3.5 9B (local) | Minor — functional first attempt (4,804 chars) | Best local model after GPT-OSS-20B |
| Nemotron 3 Ultra (OpenRouter) | Compact (132 lines) — may lack edge wrapping | Functional but minimal feature set |
| Lfm2.5-8b (local) | Produced 551 lines of HTML with zero game logic | Renders empty canvas. No generateTexture, keyboard, or game-over |
| Groq 70B (provider block) | Error 1010 — provider-level content safety filter | Llama 3.3 70B itself can generate game code (works via other providers). Groq’s API blocks any request containing game/HTML/JS generation keywords |
| Cerebras 120B (provider block) | Error 1010 — provider-level content safety filter | GPT-OSS-120B model itself works when accessed directly. Cerebras API blocks game code prompts the same way Groq does |
| Vibethinker-3b (3B) | Reasoning overflow at 204s, 8,383 tokens | Too small for this task (3B parameters) |
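Two of the bugs above, the stacked score timer and the double-bound R key, share a likely cause: a handler gets registered again on every restart and the old one is never removed. The sketch below reproduces that pattern in plain JavaScript. The `Emitter` class is a stand-in written for this example, not Phaser’s API, and the code is our reconstruction of the failure mode, not the models’ actual output.

```javascript
// Minimal stand-in for an event source such as a timer or a keyboard.
class Emitter {
  constructor() {
    this.handlers = {};
  }
  on(name, fn) {
    if (!this.handlers[name]) this.handlers[name] = [];
    this.handlers[name].push(fn);
  }
  off(name) {
    delete this.handlers[name];
  }
  emit(name) {
    for (const fn of this.handlers[name] || []) fn();
  }
}

// Buggy: every (re)start adds another tick handler and keeps the old one.
function startBuggy(clock, state) {
  state.score = 0;
  clock.on("tick", () => { state.score += 1; });
}

// Fixed: clear the previous handler before registering a new one.
function startFixed(clock, state) {
  state.score = 0;
  clock.off("tick");
  clock.on("tick", () => { state.score += 1; });
}

// Start a game, restart it once, then let one second pass.
function scoreAfterRestart(start) {
  const clock = new Emitter();
  const state = { score: 0 };
  start(clock, state); // first game
  start(clock, state); // player pressed R
  clock.emit("tick");  // one tick of the score timer
  return state.score;
}

console.log(scoreAfterRestart(startBuggy)); // 2: score advances twice per tick
console.log(scoreAfterRestart(startFixed)); // 1
```

The same reasoning applies to key bindings: if the restart handler itself is attached twice, one press of R triggers two restarts.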
What Each Model Did Best
| Model | Best At |
|---|---|
| DeepSeek V4 Flash (API) | Largest output — 429 lines, full feature set |
| Mistral Small | Cleanest output — no bugs, no edits needed |
| owl-alpha | Best free-tier output — 8,272 chars, no bugs |
| openai/gpt-oss-20b | Best local model: fast (23s), functional, free to run ($0) |
| Llama 3.1 8B | Most reliable local — always produces something functional |
| Qwen 3.5 9B | Best local code quality — clean structure, few bugs |
| Nemotron 3 Ultra | Most compact — 132 lines that all work |
| DeepSeek V4 Flash (0.4) | Fastest DeepSeek run: 14 seconds from prompt to playable game, second only to Mistral Small’s 12 |
The Prompt
In case you want to run your own benchmark:
“Create a complete, playable Phaser.js 3 game as a single HTML file. Player is a blue circle that moves with WASD/arrow keys. Red circles spawn from screen edges and chase the player with simple AI (seek behavior). Score increases by 1 every second. Collision with any red circle = game over. Press R to restart after game over. Enemies wrap around screen edges. Player wraps around screen edges too. Enemies speed up gradually as score increases. Player has a subtle glow/pulse effect. Dark theme. 680x480. Centered canvas with instructions below.”
Want to pit your favorite model against this? Drop it the same prompt and see if it beats Mistral Small’s clean record.
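For reference when judging a model’s output, here is a minimal sketch of the three mechanics the prompt asks for: seek behavior, edge wrapping, and speed scaling. It is plain JavaScript with no Phaser dependency. The function names and the speed constants (`base`, `perPoint`, `cap`) are assumptions chosen for illustration; the prompt only says enemies should “speed up gradually”.

```javascript
const WIDTH = 680;
const HEIGHT = 480;

// Seek: velocity pointing from the enemy toward the player at a given speed.
function seekVelocity(enemy, player, speed) {
  const dx = player.x - enemy.x;
  const dy = player.y - enemy.y;
  const dist = Math.hypot(dx, dy);
  if (dist === 0) return { vx: 0, vy: 0 }; // already on top of the player
  return { vx: (dx / dist) * speed, vy: (dy / dist) * speed };
}

// Wrap: leaving one edge re-enters from the opposite edge.
function wrap(value, max) {
  return ((value % max) + max) % max;
}

// Speed scaling: an illustrative linear ramp with a cap (constants are assumed).
function enemySpeed(score, base = 60, perPoint = 2, cap = 200) {
  return Math.min(base + score * perPoint, cap);
}

// Advance one enemy by dt seconds and return its new position.
function stepEnemy(enemy, player, score, dt) {
  const { vx, vy } = seekVelocity(enemy, player, enemySpeed(score));
  return {
    x: wrap(enemy.x + vx * dt, WIDTH),
    y: wrap(enemy.y + vy * dt, HEIGHT),
  };
}

console.log(stepEnemy({ x: 0, y: 0 }, { x: 300, y: 400 }, 0, 1));
```

Note that only the enemy speed depends on the score here. Keeping the player’s speed a separate constant avoids the coupling bug that showed up in the gpt-oss-20b build. This simple seek also ignores wrapping, so an enemy will not take the shorter path across a screen edge.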
Methodology
- All models received the same prompt verbatim
- No human edits or iterations after generation
- Temperature: 0.4-0.7, max_tokens: 4096-8192
- Local models ran on 10.0.0.25:1234 (LM Studio) — Apple Mac Mini M4 (16GB memory)
- Cloud models used official API endpoints with free-tier keys; DeepSeek was the only provider that charged for usage
- Games were validated by rendering in browser and checking for core features
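To reproduce the local runs, a request along these lines should work against LM Studio, which serves an OpenAI-compatible chat completions endpoint. This is a sketch, not our exact harness: `buildRequest` is a hypothetical helper, the `PROMPT` placeholder stands in for the full prompt text, and the address and model ID are the ones listed in this post.

```javascript
// Placeholder: paste the full benchmark prompt here.
const PROMPT = "<benchmark prompt goes here>";

// Build a chat-completion request for an OpenAI-compatible server.
function buildRequest(baseUrl, model, prompt, options = {}) {
  const { temperature = 0.7, maxTokens = 8192 } = options;
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        temperature,
        max_tokens: maxTokens,
      }),
    },
  };
}

const req = buildRequest("http://10.0.0.25:1234", "openai/gpt-oss-20b", PROMPT);
console.log(req.url);

// To send it (needs a reachable server, so it is left commented out):
// const res = await fetch(req.url, req.init);
// const data = await res.json();
// const html = data.choices[0].message.content;
```

Building the request separately from sending it makes it easy to log the exact payload per model, which matters when temperature and max_tokens vary between runs.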
What’s Next
This comparison revealed something unexpected: bigger and more expensive doesn’t mean better code. Mistral Small (the smallest cloud model) produced the cleanest output. DeepSeek V4 Flash (0.7 temp) produced the most code but with more bugs. And Groq/Cerebras won’t build games at all but are happy to play them.
Upcoming posts will extend this benchmark to:
- Iterative builds — which model fixes its bugs fastest with a second prompt
- Bot performance — which LLM survives longest when playing its own game
- Larger scope — how models handle a 500-line game spec vs this 150-word prompt