LLM Build Leaderboard
Each game on our roadmaps is built with multiple LLMs. This page tracks how each model performs — speed, quality, cost, and bugs — so you can pick the right model for your next build.
12 benchmark runs
| Score | Model | Provider | Game | Time | Tokens | Cost | Bugs | Notes |
|---|---|---|---|---|---|---|---|---|
| 95 | Mistral Small | Mistral API | AI Dodge | 12s | 1,754 | $0 | 0 | Cleanest output — zero bugs, 12s, all features working. |
| 90 | DeepSeek V4 Flash (0.7 temp) | DeepSeek API | AI Dodge | 68s | 8,399 | $0.0012 | 0 | Largest output (429 lines, 15,487 chars). Full feature set. |
| 88 | owl-alpha | OpenRouter (free) | AI Dodge | 53s | 2,024 | $0 | 0 | Best free-tier output (8,272 chars). No bugs. |
| 82 | DeepSeek V4 Flash (0.4 temp) | DeepSeek API | AI Dodge | 14s | 2,077 | $0.0005 | 3 | Interactive build across 5 prompts. Second-fastest time (14s), behind Mistral Small (12s). |
| 78 | openai/gpt-oss-20b | Local — Mac Mini M4 | AI Dodge | 23s | 1,823 | $0 | 2 | Best local model. Fast (23s), functional. Speed + double-R bugs. |
| 76 | Nemotron 3 Ultra | OpenRouter (free) | AI Dodge | 86s | 2,015 | $0 | 1 | Most compact (132 lines). May lack edge wrapping. |
| 74 | Qwen 3.5 9B | Local — Mac Mini M4 | AI Dodge | 190s | 3,210 | $0 | 1 | Best local code quality. Clean structure, few bugs. |
| 72 | Llama 3.1 8B | Local — Mac Mini M4 | AI Dodge | 55s | 1,220 | $0 | 1 | Most reliable local fallback. Sparse but functional. |
| 70 | Gemma-4-12b-qat | Local — Mac Mini M4 | AI Dodge | 281s | 3,563 | $0 | 0 | Slowest by far (281s). Output adequate but not proportional to time. |
| 68 | Gemma-4-12b-coder-fable | Local — Mac Mini M4 | AI Dodge | 112s | 1,463 | $0 | 1 | Shortest working output (3,451 chars). Missing some features. |
| 65 | DeepSeek V4 Pro | DeepSeek API | AI Dodge | 101s | 8,399 | $0.0012 | 0 | Spent most of its tokens on reasoning. Only 50 lines of actual game code. |
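
To narrow the table down to a model for a given build, one option is to filter on your own limits and then sort by score. The sketch below is only an illustration: the `Run` class, the `shortlist` helper, and the `local` flag are hypothetical names, not part of this site, and the numbers are copied from the rows shown above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    score: int
    model: str
    local: bool      # True for the "Local" (Mac Mini M4) rows
    time_s: int
    tokens: int
    cost_usd: float
    bugs: int


# Values copied from the leaderboard rows above (all AI Dodge builds).
RUNS = [
    Run(95, "Mistral Small", False, 12, 1754, 0.0, 0),
    Run(90, "DeepSeek V4 Flash (0.7 temp)", False, 68, 8399, 0.0012, 0),
    Run(88, "owl-alpha", False, 53, 2024, 0.0, 0),
    Run(82, "DeepSeek V4 Flash (0.4 temp)", False, 14, 2077, 0.0005, 3),
    Run(78, "openai/gpt-oss-20b", True, 23, 1823, 0.0, 2),
    Run(76, "Nemotron 3 Ultra", False, 86, 2015, 0.0, 1),
    Run(74, "Qwen 3.5 9B", True, 190, 3210, 0.0, 1),
    Run(72, "Llama 3.1 8B", True, 55, 1220, 0.0, 1),
    Run(70, "Gemma-4-12b-qat", True, 281, 3563, 0.0, 0),
    Run(68, "Gemma-4-12b-coder-fable", True, 112, 1463, 0.0, 1),
    Run(65, "DeepSeek V4 Pro", False, 101, 8399, 0.0012, 0),
]


def shortlist(
    runs: List[Run],
    max_time_s: Optional[int] = None,
    max_bugs: Optional[int] = None,
    local_only: bool = False,
) -> List[Run]:
    """Keep runs that satisfy every given limit, best score first."""
    kept = [
        r for r in runs
        if (max_time_s is None or r.time_s <= max_time_s)
        and (max_bugs is None or r.bugs <= max_bugs)
        and (not local_only or r.local)
    ]
    return sorted(kept, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # Example: bug-free builds that finished in 60 seconds or less.
    for run in shortlist(RUNS, max_time_s=60, max_bugs=0):
        print(f"{run.score}  {run.model}  {run.time_s}s  ${run.cost_usd}")
```

Filtering before sorting keeps the score as the tie-breaker of last resort: a high-scoring run that blows your time or bug budget never makes the list.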