LLM Build Leaderboard

Each game on our roadmaps is built with multiple LLMs. This page tracks how each model performs — speed, quality, cost, and bugs — so you can pick the right model for your next build.

12 benchmark runs
| Score | Model | Provider | Game | Time | Tokens | Cost | Bugs | Notes |
|---|---|---|---|---|---|---|---|---|
| 95 | Mistral Small | Mistral API | AI Dodge | 12s | 1,754 | $0 | 0 | Cleanest output: zero bugs, 12s, all features working. |
| 90 | DeepSeek V4 Flash (0.7 temp) | DeepSeek API | AI Dodge | 68s | 8,399 | $0.0012 | 0 | Largest output (429 lines, 15,487 chars). Full feature set. |
| 88 | owl-alpha | OpenRouter (free) | AI Dodge | 53s | 2,024 | $0 | 0 | Best free-tier output (8,272 chars). No bugs. |
| 82 | DeepSeek V4 Flash (0.4 temp) | DeepSeek API | AI Dodge | 14s | 2,077 | $0.0005 | 3 | Interactive build across 5 prompts. Fastest cloud time. |
| 78 | openai/gpt-oss-20b | Local (Mac Mini M4) | AI Dodge | 23s | 1,823 | $0 | 2 | Best local model. Fast (23s), functional. Speed + double-R bugs. |
| 76 | Nemotron 3 Ultra | OpenRouter (free) | AI Dodge | 86s | 2,015 | $0 | 1 | Most compact (132 lines). May lack edge wrapping. |
| 74 | Qwen 3.5 9B | Local (Mac Mini M4) | AI Dodge | 190s | 3,210 | $0 | 1 | Best local code quality. Clean structure, few bugs. |
| 72 | Llama 3.1 8B | Local (Mac Mini M4) | AI Dodge | 55s | 1,220 | $0 | 1 | Most reliable local fallback. Sparse but functional. |
| 70 | Gemma-4-12b-qat | Local (Mac Mini M4) | AI Dodge | 281s | 3,563 | $0 | 0 | Slowest by far (281s). Output adequate but not proportional to time. |
| 68 | Gemma-4-12b-coder-fable | Local (Mac Mini M4) | AI Dodge | 112s | 1,463 | $0 | 1 | Shortest working output (3,451 chars). Missing some features. |
| 65 | DeepSeek V4 Pro | DeepSeek API | AI Dodge | 101s | 8,399 | $0.0012 | 0 | Spent tokens on reasoning. Only 50 lines actual game code. |
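
To pick a model from the runs above against your own constraints, a small filter can help. The sketch below is only an illustration: the `Run` record, the `RUNS` list, and the `pick` function are hypothetical names (not part of this site), and the thresholds are examples rather than recommendations. The numbers are copied from the AI Dodge rows listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """One leaderboard row (hypothetical record type; all rows are AI Dodge builds)."""
    score: int
    model: str
    provider: str
    seconds: int
    tokens: int
    cost_usd: float
    bugs: int


# Values transcribed from the leaderboard rows above.
RUNS = [
    Run(95, "Mistral Small", "Mistral API", 12, 1754, 0.0, 0),
    Run(90, "DeepSeek V4 Flash (0.7 temp)", "DeepSeek API", 68, 8399, 0.0012, 0),
    Run(88, "owl-alpha", "OpenRouter (free)", 53, 2024, 0.0, 0),
    Run(82, "DeepSeek V4 Flash (0.4 temp)", "DeepSeek API", 14, 2077, 0.0005, 3),
    Run(78, "openai/gpt-oss-20b", "Local (Mac Mini M4)", 23, 1823, 0.0, 2),
    Run(76, "Nemotron 3 Ultra", "OpenRouter (free)", 86, 2015, 0.0, 1),
    Run(74, "Qwen 3.5 9B", "Local (Mac Mini M4)", 190, 3210, 0.0, 1),
    Run(72, "Llama 3.1 8B", "Local (Mac Mini M4)", 55, 1220, 0.0, 1),
    Run(70, "Gemma-4-12b-qat", "Local (Mac Mini M4)", 281, 3563, 0.0, 0),
    Run(68, "Gemma-4-12b-coder-fable", "Local (Mac Mini M4)", 112, 1463, 0.0, 1),
    Run(65, "DeepSeek V4 Pro", "DeepSeek API", 101, 8399, 0.0012, 0),
]


def pick(
    runs: List[Run],
    max_seconds: Optional[int] = None,
    max_bugs: Optional[int] = None,
    local_only: bool = False,
) -> List[Run]:
    """Return the runs that satisfy every given constraint, best score first."""
    candidates = [
        r
        for r in runs
        if (max_seconds is None or r.seconds <= max_seconds)
        and (max_bugs is None or r.bugs <= max_bugs)
        and (not local_only or r.provider.startswith("Local"))
    ]
    return sorted(candidates, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # Example constraint: bug-free builds that finish within a minute.
    for run in pick(RUNS, max_seconds=60, max_bugs=0):
        print(run.score, run.model, f"{run.seconds}s")
```

With the example constraint of zero bugs and at most 60 seconds, the first result is Mistral Small; restricting to local models with `local_only=True` puts openai/gpt-oss-20b first.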