LLM Build Leaderboard

Each game on our roadmaps is built with multiple LLMs. This page tracks how each model performs on speed, quality, cost, and bug count, so you can pick the right model for your next build.

12 benchmark runs

Score	Model	Provider	Game	Time	Tokens	Cost	Bugs	Notes
95	Mistral Small	Mistral API	AI Dodge	12s	1,754	$0	0	Cleanest output — zero bugs, 12s, all features working.
90	DeepSeek V4 Flash (0.7 temp)	DeepSeek API	AI Dodge	68s	8,399	$0.0012	0	Largest output (429 lines, 15,487 chars). Full feature set.
88	owl-alpha	OpenRouter (free)	AI Dodge	53s	2,024	$0	0	Best free-tier output (8,272 chars). No bugs.
82	DeepSeek V4 Flash (0.4 temp)	DeepSeek API	AI Dodge	14s	2,077	$0.0005	3	Interactive build across 5 prompts. Second-fastest cloud time (14s), behind Mistral Small (12s).
78	openai/gpt-oss-20b	Local — Mac Mini M4	AI Dodge	23s	1,823	$0	2	Best local model. Fast (23s), functional. Speed + double-R bugs.
76	Nemotron 3 Ultra	OpenRouter (free)	AI Dodge	86s	2,015	$0	1	Most compact (132 lines). May lack edge wrapping.
74	Qwen 3.5 9B	Local — Mac Mini M4	AI Dodge	190s	3,210	$0	1	Best local code quality. Clean structure, few bugs.
72	Llama 3.1 8B	Local — Mac Mini M4	AI Dodge	55s	1,220	$0	1	Most reliable local fallback. Sparse but functional.
70	Gemma-4-12b-qat	Local — Mac Mini M4	AI Dodge	281s	3,563	$0	0	Slowest by far (281s). Output adequate but not proportional to time.
68	Gemma-4-12b-coder-fable	Local — Mac Mini M4	AI Dodge	112s	1,463	$0	1	Shortest working output (3,451 chars). Missing some features.
65	DeepSeek V4 Pro	DeepSeek API	AI Dodge	101s	8,399	$0.0012	0	Spent tokens on reasoning. Only 50 lines actual game code.
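
To pick a model from the table above by your own constraints rather than by score alone, something like the following sketch may help. It copies the AI Dodge rows into a small Python list and filters them by a time budget and a bug limit. The names `Run`, `shortlist`, and `tokens_per_second` are hypothetical helpers, not part of any tool used for these benchmarks. The tokens-per-second figure is only a rough derived number, assuming the Tokens and Time columns describe the same generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    score: int
    model: str
    seconds: int
    tokens: int
    cost_usd: float
    bugs: int


# Values copied from the leaderboard rows above (AI Dodge only).
RUNS = [
    Run(95, "Mistral Small", 12, 1754, 0.0, 0),
    Run(90, "DeepSeek V4 Flash (0.7 temp)", 68, 8399, 0.0012, 0),
    Run(88, "owl-alpha", 53, 2024, 0.0, 0),
    Run(82, "DeepSeek V4 Flash (0.4 temp)", 14, 2077, 0.0005, 3),
    Run(78, "openai/gpt-oss-20b", 23, 1823, 0.0, 2),
    Run(76, "Nemotron 3 Ultra", 86, 2015, 0.0, 1),
    Run(74, "Qwen 3.5 9B", 190, 3210, 0.0, 1),
    Run(72, "Llama 3.1 8B", 55, 1220, 0.0, 1),
    Run(70, "Gemma-4-12b-qat", 281, 3563, 0.0, 0),
    Run(68, "Gemma-4-12b-coder-fable", 112, 1463, 0.0, 1),
    Run(65, "DeepSeek V4 Pro", 101, 8399, 0.0012, 0),
]


def shortlist(runs, max_seconds, max_bugs=0):
    """Return runs within the time and bug limits, highest score first."""
    keep = [r for r in runs if r.seconds <= max_seconds and r.bugs <= max_bugs]
    return sorted(keep, key=lambda r: r.score, reverse=True)


def tokens_per_second(run):
    """Rough throughput: reported tokens divided by reported wall time."""
    return run.tokens / run.seconds


if __name__ == "__main__":
    # Example: zero-bug builds that finished within one minute.
    for r in shortlist(RUNS, max_seconds=60):
        print(f"{r.score:>3}  {r.model:<30} {r.seconds:>4}s  "
              f"{tokens_per_second(r):6.1f} tok/s")
```

Raising `max_bugs` to 1 or loosening `max_seconds` widens the shortlist, which is useful if you are willing to fix a small bug by hand in exchange for a local or free model.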