SvelteBench Visualization

Note: Certain OpenAI thinking models (o3, o4) and gpt-5 do not support temperature adjustment; only the default value of 1 is accepted. Models with a "-reasoning-" suffix (e.g., gpt-5-2025-08-07-reasoning-medium) use the reasoning effort named in the suffix.
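
The naming rule in the note can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the helper name build_request_options, the option keys, the prefix list, and the exact suffix pattern are all assumptions made for this example.

```python
import re

# Assumption: the suffix has the form "<base-model>-reasoning-<effort>".
_REASONING_SUFFIX = re.compile(r"^(?P<base>.+)-reasoning-(?P<effort>[a-z]+)$")

# Assumption: models whose names start with these prefixes accept only the
# default temperature of 1, so no temperature is sent for them.
_FIXED_TEMPERATURE_PREFIXES = ("o3", "o4", "gpt-5")


def build_request_options(model_name: str, temperature: float) -> dict:
    """Hypothetical helper: derive request options from a benchmark model name."""
    match = _REASONING_SUFFIX.match(model_name)
    base = match.group("base") if match else model_name

    options = {"model": base}
    if match:
        # The effort level is taken from the suffix, e.g. "medium".
        options["reasoning_effort"] = match.group("effort")
    if not base.startswith(_FIXED_TEMPERATURE_PREFIXES):
        # Only models that allow temperature changes receive the setting.
        options["temperature"] = temperature
    return options


if __name__ == "__main__":
    print(build_request_options("gpt-5-2025-08-07-reasoning-medium", 0.2))
    print(build_request_options("some-other-model", 0.2))
```

In this sketch the temperature is omitted entirely, rather than forced to 1, so that the provider's default applies.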

Errata: The "inspect" test has known correctness issues but is retained in the benchmark suite to maintain consistency and fairness in scoring across all evaluated models.

Test         pass@1  pass@10  Passing Samples  Errors
counter      80%     100%     8/10             2
derived      80%     100%     8/10             3
derived-by   60%     100%     6/10             4
each         100%    100%     10/10            0
effect       80%     100%     8/10             3
hello-world  100%    100%     10/10            0
inspect      0%      0%       0/10             16
props        20%     100%     2/10             11
snippets     0%      0%       0/10             14
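
The pass@1 and pass@10 figures above are consistent with the widely used unbiased pass@k estimator, pass@k = 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number that pass. The page does not state how the scores are computed, so treat the following as a sketch under that assumption; the function and variable names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# (test name, passing samples, total samples), copied from the table.
RESULTS = [
    ("counter", 8, 10),
    ("derived", 8, 10),
    ("derived-by", 6, 10),
    ("each", 10, 10),
    ("effect", 8, 10),
    ("hello-world", 10, 10),
    ("inspect", 0, 10),
    ("props", 2, 10),
    ("snippets", 0, 10),
]


def summarize(results):
    """Map each test name to its (pass@1, pass@10) pair."""
    return {
        name: (pass_at_k(n, c, 1), pass_at_k(n, c, 10))
        for name, c, n in results
    }


if __name__ == "__main__":
    for name, (p1, p10) in summarize(RESULTS).items():
        print(f"{name:<12} pass@1={p1:.0%} pass@10={p10:.0%}")
```

With 10 samples per test, pass@1 reduces to the fraction of passing samples, and pass@10 is 100% whenever at least one sample passes. This is why props scores 20% on pass@1 but 100% on pass@10.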