SvelteBench Visualization

Note: Certain OpenAI thinking models (o3, o4) and gpt-5 do not support temperature adjustments; only the default value of 1 is supported. Models with a "-reasoning-" suffix (e.g., gpt-5-2025-08-07-reasoning-medium) use the reasoning effort setting named at the end of the model name.

Errata: The "inspect" test has known correctness issues but is retained in the benchmark suite to maintain consistency and fairness in scoring across all evaluated models.

Test	pass@1	pass@10	Passing Samples	Errors
counter	100%	100%	10/10	0
derived	100%	100%	10/10	0
derived-by	100%	100%	10/10	0
each	100%	100%	10/10	0
effect	100%	100%	10/10	0
hello-world	100%	100%	10/10	0
inspect	0%	0%	0/10	10
props	60%	100%	6/10	4
snippets	0%	0%	0/10	10

Test	pass@1	pass@10	Passing Samples	Errors
counter	100%	100%	10/10	0
derived	100%	100%	10/10	0
derived-by	100%	100%	10/10	0
each	0%	0%	0/10	10
effect	100%	100%	10/10	0
hello-world	100%	100%	10/10	0
inspect	0%	0%	0/10	10
props	10%	100%	1/10	9
snippets	0%	0%	0/10	10

Test	pass@1	pass@10	Passing Samples	Errors
counter	100%	100%	10/10	0
derived	100%	100%	10/10	0
derived-by	60%	100%	6/10	4
each	90%	100%	9/10	1
effect	70%	100%	7/10	6
hello-world	100%	100%	10/10	0
inspect	0%	0%	0/10	13
props	100%	100%	10/10	0
snippets	10%	100%	1/10	9

Test	pass@1	pass@10	Passing Samples	Errors
counter	100%	100%	10/10	0
derived	100%	100%	10/10	0
derived-by	100%	100%	10/10	0
each	100%	100%	10/10	0
effect	100%	100%	10/10	0
hello-world	100%	100%	10/10	0
inspect	0%	0%	0/10	13
props	100%	100%	10/10	0
snippets	50%	100%	5/10	5

Test	pass@1	pass@10	Passing Samples	Errors
counter	0%	0%	0/10	13
derived	0%	0%	0/10	14
derived-by	40%	100%	4/10	14
each	0%	0%	0/10	11
effect	20%	100%	2/10	14
hello-world	100%	100%	10/10	0
inspect	0%	0%	0/10	13
props	0%	0%	0/10	10
snippets	0%	0%	0/10	14

Test	pass@1	pass@10	Passing Samples	Errors
counter	30%	100%	3/10	13
derived	0%	0%	0/10	11
derived-by	10%	100%	1/10	9
each	10%	100%	1/10	9
effect	0%	0%	0/10	17
hello-world	90%	100%	9/10	1
inspect	0%	0%	0/10	10
props	0%	0%	0/10	10
snippets	0%	0%	0/10	10

Test	pass@1	pass@10	Passing Samples	Errors
counter	10%	100%	1/10	12
derived	0%	0%	0/10	12
derived-by	0%	0%	0/10	10
each	20%	100%	2/10	8
effect	0%	0%	0/10	11
hello-world	90%	100%	9/10	1
inspect	0%	0%	0/10	13
props	0%	0%	0/10	10
snippets	0%	0%	0/10	10
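The pass@1 and pass@10 columns in the tables above are consistent with the unbiased pass@k estimator commonly used for code-generation benchmarks, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples per test (10 here) and c is the number that passed. The page does not state which formula it uses, so the Python below is a minimal sketch under that assumption; `pass_at_k` is a hypothetical helper, not SvelteBench's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the benchmark page does not state which formula it uses.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k draw contains only failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


# (test name, passing samples c out of n = 10), taken from rows in the tables
rows = [("props", 6), ("snippets", 1), ("effect", 7), ("inspect", 0)]
for name, c in rows:
    p1 = pass_at_k(10, c, 1)
    p10 = pass_at_k(10, c, 10)
    print(f"{name:12s} pass@1={p1:.0%} pass@10={p10:.0%}")
```

With n = k = 10, pass@10 is 100% whenever at least one sample passes and 0% otherwise, which is why every row with a nonzero passing count shows 100% in that column, while pass@1 reduces to c/n.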