Note: Certain OpenAI reasoning models (o3, o4) and gpt-5 do not support temperature adjustments; only the default value of 1 is supported. Models with a "-reasoning-" suffix (e.g., gpt-5-2025-08-07-reasoning-medium) use the reasoning effort named in the suffix.
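As a rough illustration of the note above, the following sketch shows one way a benchmark runner could turn a model name into request parameters. The helper name `build_request_params`, the prefix list, the dictionary keys, and the choice to strip the suffix from the model name are all assumptions made for this example; they are not taken from the benchmark's actual code.

```python
import re

# Assumption: models whose names start with these prefixes only accept the
# default temperature of 1, so no temperature is sent for them.
FIXED_TEMPERATURE_PREFIXES = ("o3", "o4", "gpt-5")

# Matches names such as "gpt-5-2025-08-07-reasoning-medium".
_REASONING_SUFFIX = re.compile(r"^(?P<base>.+)-reasoning-(?P<effort>[a-z]+)$")


def build_request_params(model_name: str, temperature: float) -> dict:
    """Hypothetical helper: derive request parameters from a model name."""
    params = {}
    match = _REASONING_SUFFIX.match(model_name)
    if match:
        # Assumption: the suffix is stripped and passed as a separate setting.
        params["model"] = match.group("base")
        params["reasoning_effort"] = match.group("effort")
    else:
        params["model"] = model_name

    # Only send a temperature to models that accept a non-default value.
    if not params["model"].startswith(FIXED_TEMPERATURE_PREFIXES):
        params["temperature"] = temperature
    return params
```

Under these assumptions, `build_request_params("gpt-5-2025-08-07-reasoning-medium", 0.7)` yields a model of `gpt-5-2025-08-07` with a `medium` reasoning effort and no temperature entry.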
Errata: The "inspect" test has known correctness issues but is retained in the benchmark suite to maintain consistency and fairness in scoring across all evaluated models.
Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 100% | 100% | 10/10 | 0 | |
derived | 100% | 100% | 10/10 | 0 | |
derived-by | 100% | 100% | 10/10 | 0 | |
each | 100% | 100% | 10/10 | 0 | |
effect | 100% | 100% | 10/10 | 0 | |
hello-world | 100% | 100% | 10/10 | 0 | |
inspect | 0% | 0% | 0/10 | 10 | |
props | 60% | 100% | 6/10 | 4 | |
snippets | 0% | 0% | 0/10 | 10 | |

Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 100% | 100% | 10/10 | 0 | |
derived | 100% | 100% | 10/10 | 0 | |
derived-by | 100% | 100% | 10/10 | 0 | |
each | 0% | 0% | 0/10 | 10 | |
effect | 100% | 100% | 10/10 | 0 | |
hello-world | 100% | 100% | 10/10 | 0 | |
inspect | 0% | 0% | 0/10 | 10 | |
props | 10% | 100% | 1/10 | 9 | |
snippets | 0% | 0% | 0/10 | 10 | |

Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 100% | 100% | 10/10 | 0 | |
derived | 100% | 100% | 10/10 | 0 | |
derived-by | 60% | 100% | 6/10 | 4 | |
each | 90% | 100% | 9/10 | 1 | |
effect | 70% | 100% | 7/10 | 6 | |
hello-world | 100% | 100% | 10/10 | 0 | |
inspect | 0% | 0% | 0/10 | 13 | |
props | 100% | 100% | 10/10 | 0 | |
snippets | 10% | 100% | 1/10 | 9 | |

Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 100% | 100% | 10/10 | 0 | |
derived | 100% | 100% | 10/10 | 0 | |
derived-by | 100% | 100% | 10/10 | 0 | |
each | 100% | 100% | 10/10 | 0 | |
effect | 100% | 100% | 10/10 | 0 | |
hello-world | 100% | 100% | 10/10 | 0 | |
inspect | 0% | 0% | 0/10 | 13 | |
props | 100% | 100% | 10/10 | 0 | |
snippets | 50% | 100% | 5/10 | 5 | |

Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 0% | 0% | 0/10 | 13 | |
derived | 0% | 0% | 0/10 | 14 | |
derived-by | 40% | 100% | 4/10 | 14 | |
each | 0% | 0% | 0/10 | 11 | |
effect | 20% | 100% | 2/10 | 14 | |
hello-world | 100% | 100% | 10/10 | 0 | |
inspect | 0% | 0% | 0/10 | 13 | |
props | 0% | 0% | 0/10 | 10 | |
snippets | 0% | 0% | 0/10 | 14 | |

Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 30% | 100% | 3/10 | 13 | |
derived | 0% | 0% | 0/10 | 11 | |
derived-by | 10% | 100% | 1/10 | 9 | |
each | 10% | 100% | 1/10 | 9 | |
effect | 0% | 0% | 0/10 | 17 | |
hello-world | 90% | 100% | 9/10 | 1 | |
inspect | 0% | 0% | 0/10 | 10 | |
props | 0% | 0% | 0/10 | 10 | |
snippets | 0% | 0% | 0/10 | 10 | |

Test | pass@1 | pass@10 | Passing Samples | Errors | Actions |
---|---|---|---|---|---|
counter | 10% | 100% | 1/10 | 12 | |
derived | 0% | 0% | 0/10 | 12 | |
derived-by | 0% | 0% | 0/10 | 10 | |
each | 20% | 100% | 2/10 | 8 | |
effect | 0% | 0% | 0/10 | 11 | |
hello-world | 90% | 100% | 9/10 | 1 | |
inspect | 0% | 0% | 0/10 | 13 | |
props | 0% | 0% | 0/10 | 10 | |
snippets | 0% | 0% | 0/10 | 10 | |
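The pass@1 and pass@10 columns in the tables above are consistent with the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass. The sketch below assumes that estimator; the source does not state which formula the benchmark uses, and the function name `pass_at_k` is chosen here for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # With fewer than k failing samples, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A row reporting 6/10 passing samples:
row_pass_at_1 = pass_at_k(n=10, c=6, k=1)    # about 0.6, shown as 60%
row_pass_at_10 = pass_at_k(n=10, c=6, k=10)  # 1.0, shown as 100%
```

With 10 samples per test, pass@1 reduces to the fraction of passing samples, and pass@10 is 100% whenever at least one sample passes and 0% otherwise, which matches every row shown. The Errors column is not used by this calculation; several rows list more errors than failing samples, so it appears to be counted separately.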