Note: Certain OpenAI reasoning models (o3, o4) and gpt-5 do not support temperature adjustment; only the default value of 1 is accepted. Models whose names contain "-reasoning-" (e.g., gpt-5-2025-08-07-reasoning-medium) run with the reasoning effort named in the suffix (here, medium).

Errata: The "inspect" test has known correctness issues but is retained in the benchmark suite to maintain consistency and fairness in scoring across all evaluated models.

| Test | pass@1 | pass@10 | Passing Samples | Errors |
|---|---|---|---|---|
| counter | 100% | 100% | 10/10 | 0 |
| derived | 100% | 100% | 10/10 | 0 |
| derived-by | 100% | 100% | 10/10 | 0 |
| each | 90% | 100% | 9/10 | 1 |
| effect | 100% | 100% | 10/10 | 0 |
| hello-world | 100% | 100% | 10/10 | 0 |
| inspect | 10% | 100% | 1/10 | 9 |
| props | 100% | 100% | 10/10 | 0 |
| snippets | 100% | 100% | 10/10 | 0 |
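
For readers unfamiliar with the pass@1 and pass@10 columns, the sketch below shows one common way to derive them from the "Passing Samples" counts. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), with n = 10 samples per test. The benchmark's actual scoring code is not shown here, so the function name `pass_at_k` and the choice of estimator are assumptions, although the formula is consistent with the rows above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: the chance that at least one of k samples,
    drawn without replacement from n generated samples, passes,
    given that c of the n samples passed.

    Assumed estimator (not confirmed by the benchmark source):
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every possible draw of k samples must include a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Passing-sample counts taken from three rows of the table (n = 10 each).
passing_counts = {"counter": 10, "each": 9, "inspect": 1}

for test_name, passed in passing_counts.items():
    p1 = pass_at_k(10, passed, 1)
    p10 = pass_at_k(10, passed, 10)
    print(f"{test_name}: pass@1={p1:.0%} pass@10={p10:.0%}")
```

Under this assumption, a test such as inspect scores 100% on pass@10 with a single passing sample, because drawing all 10 samples always includes that one success, while its pass@1 is only 1/10.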