SvelteBench Visualization

🏆 Top Models Leaderboard

Average pass@1 scores
Rank Model Score
1 claude-opus-4-1-20250805 (Anthropic) 88.9%
2 claude-opus-4-20250514 (Anthropic) 88.9%
3 claude-sonnet-4-20250514 (Anthropic) 88.9%
4 x-ai/grok-4 (OpenRouter) 87.8%
5 moonshotai/kimi-k2 (OpenRouter) 84.4%
6 gemini-2.5-pro (Google) 84.4%
7 gemini-2.5-pro-preview-03-25 (Google) 83.3%
8 x-ai/grok-3-beta (OpenRouter) 83.3%
9 gemini-2.5-pro-preview-06-05 (Google) 81.1%
10 x-ai/grok-3 (OpenRouter) 81.1%
11 gemini-2.5-pro-preview-05-06 (Google) 80.0%
12 glm-4.5-x (Z.ai) 80.0%
13 gpt-5-2025-08-07 (OpenAI) 78.9%
14 glm-4.5 (Z.ai) 78.9%
15 gpt-5-2025-08-07-reasoning-medium (OpenAI) 77.8%
16 openrouter/horizon-alpha (OpenRouter) 77.8%
17 openrouter/horizon-beta (OpenRouter) 76.7%
18 claude-3-5-haiku-20241022 (Anthropic) 73.3%
19 mistralai/mistral-medium-3 (OpenRouter) 72.2%
20 gemini-2.5-flash (Google) 71.1%
21 gemini-2.5-flash-preview-04-17 (Google) 70.0%
22 qwen/qwen3-coder (OpenRouter) 67.8%
23 x-ai/grok-3-mini-beta (OpenRouter) 67.8%
24 z-ai/glm-4.5 (OpenRouter) 66.7%
25 x-ai/grok-3-mini (OpenRouter) 64.4%
26 meta-llama/llama-4-maverick (OpenRouter) 64.4%
27 mistralai/codestral-2508 (OpenRouter) 58.9%
28 glm-4.5-air (Z.ai) 58.9%
29 qwen/qwen3-235b-a22b-07-25 (OpenRouter) 57.8%
30 qwen/qwen3-235b-a22b-thinking-2507 (OpenRouter) 57.8%
31 z-ai/glm-4.5-air (OpenRouter) 57.8%
32 claude-3-7-sonnet-20250219 (Anthropic) 56.7%
33 glm-4.5-airx (Z.ai) 55.6%
34 mistralai/devstral-medium (OpenRouter) 52.2%
35 deepseek/deepseek-r1-0528 (OpenRouter) 48.9%
36 gemini-2.5-flash-lite (Google) 48.9%
37 z-ai/glm-4-32b (OpenRouter) 46.7%
38 glm-4-32b-0414-128k (Z.ai) 44.4%
39 mistralai/mistral-medium-3.1 (OpenRouter) 41.1%
40 openai/gpt-oss-120b (OpenRouter) 35.6%
41 qwen/qwen3-30b-a3b (OpenRouter) 34.4%
42 o3-2025-04-16 (OpenAI) 30.0%
43 chatgpt-4o-latest (OpenAI) 25.6%
44 gpt-4.1-2025-04-14 (OpenAI) 22.2%
45 gpt-5-mini-2025-08-07 (OpenAI) 21.1%
46 openai/gpt-oss-20b (OpenRouter) 20.0%
47 gpt-4o-2024-08-06 (OpenAI) 17.8%
48 gpt-5-nano-2025-08-07 (OpenAI) 16.7%
49 o3-mini-2025-01-31 (OpenAI) 15.6%
50 meta-llama/llama-4-scout (OpenRouter) 15.6%
51 o4-mini-2025-04-16 (OpenAI) 13.3%
52 mistralai/devstral-small (OpenRouter) 13.3%
53 gemma-3-27b-it (Google) 11.1%
54 gpt-4.1-nano-2025-04-14 (OpenAI) 11.1%
55 o1-pro-2025-03-19 (OpenAI) 11.1%
56 baidu/ernie-4.5-21b-a3b (OpenRouter) 11.1%
57 qwen/qwen3-30b-a3b-instruct-2507 (OpenRouter) 10.0%
58 ai21/jamba-large-1.7 (OpenRouter) 8.9%
59 ai21/jamba-mini-1.7 (OpenRouter) 8.9%
60 google/gemma-3n-e4b-it (OpenRouter) 8.9%
61 moonshotai/kimi-dev-72b:free (OpenRouter) 3.3%
62 gpt-4.1-mini-2025-04-14 (OpenAI) 1.1%

Note: Certain OpenAI thinking models (o3, o4) and gpt-5 do not support temperature adjustment; only the default value of 1 is accepted. Models whose names contain "-reasoning-" (e.g., gpt-5-2025-08-07-reasoning-medium) are run with the reasoning effort named after that marker.
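
As a rough illustration of the naming convention in the note, the sketch below splits a model identifier into a base model and a reasoning effort. The function name and the regular expression are hypothetical; the report does not show how SvelteBench itself parses these names.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: "<base model id>-reasoning-<effort>".
# This only mirrors the naming shown in the note above; it is not
# taken from the benchmark's own code.
_REASONING_SUFFIX = re.compile(r"^(?P<base>.+)-reasoning-(?P<effort>[a-z]+)$")


def split_reasoning_suffix(model_id: str) -> Tuple[str, Optional[str]]:
    """Return (base model id, reasoning effort), or (model_id, None) if no suffix."""
    match = _REASONING_SUFFIX.match(model_id)
    if match is None:
        return model_id, None
    return match.group("base"), match.group("effort")


print(split_reasoning_suffix("gpt-5-2025-08-07-reasoning-medium"))
print(split_reasoning_suffix("gpt-5-2025-08-07"))
```

Under this assumed convention, a name without the marker is returned unchanged with no effort value.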

Errata: The "inspect" test has known correctness issues but is retained in the benchmark suite to maintain consistency and fairness in scoring across all evaluated models.
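
To see what retaining "inspect" means for the scores, the sketch below averages the per-test pass@1 values for claude-opus-4-1-20250805, copied from its table in this report. It assumes the leaderboard score is the unweighted mean of the nine per-test pass@1 values; that assumption reproduces the listed 88.9%, but the report does not state the formula. The function name and the `exclude` parameter are illustrative only.

```python
from typing import Dict, Iterable

# pass@1 values copied from the claude-opus-4-1-20250805 table in this report.
OPUS_4_1_PASS_AT_1: Dict[str, float] = {
    "counter": 1.0,
    "derived": 1.0,
    "derived-by": 1.0,
    "each": 1.0,
    "effect": 1.0,
    "hello-world": 1.0,
    "inspect": 0.0,
    "props": 1.0,
    "snippets": 1.0,
}


def average_pass_at_1(scores: Dict[str, float], exclude: Iterable[str] = ()) -> float:
    """Unweighted mean of per-test pass@1, optionally skipping some tests."""
    skipped = set(exclude)
    kept = [value for name, value in scores.items() if name not in skipped]
    if not kept:
        raise ValueError("no tests left to average")
    return sum(kept) / len(kept)


# With "inspect" counted (assumed to match how the leaderboard is computed).
print(f"{average_pass_at_1(OPUS_4_1_PASS_AT_1):.1%}")
# Hypothetical rescoring without "inspect", for comparison only.
print(f"{average_pass_at_1(OPUS_4_1_PASS_AT_1, exclude=['inspect']):.1%}")
```

Every model in the results listed here scores 0 on "inspect", so under this averaging assumption 8/9 (88.9%) is the highest score any listed model reaches.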

Anthropic

claude-3-5-haiku-20241022

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.6000 1.0000 6/10 4
snippets ❌ FAIL 0.0000 0.0000 0/10 10

claude-3-7-sonnet-20250219

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.1000 1.0000 1/10 9
snippets ❌ FAIL 0.0000 0.0000 0/10 10
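
The pass@1 and pass@10 columns are consistent with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass. The sketch below is written under that assumption; the report does not say which formula SvelteBench uses. Capping k at n and returning 0 when nothing passes are additional assumptions, chosen so that single-sample rows (such as those for o1-pro-2025-03-19) come out as 0.0000 or 1.0000.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed.

    Uses 1 - C(n - c, k) / C(n, k). Assumptions not taken from the report:
    k is capped at n, and c == 0 always yields 0.0.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, n)  # cannot draw more samples than were generated
    if c == 0:
        return 0.0
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# The "props" row for claude-3-7-sonnet-20250219: 1 of 10 samples passed.
print(f"{pass_at_k(10, 1, 1):.4f}")   # pass@1
print(f"{pass_at_k(10, 1, 10):.4f}")  # pass@10
```

With 10 samples and k = 10, this estimator is 1 as soon as a single sample passes, which is why every PARTIAL row shows pass@10 of 1.0000.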

claude-opus-4-1-20250805

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

claude-opus-4-20250514

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

claude-sonnet-4-20250514

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

Google

gemini-2.5-flash

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ⚠️ PARTIAL 0.4000 1.0000 4/10 6
effect ⚠️ PARTIAL 0.8000 1.0000 8/10 4
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.2000 1.0000 2/10 8
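
Across the tables in this report, the Status column follows directly from the Passing Samples column: PASS when every sample passes, FAIL when none does, PARTIAL otherwise. The sketch below encodes that observed pattern; it is inferred from the rows shown here, and the function name is hypothetical.

```python
def status_from_samples(passing: int, total: int) -> str:
    """Map a passing/total sample count to the status label used in the tables."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passing <= total:
        raise ValueError("passing must be between 0 and total")
    if passing == total:
        return "PASS"
    if passing == 0:
        return "FAIL"
    return "PARTIAL"


# Rows from the gemini-2.5-flash table: counter 10/10, each 4/10, inspect 0/10.
for name, passing in [("counter", 10), ("each", 4), ("inspect", 0)]:
    print(name, status_from_samples(passing, 10))
```

Note that the Errors column is not used here: it can exceed the number of failing samples, so it does not map one-to-one onto samples.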

gemini-2.5-flash-lite

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 3
derived ⚠️ PARTIAL 0.5000 1.0000 5/10 10
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ⚠️ PARTIAL 0.1000 1.0000 1/10 9
effect ⚠️ PARTIAL 0.8000 1.0000 8/10 4
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 16
props ⚠️ PARTIAL 0.3000 1.0000 3/10 7
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gemini-2.5-flash-preview-04-17

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 4
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.7000 1.0000 7/10 6
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 9

gemini-2.5-pro

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.6000 1.0000 6/10 4

gemini-2.5-pro-preview-03-25

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

gemini-2.5-pro-preview-05-06

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 1
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.8000 1.0000 8/10 4
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ⚠️ PARTIAL 0.9000 1.0000 9/10 1
snippets ⚠️ PARTIAL 0.6000 1.0000 6/10 4

gemini-2.5-pro-preview-06-05

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.9000 1.0000 9/10 2
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.4000 1.0000 4/10 6

gemma-3-27b-it

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 10
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

OpenAI

chatgpt-4o-latest

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 3
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ⚠️ PARTIAL 0.1000 1.0000 1/10 9
each ⚠️ PARTIAL 0.4000 1.0000 4/10 6
effect ❌ FAIL 0.0000 0.0000 0/10 13
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 9

gpt-4.1-2025-04-14

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 3
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ⚠️ PARTIAL 0.2000 1.0000 2/10 9
effect ⚠️ PARTIAL 0.1000 1.0000 1/10 10
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 19
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-4.1-mini-2025-04-14

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 13
derived-by ❌ FAIL 0.0000 0.0000 0/10 12
each ⚠️ PARTIAL 0.1000 1.0000 1/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 16
hello-world ❌ FAIL 0.0000 0.0000 0/10 10
inspect ❌ FAIL 0.0000 0.0000 0/10 28
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-4.1-nano-2025-04-14

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 14
derived ❌ FAIL 0.0000 0.0000 0/10 13
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 13
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-4o-2024-08-06

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 14
derived-by ⚠️ PARTIAL 0.4000 1.0000 4/10 14
each ❌ FAIL 0.0000 0.0000 0/10 11
effect ⚠️ PARTIAL 0.2000 1.0000 2/10 14
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 14

gpt-5-2025-08-07

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.9000 1.0000 9/10 1
derived-by ⚠️ PARTIAL 0.7000 1.0000 7/10 5
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.7000 1.0000 7/10 3
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 25
props ⚠️ PARTIAL 0.9000 1.0000 9/10 4
snippets ⚠️ PARTIAL 0.9000 1.0000 9/10 3

gpt-5-2025-08-07-reasoning-medium

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.7000 1.0000 7/10 3
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 6
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.7000 1.0000 7/10 3
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 16
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

gpt-5-mini-2025-08-07

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.4000 1.0000 4/10 6
derived ❌ FAIL 0.0000 0.0000 0/10 11
derived-by ❌ FAIL 0.0000 0.0000 0/10 12
each ⚠️ PARTIAL 0.5000 1.0000 5/10 5
effect ⚠️ PARTIAL 0.1000 1.0000 1/10 12
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-5-nano-2025-08-07

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.5000 1.0000 5/10 8
derived ❌ FAIL 0.0000 0.0000 0/10 13
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 13
snippets ❌ FAIL 0.0000 0.0000 0/10 24

o1-pro-2025-03-19

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/1 1
derived ❌ FAIL 0.0000 0.0000 0/1 1
derived-by ❌ FAIL 0.0000 0.0000 0/1 3
each ❌ FAIL 0.0000 0.0000 0/1 1
effect ❌ FAIL 0.0000 0.0000 0/1 1
hello-world ✅ PASS 1.0000 1.0000 1/1 0
inspect ❌ FAIL 0.0000 0.0000 0/1 1
props ❌ FAIL 0.0000 0.0000 0/1 1
snippets ❌ FAIL 0.0000 0.0000 0/1 1

o3-2025-04-16

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.6000 1.0000 6/10 4
derived ⚠️ PARTIAL 0.1000 1.0000 1/10 11
derived-by ⚠️ PARTIAL 0.2000 1.0000 2/10 8
each ⚠️ PARTIAL 0.6000 1.0000 6/10 4
effect ⚠️ PARTIAL 0.2000 1.0000 2/10 11
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

o3-mini-2025-01-31

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.3000 1.0000 3/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 11
derived-by ⚠️ PARTIAL 0.1000 1.0000 1/10 9
each ⚠️ PARTIAL 0.1000 1.0000 1/10 9
effect ❌ FAIL 0.0000 0.0000 0/10 17
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

o4-mini-2025-04-16

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.1000 1.0000 1/10 12
derived ❌ FAIL 0.0000 0.0000 0/10 12
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ⚠️ PARTIAL 0.2000 1.0000 2/10 8
effect ❌ FAIL 0.0000 0.0000 0/10 11
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

OpenRouter

ai21/jamba-large-1.7

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 10
derived ❌ FAIL 0.0000 0.0000 0/10 11
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.8000 1.0000 8/10 2
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

ai21/jamba-mini-1.7

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 10
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.8000 1.0000 8/10 2
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

baidu/ernie-4.5-21b-a3b

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 11
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 11
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

deepseek/deepseek-r1-0528

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.3000 1.0000 3/10 22
derived ⚠️ PARTIAL 0.6000 1.0000 6/10 6
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 1
each ⚠️ PARTIAL 0.2000 1.0000 2/10 10
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.4000 1.0000 4/10 6
snippets ❌ FAIL 0.0000 0.0000 0/10 10

google/gemma-3n-e4b-it

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 19
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.8000 1.0000 8/10 2
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

meta-llama/llama-4-maverick

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.8000 1.0000 8/10 2
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ❌ FAIL 0.0000 0.0000 0/10 10

meta-llama/llama-4-scout

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.9000 1.0000 9/10 1
derived ⚠️ PARTIAL 0.2000 1.0000 2/10 8
derived-by ❌ FAIL 0.0000 0.0000 0/10 11
each ❌ FAIL 0.0000 0.0000 0/10 11
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.3000 1.0000 3/10 11
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/codestral-2508

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 4
each ⚠️ PARTIAL 0.7000 1.0000 7/10 4
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 22
props ⚠️ PARTIAL 0.1000 1.0000 1/10 9
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/devstral-medium

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.8000 1.0000 8/10 4
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 1
each ⚠️ PARTIAL 0.1000 1.0000 1/10 9
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 19
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/devstral-small

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 6
derived ⚠️ PARTIAL 0.1000 1.0000 1/10 13
derived-by ⚠️ PARTIAL 0.1000 1.0000 1/10 13
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.3000 1.0000 3/10 7
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/mistral-medium-3

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ⚠️ PARTIAL 0.8000 1.0000 8/10 2
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 16
props ⚠️ PARTIAL 0.8000 1.0000 8/10 2
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/mistral-medium-3.1

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.5000 1.0000 5/10 5
each ⚠️ PARTIAL 0.2000 1.0000 2/10 8
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ❌ FAIL 0.0000 0.0000 0/10 10
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

moonshotai/kimi-dev-72b:free

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 10
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.3000 1.0000 3/10 8
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

moonshotai/kimi-k2

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 22
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.7000 1.0000 7/10 4

openai/gpt-oss-120b

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.8000 1.0000 8/10 2
derived ❌ FAIL 0.0000 0.0000 0/10 14
derived-by ⚠️ PARTIAL 0.3000 1.0000 3/10 7
each ⚠️ PARTIAL 0.5000 1.0000 5/10 6
effect ⚠️ PARTIAL 0.6000 1.0000 6/10 4
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 19
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 14

openai/gpt-oss-20b

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.2000 1.0000 2/10 8
derived ❌ FAIL 0.0000 0.0000 0/10 15
derived-by ❌ FAIL 0.0000 0.0000 0/10 14
each ⚠️ PARTIAL 0.6000 1.0000 6/10 5
effect ❌ FAIL 0.0000 0.0000 0/10 16
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 16
snippets ❌ FAIL 0.0000 0.0000 0/10 12

openrouter/horizon-alpha

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 22
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ❌ FAIL 0.0000 0.0000 0/10 10

openrouter/horizon-beta

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 25
props ⚠️ PARTIAL 0.9000 1.0000 9/10 1
snippets ❌ FAIL 0.0000 0.0000 0/10 10

qwen/qwen3-235b-a22b-07-25

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.9000 1.0000 9/10 1
derived-by ⚠️ PARTIAL 0.8000 1.0000 8/10 2
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.5000 1.0000 5/10 5
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 31
props ⚠️ PARTIAL 0.2000 1.0000 2/10 20
snippets ❌ FAIL 0.0000 0.0000 0/10 10

qwen/qwen3-235b-a22b-thinking-2507

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.8000 1.0000 8/10 2
derived ⚠️ PARTIAL 0.8000 1.0000 8/10 3
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 4
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.8000 1.0000 8/10 3
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 16
props ⚠️ PARTIAL 0.2000 1.0000 2/10 11
snippets ❌ FAIL 0.0000 0.0000 0/10 14

qwen/qwen3-30b-a3b

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.6000 1.0000 6/10 7
derived ⚠️ PARTIAL 0.3000 1.0000 3/10 8
derived-by ⚠️ PARTIAL 0.3000 1.0000 3/10 9
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.3000 1.0000 3/10 11
hello-world ⚠️ PARTIAL 0.7000 1.0000 7/10 5
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 22
snippets ❌ FAIL 0.0000 0.0000 0/10 10

qwen/qwen3-30b-a3b-instruct-2507

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 10
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 12
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 16
snippets ❌ FAIL 0.0000 0.0000 0/10 10

qwen/qwen3-coder

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.3000 1.0000 3/10 21
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.9000 1.0000 9/10 2
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 37
props ⚠️ PARTIAL 0.9000 1.0000 9/10 1
snippets ❌ FAIL 0.0000 0.0000 0/10 10

x-ai/grok-3

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.8000 1.0000 8/10 2
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

x-ai/grok-3-beta

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

x-ai/grok-3-mini

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.9000 1.0000 9/10 1
derived-by ⚠️ PARTIAL 0.5000 1.0000 5/10 9
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.5000 1.0000 5/10 10
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 31
props ⚠️ PARTIAL 0.9000 1.0000 9/10 1
snippets ❌ FAIL 0.0000 0.0000 0/10 10

x-ai/grok-3-mini-beta

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 7
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.7000 1.0000 7/10 6
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 40
props ⚠️ PARTIAL 0.8000 1.0000 8/10 2
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 11

x-ai/grok-4

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 3
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

z-ai/glm-4-32b

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.9000 1.0000 9/10 1
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.8000 1.0000 8/10 4
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ⚠️ PARTIAL 0.2000 1.0000 2/10 12
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 25
props ⚠️ PARTIAL 0.3000 1.0000 3/10 7
snippets ❌ FAIL 0.0000 0.0000 0/10 10

z-ai/glm-4.5

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.8000 1.0000 8/10 2
derived ⚠️ PARTIAL 0.4000 1.0000 4/10 9
derived-by ⚠️ PARTIAL 0.8000 1.0000 8/10 2
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.8000 1.0000 8/10 4
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 25
props ⚠️ PARTIAL 0.8000 1.0000 8/10 5
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

z-ai/glm-4.5-air

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.9000 1.0000 9/10 3
derived ⚠️ PARTIAL 0.7000 1.0000 7/10 4
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 3
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.4000 1.0000 4/10 7
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ⚠️ PARTIAL 0.3000 1.0000 3/10 7
snippets ❌ FAIL 0.0000 0.0000 0/10 10

Z.ai

glm-4-32b-0414-128k

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 12
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ⚠️ PARTIAL 0.2000 1.0000 2/10 12
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 25
props ⚠️ PARTIAL 0.3000 1.0000 3/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

glm-4.5

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.7000 1.0000 7/10 6
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.9000 1.0000 9/10 2
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

glm-4.5-air

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.6000 1.0000 6/10 5
derived-by ⚠️ PARTIAL 0.8000 1.0000 8/10 2
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.6000 1.0000 6/10 4
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.2000 1.0000 2/10 8
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 9

glm-4.5-airx

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.4000 1.0000 4/10 8
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 1
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.4000 1.0000 4/10 7
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.1000 1.0000 1/10 9
snippets ⚠️ PARTIAL 0.2000 1.0000 2/10 8

glm-4.5-x

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ⚠️ PARTIAL 0.5000 1.0000 5/10 10
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 16
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.7000 1.0000 7/10 3