SvelteBench Visualization

Note: OpenAI thinking models (o3, o4) do not support temperature adjustments. o1-pro models use the "medium" reasoning effort setting.
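
As an illustration of the note above, here is a minimal sketch of how a benchmark runner might assemble per-model request options. The function name `build_request_options`, the prefix list, and the option key `reasoning_effort` are all hypothetical and are not taken from SvelteBench or any specific client library. Treating o1 models as not accepting a temperature is an additional assumption beyond what the note states.

```python
from typing import Any, Dict, Optional

# Assumption: model-name prefixes for which no temperature is sent.
# The note names o3 and o4; including o1 here is a guess made for this sketch.
NO_TEMPERATURE_PREFIXES = ("o1", "o3", "o4")


def build_request_options(model: str, temperature: Optional[float] = None) -> Dict[str, Any]:
    """Return illustrative request options for a given model name."""
    options: Dict[str, Any] = {"model": model}

    # Only attach a temperature when the model is not a thinking model.
    if temperature is not None and not model.startswith(NO_TEMPERATURE_PREFIXES):
        options["temperature"] = temperature

    # The note says o1-pro runs with "medium" reasoning effort.
    # The key name below is a placeholder, not a confirmed API field.
    if model.startswith("o1-pro"):
        options["reasoning_effort"] = "medium"

    return options


if __name__ == "__main__":
    print(build_request_options("gpt-4.1-2025-04-14", temperature=0.2))
    print(build_request_options("o3-2025-04-16", temperature=0.2))
    print(build_request_options("o1-pro-2025-03-19"))
```

Keeping this logic in one pure function makes it easy to check which settings each model in the tables below would have been run with, whatever client is used to send the request.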


Anthropic

claude-3-5-haiku-20241022

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.6000 1.0000 6/10 4
snippets ❌ FAIL 0.0000 0.0000 0/10 10

claude-3-7-sonnet-20250219

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.1000 1.0000 1/10 9
snippets ❌ FAIL 0.0000 0.0000 0/10 10

claude-opus-4-20250514

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

claude-sonnet-4-20250514

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ✅ PASS 1.0000 1.0000 10/10 0

Google

gemini-2.5-flash-preview-04-17

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 4
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.7000 1.0000 7/10 6
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 9

gemini-2.5-pro-preview-03-25

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

gemini-2.5-pro-preview-05-06

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 1
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.8000 1.0000 8/10 4
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ⚠️ PARTIAL 0.9000 1.0000 9/10 1
snippets ⚠️ PARTIAL 0.6000 1.0000 6/10 4

gemini-2.5-pro-preview-06-05

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ⚠️ PARTIAL 0.9000 1.0000 9/10 2
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.4000 1.0000 4/10 6

gemma-3-27b-it

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 10
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

OpenAI

chatgpt-4o-latest

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 3
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ⚠️ PARTIAL 0.1000 1.0000 1/10 9
each ⚠️ PARTIAL 0.4000 1.0000 4/10 6
effect ❌ FAIL 0.0000 0.0000 0/10 13
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 9

gpt-4.1-2025-04-14

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 3
derived ❌ FAIL 0.0000 0.0000 0/10 10
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ⚠️ PARTIAL 0.2000 1.0000 2/10 9
effect ⚠️ PARTIAL 0.1000 1.0000 1/10 10
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 19
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-4.1-mini-2025-04-14

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 13
derived-by ❌ FAIL 0.0000 0.0000 0/10 12
each ⚠️ PARTIAL 0.1000 1.0000 1/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 16
hello-world ❌ FAIL 0.0000 0.0000 0/10 10
inspect ❌ FAIL 0.0000 0.0000 0/10 28
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-4.1-nano-2025-04-14

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 14
derived ❌ FAIL 0.0000 0.0000 0/10 13
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 13
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

gpt-4o-2024-08-06

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 14
derived-by ⚠️ PARTIAL 0.4000 1.0000 4/10 14
each ❌ FAIL 0.0000 0.0000 0/10 11
effect ⚠️ PARTIAL 0.2000 1.0000 2/10 14
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 14

o1-pro-2025-03-19

Test Status pass@1 pass@10 Passing Samples Errors
counter ❌ FAIL 0.0000 0.0000 0/1 1
derived ❌ FAIL 0.0000 0.0000 0/1 1
derived-by ❌ FAIL 0.0000 0.0000 0/1 3
each ❌ FAIL 0.0000 0.0000 0/1 1
effect ❌ FAIL 0.0000 0.0000 0/1 1
hello-world ✅ PASS 1.0000 1.0000 1/1 0
inspect ❌ FAIL 0.0000 0.0000 0/1 1
props ❌ FAIL 0.0000 0.0000 0/1 1
snippets ❌ FAIL 0.0000 0.0000 0/1 1

o3-2025-04-16

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.6000 1.0000 6/10 4
derived ⚠️ PARTIAL 0.1000 1.0000 1/10 11
derived-by ⚠️ PARTIAL 0.2000 1.0000 2/10 8
each ⚠️ PARTIAL 0.6000 1.0000 6/10 4
effect ⚠️ PARTIAL 0.2000 1.0000 2/10 11
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

o3-mini-2025-01-31

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.3000 1.0000 3/10 13
derived ❌ FAIL 0.0000 0.0000 0/10 11
derived-by ⚠️ PARTIAL 0.1000 1.0000 1/10 9
each ⚠️ PARTIAL 0.1000 1.0000 1/10 9
effect ❌ FAIL 0.0000 0.0000 0/10 17
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

o4-mini-2025-04-16

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.1000 1.0000 1/10 12
derived ❌ FAIL 0.0000 0.0000 0/10 12
derived-by ❌ FAIL 0.0000 0.0000 0/10 10
each ⚠️ PARTIAL 0.2000 1.0000 2/10 8
effect ❌ FAIL 0.0000 0.0000 0/10 11
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

OpenRouter

deepseek/deepseek-r1-0528

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.3000 1.0000 3/10 22
derived ⚠️ PARTIAL 0.6000 1.0000 6/10 6
derived-by ⚠️ PARTIAL 0.9000 1.0000 9/10 1
each ⚠️ PARTIAL 0.2000 1.0000 2/10 10
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ⚠️ PARTIAL 0.4000 1.0000 4/10 6
snippets ❌ FAIL 0.0000 0.0000 0/10 10

meta-llama/llama-4-maverick

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.8000 1.0000 8/10 2
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ❌ FAIL 0.0000 0.0000 0/10 10

meta-llama/llama-4-scout

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.9000 1.0000 9/10 1
derived ⚠️ PARTIAL 0.2000 1.0000 2/10 8
derived-by ❌ FAIL 0.0000 0.0000 0/10 11
each ❌ FAIL 0.0000 0.0000 0/10 11
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.3000 1.0000 3/10 11
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/devstral-small

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.7000 1.0000 7/10 6
derived ⚠️ PARTIAL 0.1000 1.0000 1/10 13
derived-by ⚠️ PARTIAL 0.1000 1.0000 1/10 13
each ❌ FAIL 0.0000 0.0000 0/10 10
effect ❌ FAIL 0.0000 0.0000 0/10 10
hello-world ⚠️ PARTIAL 0.3000 1.0000 3/10 7
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 10
snippets ❌ FAIL 0.0000 0.0000 0/10 10

mistralai/mistral-medium-3

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ⚠️ PARTIAL 0.8000 1.0000 8/10 2
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ⚠️ PARTIAL 0.9000 1.0000 9/10 1
inspect ❌ FAIL 0.0000 0.0000 0/10 16
props ⚠️ PARTIAL 0.8000 1.0000 8/10 2
snippets ❌ FAIL 0.0000 0.0000 0/10 10

qwen/qwen3-30b-a3b

Test Status pass@1 pass@10 Passing Samples Errors
counter ⚠️ PARTIAL 0.6000 1.0000 6/10 7
derived ⚠️ PARTIAL 0.3000 1.0000 3/10 8
derived-by ⚠️ PARTIAL 0.3000 1.0000 3/10 9
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.3000 1.0000 3/10 11
hello-world ⚠️ PARTIAL 0.7000 1.0000 7/10 5
inspect ❌ FAIL 0.0000 0.0000 0/10 13
props ❌ FAIL 0.0000 0.0000 0/10 22
snippets ❌ FAIL 0.0000 0.0000 0/10 10

x-ai/grok-3-beta

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ✅ PASS 1.0000 1.0000 10/10 0
each ✅ PASS 1.0000 1.0000 10/10 0
effect ✅ PASS 1.0000 1.0000 10/10 0
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 10
props ✅ PASS 1.0000 1.0000 10/10 0
snippets ⚠️ PARTIAL 0.5000 1.0000 5/10 5

x-ai/grok-3-mini-beta

Test Status pass@1 pass@10 Passing Samples Errors
counter ✅ PASS 1.0000 1.0000 10/10 0
derived ✅ PASS 1.0000 1.0000 10/10 0
derived-by ⚠️ PARTIAL 0.6000 1.0000 6/10 7
each ⚠️ PARTIAL 0.9000 1.0000 9/10 1
effect ⚠️ PARTIAL 0.7000 1.0000 7/10 6
hello-world ✅ PASS 1.0000 1.0000 10/10 0
inspect ❌ FAIL 0.0000 0.0000 0/10 40
props ⚠️ PARTIAL 0.8000 1.0000 8/10 2
snippets ⚠️ PARTIAL 0.1000 1.0000 1/10 11
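
The pass@1 and pass@10 columns in the tables above are consistent with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c is the number that passed. The page does not say how SvelteBench actually computes these values, so the following is only a sketch. Clamping k to n is an assumption, chosen because it reproduces the o1-pro rows, where a single sample was drawn. The `status` helper likewise encodes a rule inferred from the rows (all samples pass, none pass, or something in between), not a documented one.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if n <= 0 or not 0 <= c <= n or k <= 0:
        raise ValueError("expected n > 0, 0 <= c <= n, and k > 0")
    # Assumption: when fewer than k samples exist, evaluate at k = n.
    k = min(k, n)
    # comb(n - c, k) is 0 when n - c < k, which yields 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def status(n: int, c: int) -> str:
    """Status label inferred from the Passing Samples column."""
    if c == n:
        return "PASS"
    if c == 0:
        return "FAIL"
    return "PARTIAL"


if __name__ == "__main__":
    # (passed, total) pairs taken from rows in the tables above.
    for c, n in [(10, 10), (6, 10), (0, 10), (0, 1)]:
        print(f"{status(n, c)} {pass_at_k(n, c, 1):.4f} {pass_at_k(n, c, 10):.4f} {c}/{n}")
    # PASS 1.0000 1.0000 10/10
    # PARTIAL 0.6000 1.0000 6/10
    # FAIL 0.0000 0.0000 0/10
    # FAIL 0.0000 0.0000 0/1
```

With ten samples per test, pass@10 is 1.0000 whenever at least one sample passes, which is why every PARTIAL row shows 1.0000 in that column regardless of its pass@1 value.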