10x Bench Results

See how LLMs perform when building the Przeprogramowani.pl website with the Astro + React + Tailwind + Cloudflare stack.

Generated: 2/26/2026, 3:11:30 PM • Total attempts: 74

Model Family Rankings

Only latest models
1

GPT-5.3-Codex

10 attempts via Codex Desktop (High Effort)

Cost: $1.75 / $14

85.0%

Average: 8.5/10.0

2

Claude Opus 4.6

10 attempts via Claude Code (High Effort)

Cost: $5 / $25

75.5%

Average: 7.5/10.0

3

Claude Sonnet 4.6

5 attempts via Claude Code (High Effort)

Cost: $3 / $15

71.0%

Average: 7.1/10.0

4

Minimax M2.5

5 attempts via OpenCode

Cost: $0.3 / $2.4

69.0%

Average: 6.9/10.0

5

GLM-5

5 attempts via OpenCode

Cost: $0.3 / $2.55

68.0%

Average: 6.8/10.0

6

Gemini 3.1 Pro

5 attempts via Cursor

Cost: $2 / $12

67.0%

Average: 6.7/10.0

7

Kimi K2.5

5 attempts via OpenCode

Cost: $0.6 / $3

63.0%

Average: 6.3/10.0

8

Grok Code Fast 1

5 attempts via OpenCode

Cost: $0.2 / $1.5

59.0%

Average: 5.9/10.0

9

Qwen 3 Max

3 attempts via OpenCode

Cost: $1.2 / $6

45.0%

Average: 4.5/10.0

10

Devstral 2

3 attempts via OpenCode

Cost: $0.4 / $2

16.7%

Average: 1.7/10.0

Detailed Comparison

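GPT-5.3-Codex

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 | Attempt 6 | Attempt 7 | Attempt 8 | Attempt 9 | Attempt 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Tech stack | 0.5 | 1 | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 1 |
| O nas page | 0.5 | 0.5 | 1 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 0.5 | 0.5 | 1 | 0.5 |
| YouTube page | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Kursy section | 0.5 | 1 | 1 | 0.5 | 1 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Responsive design | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.5 | 1 | 1 |
| SEO Tags | 1 | 0.5 | 0.5 | 1 | 0.5 | 1 | 1 | 1 | 0.5 | 0.5 |
| Penalty | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Task completion time | 9min 19s | 9min 24s | 9min 9s | 8min 16s | 9min 40s | 8min 0s | 8min 20s | 9min 25s | 8min 36s | 8min 19s |
| Test run | 9.02.2026 16:40 | 9.02.2026 16:40 | 9.02.2026 16:40 | 9.02.2026 22:58 | 9.02.2026 22:58 | 11.02.2026 21:41 | 11.02.2026 21:45 | 11.02.2026 21:46 | 11.02.2026 21:48 | 11.02.2026 21:50 |

Claude Opus 4.6

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 | Attempt 6 | Attempt 7 | Attempt 8 | Attempt 9 | Attempt 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Tech stack | 0.5 | 1 | 1 | 1 | 0.5 | 0.5 | 0 | 1 | 0.5 | 0.5 |
| O nas page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 1 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 |
| Kursy section | 0.5 | 1 | 0.5 | 0.5 | 1 | 0.5 | 0.5 | 1 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| Responsive design | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0.5 | 1 |
| SEO Tags | 0.5 | 1 | 1 | 1 | 1 | 1 | 0.5 | 1 | 1 | 1 |
| Penalty | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Task completion time | 9min 36s | 10min 20s | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Test run | 9.02.2026 16:40 | 9.02.2026 16:40 | 9.02.2026 16:40 | 9.02.2026 22:45 | 9.02.2026 23:05 | 11.02.2026 21:28 | 11.02.2026 21:34 | 11.02.2026 21:32 | 11.02.2026 21:38 | 11.02.2026 21:40 |

Claude Sonnet 4.6

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 |
|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 1 | 1 |
| Tech stack | 0.5 | 0 | 0 | 0 | 0 |
| O nas page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Kursy section | 0.5 | 1 | 0.5 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 1 | 1 | 1 |
| Responsive design | 1 | 1 | 1 | 1 | 1 |
| SEO Tags | 0.5 | 1 | 1 | 1 | 1 |
| Penalty | N/A | N/A | N/A | N/A | N/A |
| Task completion time | N/A | N/A | N/A | N/A | N/A |
| Test run | 17.02.2026 21:42 | 17.02.2026 21:44 | 17.02.2026 21:49 | 17.02.2026 21:50 | 17.02.2026 21:55 |

Minimax M2.5

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 |
|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 1 | 1 |
| Tech stack | 0.5 | 1 | 1 | 1 | 1 |
| O nas page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 0.5 | 0.5 | 1 | 0.5 | 1 |
| Kursy section | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 1 | 1 | 0 |
| Responsive design | 0.5 | 1 | 0.5 | 0.5 | 0 |
| SEO Tags | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Penalty | N/A | N/A | N/A | N/A | N/A |
| Task completion time | 7min 52s | 6min 48s | 5min 52s | 7min 1s | 6min 15s |
| Test run | 12.02.2026 19:34 | 12.02.2026 19:40 | 12.02.2026 19:39 | 12.02.2026 19:42 | 12.02.2026 19:45 |

GLM-5

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 |
|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 1 | 1 |
| Tech stack | 0.5 | 0 | 0.5 | 0.5 | 0.5 |
| O nas page | 0.5 | 1 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 0.5 | 0.5 | 1 | 1 | 1 |
| Kursy section | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 1 | 1 | 1 |
| Responsive design | 0.5 | 1 | 1 | 1 | 1 |
| SEO Tags | 1 | 0.5 | 0.5 | 0.5 | 1 |
| Penalty | -1 | -1 | N/A | N/A | -1 |
| Task completion time | N/A | N/A | N/A | N/A | N/A |
| Test run | N/A | 16.02.2026 07:36 | 16.02.2026 08:36 | 16.02.2026 12:32 | 16.02.2026 09:05 |

Gemini 3.1 Pro

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 |
|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 1 | 1 |
| Tech stack | 1 | 1 | 0 | 1 | 0.5 |
| O nas page | 0 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 0 | 0.5 | 0.5 | 0.5 | 0.5 |
| Kursy section | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 0 | 1 | 1 |
| Responsive design | 1 | 1 | 1 | 1 | 1 |
| SEO Tags | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Penalty | N/A | N/A | N/A | N/A | N/A |
| Task completion time | N/A | N/A | N/A | N/A | N/A |
| Test run | 26.02.2026 14:38 | 26.02.2026 14:41 | 26.02.2026 14:47 | 26.02.2026 14:53 | 26.02.2026 15:20 |

Kimi K2.5

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 |
|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 1 | 1 | 0 | 1 |
| Tech stack | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| O nas page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 0.5 | 1 | 1 | 0.5 | 0.5 |
| Kursy section | 0.5 | 0.5 | 1 | 0.5 | 0.5 |
| Consistent UI | 1 | 1 | 1 | 0 | 0 |
| Responsive design | 1 | 1 | 1 | 0 | 0 |
| SEO Tags | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Penalty | N/A | N/A | N/A | N/A | N/A |
| Task completion time | 16min 15s | 16min 21s | 8min 18s | 20min 39s | 15min 36s |
| Test run | 9.02.2026 19:10 | 9.02.2026 19:10 | 9.02.2026 19:10 | 9.02.2026 23:37 | 9.02.2026 23:37 |

Grok Code Fast 1

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 | Attempt 4 | Attempt 5 |
|---|---|---|---|---|---|
| Local build | 1 | 1 | 1 | 1 | 1 |
| Manual testing | 1 | 0 | 1 | 0 | 1 |
| Tech stack | 0 | 0 | 1 | 0 | 0.5 |
| O nas page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Podcast page | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| YouTube page | 1 | 1 | 0.5 | 1 | 1 |
| Kursy section | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Consistent UI | 1 | 0 | 1 | 1 | 0 |
| Responsive design | 1 | 0 | 0.5 | 0.5 | 0 |
| SEO Tags | 1 | 0.5 | 0.5 | 0.5 | 0.5 |
| Penalty | N/A | N/A | N/A | N/A | N/A |
| Task completion time | N/A | N/A | N/A | N/A | N/A |
| Test run | 12.02.2026 20:00 | 12.02.2026 19:55 | N/A | 12.02.2026 20:05 | 12.02.2026 20:20 |

Qwen 3 Max

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 |
|---|---|---|---|
| Local build | 1 | 1 | 0 |
| Manual testing | 1 | 1 | 0 |
| Tech stack | 0 | 0.5 | 0 |
| O nas page | 0.5 | 0.5 | 0 |
| Podcast page | 0.5 | 0.5 | 0 |
| YouTube page | 1 | 1 | 0 |
| Kursy section | 0.5 | 0.5 | 0 |
| Consistent UI | 1 | 1 | 0 |
| Responsive design | 0.5 | 0.5 | 0 |
| SEO Tags | 0.5 | 0.5 | 0 |
| Penalty | N/A | N/A | N/A |
| Task completion time | N/A | 8min 23s | 5min 44s |
| Test run | 9.02.2026 19:40 | 9.02.2026 19:40 | 9.02.2026 19:40 |

Devstral 2

| Criterion | Attempt 1 | Attempt 2 | Attempt 3 |
|---|---|---|---|
| Local build | 0 | 0 | 1 |
| Manual testing | 0 | 0 | 1 |
| Tech stack | 0 | 0 | 0 |
| O nas page | 0 | 0 | 0.5 |
| Podcast page | 0 | 0 | 0.5 |
| YouTube page | 0 | 0 | 0.5 |
| Kursy section | 0 | 0 | 0.5 |
| Consistent UI | 0 | 0 | 1 |
| Responsive design | 0 | 0 | 0 |
| SEO Tags | 0 | 0 | 0 |
| Penalty | N/A | N/A | N/A |
| Task completion time | 9min 45s | 3min 27s | 2min 20s |
| Test run | 9.02.2026 19:35 | 9.02.2026 19:35 | 9.02.2026 19:35 |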

Score Legend

1.0 Full points - criterion met perfectly
0.5 Partial points - criterion mostly met
0.0 No points - criterion not met
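
The page does not spell out how the per-criterion scores roll up into the ranking figures. The following is a minimal sketch, assuming an attempt's total is the sum of its ten criterion scores plus any penalty, and that a family's percentage is the mean attempt total divided by the 10-point maximum. Under that assumption the GLM-5 scores from the comparison data reproduce the published 6.8/10.0 and 68.0%. The function names and the use of `None` for "N/A" are illustrative choices, not part of the benchmark.

```python
from statistics import mean

# Criterion scores for the five GLM-5 attempts, in the order the page lists
# the criteria: Local build, Manual testing, Tech stack, O nas page,
# Podcast page, YouTube page, Kursy section, Consistent UI,
# Responsive design, SEO Tags.
GLM5_SCORES = [
    [1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 1, 0.5, 1],
    [1, 1, 0, 1, 0.5, 0.5, 0.5, 1, 1, 0.5],
    [1, 1, 0.5, 0.5, 0.5, 1, 0.5, 1, 1, 0.5],
    [1, 1, 0.5, 0.5, 0.5, 1, 0.5, 1, 1, 0.5],
    [1, 1, 0.5, 0.5, 0.5, 1, 0.5, 1, 1, 1],
]

# Penalty row for the same attempts; None stands for "N/A" (assumed to mean 0).
GLM5_PENALTIES = [-1, -1, None, None, -1]

MAX_SCORE = 10.0  # ten criteria, one point each


def attempt_total(scores, penalty):
    """Sum the criterion scores and apply the penalty, if there is one."""
    return sum(scores) + (penalty or 0)


def family_summary(all_scores, penalties):
    """Return (per-attempt totals, average total, percentage of the maximum)."""
    totals = [attempt_total(s, p) for s, p in zip(all_scores, penalties)]
    average = mean(totals)
    return totals, average, 100 * average / MAX_SCORE


totals, average, percent = family_summary(GLM5_SCORES, GLM5_PENALTIES)
print(totals, f"{average:.1f}/10.0", f"{percent:.1f}%")
```

The same roll-up applied to the other families also matches their listed percentages, which suggests the displayed "Average" is the percentage value cut to one decimal place (for example 75.5% shown as 7.5/10.0 for Claude Opus 4.6), though the page does not confirm this.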