10x Bench Results

See how LLMs perform at building the Przeprogramowani.pl website with an Astro + React + Tailwind + Cloudflare stack.

117 evaluated attempts • Last updated June 28, 2026

Model Family Rankings

Only latest models

GPT-5.4

5 attempts via Codex Desktop (High Effort)

Rate: $1.75 / $14 per 1M

91.0%

Average: 9.1/10.0

GPT-5.3-Codex

10 attempts via Codex Desktop (High Effort)

Rate: $1.75 / $14 per 1M

85.0%

Average: 8.5/10.0

GLM-5.2

5 attempts via OpenCode

Rate: $1.4 / $4.4 per 1M

Avg run: $0.43 · total $2.17

85.0%

Average: 8.5/10.0

DeepSeek V4 Pro

5 attempts via OpenCode

Rate: $0.435 / $0.87 per 1M

Avg run: $0.12 · total $0.62

84.0%

Average: 8.4/10.0

Qwen 3.6

3 attempts via Claude Code (High Effort)

Rate: $0.15 / $1 per 1M

83.3%

Average: 8.3/10.0

Claude Fable 5

5 attempts via Claude Desktop

Rate: $10 / $50 per 1M

83.0%

Average: 8.3/10.0

GPT-5.5

5 attempts via Codex Desktop (High Effort)

Rate: $5 / $30 per 1M

78.0%

Average: 7.8/10.0

Gemini 3.5 Flash

5 attempts via OpenCode

Rate: $1.5 / $9 per 1M

75.0%

Average: 7.5/10.0

Claude Sonnet 4.6

5 attempts via Claude Code (High Effort)

Rate: $3 / $15 per 1M

71.0%

Average: 7.1/10.0

Claude Opus 4.7

5 attempts via Claude Desktop

Rate: $5 / $25 per 1M

70.0%

Average: 7.0/10.0

Minimax M2.5

5 attempts via OpenCode

Rate: $0.3 / $2.4 per 1M

69.0%

Average: 6.9/10.0

Gemini 3.1 Pro

5 attempts via Cursor

Rate: $2 / $12 per 1M

67.0%

Average: 6.7/10.0

Grok Code Fast 1

5 attempts via OpenCode

Rate: $0.2 / $1.5 per 1M

59.0%

Average: 5.9/10.0

Qwen 3 Max

3 attempts via OpenCode

Rate: $1.2 / $6 per 1M

45.0%

Average: 4.5/10.0

Devstral 2

3 attempts via OpenCode

Rate: $0.4 / $2 per 1M

16.7%

Average: 1.7/10.0
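
For the two cards that list run costs (GLM-5.2 and DeepSeek V4 Pro), the average run cost appears to be the total cost divided by the number of attempts. The sketch below only checks that arithmetic; the page does not state how the figure is derived, and the function name `avg_run_cost` is hypothetical.

```python
def avg_run_cost(total_usd: float, attempts: int) -> float:
    """Mean cost per attempt in USD (assumed: total / attempts)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return total_usd / attempts


# Totals and attempt counts as listed on the ranking cards.
glm = avg_run_cost(2.17, 5)       # GLM-5.2
deepseek = avg_run_cost(0.62, 5)  # DeepSeek V4 Pro

print(f"GLM-5.2: ${glm:.2f} per run")
print(f"DeepSeek V4 Pro: ${deepseek:.2f} per run")
```

Rounded to cents, both results match the "Avg run" values shown on the cards ($0.43 and $0.12).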

Detailed Comparison

Review model averages first, switch to all attempts for scoring notes, or compare two selected models below.

Score Legend

1.0: Full points - criterion met perfectly

0.5: Partial points - criterion mostly met

0.0: No points - criterion not met

Showing one averaged column per model.

Criterion	GPT-5.4 (5 attempts)	GLM-5.2 (5 attempts)	GPT-5.3-Codex (10 attempts)	DeepSeek V4 Pro (5 attempts)	Qwen 3.6 (3 attempts)	Claude Fable 5 (5 attempts)	GPT-5.5 (5 attempts)	Gemini 3.5 Flash (5 attempts)	Claude Sonnet 4.6 (5 attempts)	Claude Opus 4.7 (5 attempts)	Minimax M2.5 (5 attempts)	Gemini 3.1 Pro (5 attempts)	Grok Code Fast 1 (5 attempts)	Qwen 3 Max (3 attempts)	Devstral 2 (3 attempts)
Local build	1	1	1	1	1	1	1	1	1	1	1	1	1	0.67	0.33
Manual testing	1	1	1	1	1	1	1	1	1	1	1	1	0.60	0.67	0.33
Tech stack	1	0.50	0.90	0.40	0.33	0	0.40	0.20	0.10	0.10	0.90	0.70	0.30	0.17	0
O nas page	0.90	0.50	0.55	1	0.83	0.90	0.60	0.60	0.50	0.50	0.50	0.40	0.50	0.33	0.17
Podcast page	0.70	0.50	0.70	0.50	0.67	0.50	0.50	0.50	0.50	0.50	0.50	0.40	0.50	0.33	0.17
YouTube page	1	1	1	0.90	0.50	0.90	0.90	0.70	0.50	0.50	0.70	0.40	0.90	0.67	0.17
Kursy section	0.70	1	0.65	1	1	1	0.60	0.60	0.60	0.50	0.50	0.50	0.50	0.33	0.17
Consistent UI	1	1	1	1	1	1	1	1	1	1	0.80	0.80	0.60	0.67	0.33
Responsive design	0.80	1	0.95	1	1	1	0.90	1	1	0.90	0.50	1	0.40	0.33	0
SEO Tags	1	1	0.75	0.60	1	1	0.90	0.90	0.90	1	0.50	0.50	0.60	0.33	0
Penalty	0	0	N/A	0	0	0	0	0	0	0	N/A	0	N/A	N/A	N/A

Dots show the individual attempts that make up each model average.
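
The overall scores on the ranking cards appear to equal the sum of the ten criterion averages, with the percentage being that sum expressed out of 10. The sketch below checks this for two models. Treating the penalty as a subtraction is an assumption, since every penalty listed is 0 or N/A, and the name `overall_score` is hypothetical.

```python
from typing import Optional


def overall_score(criteria: list[float], penalty: Optional[float] = None) -> float:
    """Sum the criterion averages; subtract the penalty if one is given.

    Assumption: an N/A penalty (None) counts as 0.
    """
    return round(sum(criteria) - (penalty or 0.0), 2)


# Criterion averages copied from the table, in row order
# (Local build ... SEO Tags).
gpt_5_4 = [1, 1, 1, 0.90, 0.70, 1, 0.70, 1, 0.80, 1]
gpt_5_3_codex = [1, 1, 0.90, 0.55, 0.70, 1, 0.65, 1, 0.95, 0.75]

for name, scores, penalty in [
    ("GPT-5.4", gpt_5_4, 0.0),
    ("GPT-5.3-Codex", gpt_5_3_codex, None),  # penalty listed as N/A
]:
    total = overall_score(scores, penalty)
    print(f"{name}: {total:.1f}/10.0 ({total * 10:.1f}%)")
```

For these two columns the sums come to 9.1 and 8.5, matching the 91.0% and 85.0% shown in the rankings.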

Model Comparison

Compare two model families by average criterion score and attempt spread.

Select models from the ranking cards or adjust the pair here.

Model A: GPT-5.4, 91.0% (9.1/10.0 avg across 5 attempts)

Model B: GPT-5.3-Codex, 85.0% (8.5/10.0 avg across 10 attempts)

Overall delta: A +6.0 pts

Criterion	GPT-5.4	Difference	GPT-5.3-Codex
Local build	1	Tie	1
Manual testing	1	Tie	1
Tech stack	1	A +0.10	0.90
O nas page	0.90	A +0.35	0.55
Podcast page	0.70	Tie	0.70
YouTube page	1	Tie	1
Kursy section	0.70	A +0.05	0.65
Consistent UI	1	Tie	1
Responsive design	0.80	B +0.15	0.95
SEO Tags	1	A +0.25	0.75
Penalty	0	N/A	N/A
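
The Difference column can be reproduced by subtracting model B's criterion average from model A's and labelling the leader. This is a minimal sketch of that rule as inferred from the table, not the site's actual code; the function name `difference` is hypothetical, and a missing value is represented here as `None`.

```python
from typing import Optional


def difference(a: Optional[float], b: Optional[float]) -> str:
    """Label the gap between model A and model B for one criterion."""
    if a is None or b is None:
        return "N/A"  # one side has no data, as in the Penalty row
    delta = round(a - b, 2)  # round away floating-point noise
    if delta == 0:
        return "Tie"
    leader = "A" if delta > 0 else "B"
    return f"{leader} +{abs(delta):.2f}"


# (GPT-5.4, GPT-5.3-Codex) averages for a few rows of the table.
rows = {
    "Tech stack": (1, 0.90),
    "O nas page": (0.90, 0.55),
    "Podcast page": (0.70, 0.70),
    "Responsive design": (0.80, 0.95),
    "Penalty": (0, None),
}

for criterion, (a, b) in rows.items():
    print(f"{criterion}: {difference(a, b)}")
```

Rounding to two decimals before comparing keeps results such as 1 - 0.90 from showing up as a spurious non-tie or an off-by-one-cent label.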
