Benchmark Results

A comparison of different LLMs vibe coding the Przeprogramowani.pl website

Generated: 2/8/2026, 8:55:57 PM • Total attempts: 15

Model Family Rankings

Rank  Model family      Attempts  Average    Percent
1     GPT-5.3-Codex     3         9.5/10.0   95.0%
2     Claude Opus 4.6   3         8.5/10.0   85.0%
3     Gemini 3 Pro      3         6.2/10.0   61.7%
4     Kimi K2.5         3         5.7/10.0   56.7%
5     GLM-4.7           3         0.2/10.0   1.7%

Detailed Comparison

Criterion               GPT-5.3-Codex    Claude Opus 4.6  Gemini 3 Pro     Kimi K2.5        GLM-4.7
                        A1   A2   A3     A1   A2   A3     A1   A2   A3     A1   A2   A3     A1   A2   A3
Task completion time    N/A  N/A  N/A    N/A  N/A  N/A    N/A  N/A  N/A    N/A  N/A  N/A    N/A  N/A  N/A
Local build             1    1    1      1    1    1      1    0    1      1    1    1      0    0    0
Manual testing          1    1    1      1    1    1      1    0    1      1    1    0      0    0    0
Tech stack              1    1    1      1    1    1      1    0    0.5    0.5  0.5  0      0    0    0.5
O nas page              1    1    0.5    1    0.5  1      1    1    0.5    0.5  1    0      0    0    0
Podcast page            1    1    1      0.5  0.5  1      0.5  0.5  0.5    0    0    0      0    0    0
YouTube page            1    1    1      0.5  0    1      0.5  0.5  0      0    0.5  0      0    0    0
Kursy section           1    1    1      1    1    1      1    1    1      1    1    0.5    0    0    0
Consistent UI           1    1    1      1    0.5  1      0.5  0    1      1    1    0      0    0    0
Responsive design       0.5  1    0.5    1    0.5  0.5    1    0    1      1    1    0      0    0    0
SEO Tags                1    1    1      1    1    1      1    0    0.5    1    1    0.5    0    0    0

A1, A2, A3 = Attempt 1, Attempt 2, Attempt 3.

Score Legend

1.0 Full points - criterion met perfectly
0.5 Partial points - criterion mostly met
0.0 No points - criterion not met
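
The page does not state how the family averages are derived from the per-criterion scores. The sketch below assumes the simplest reading: each attempt's total is the sum of its ten scored criteria (Task completion time is N/A and left out), and a family's average is the mean of its three attempt totals. Under that assumption the computed values match the published rankings. The names SCORES, attempt_totals, and family_averages are hypothetical and exist only for this illustration.

```python
# Per-attempt scores transcribed from the Detailed Comparison table.
# Column order: three attempts each for the models in MODELS, left to right.
# "Task completion time" is N/A for every attempt, so it is omitted.
MODELS = ["GPT-5.3-Codex", "Claude Opus 4.6", "Gemini 3 Pro", "Kimi K2.5", "GLM-4.7"]
ATTEMPTS_PER_MODEL = 3

SCORES = {
    "Local build":       [1, 1, 1,     1, 1, 1,       1, 0, 1,       1, 1, 1,     0, 0, 0],
    "Manual testing":    [1, 1, 1,     1, 1, 1,       1, 0, 1,       1, 1, 0,     0, 0, 0],
    "Tech stack":        [1, 1, 1,     1, 1, 1,       1, 0, 0.5,     0.5, 0.5, 0, 0, 0, 0.5],
    "O nas page":        [1, 1, 0.5,   1, 0.5, 1,     1, 1, 0.5,     0.5, 1, 0,   0, 0, 0],
    "Podcast page":      [1, 1, 1,     0.5, 0.5, 1,   0.5, 0.5, 0.5, 0, 0, 0,     0, 0, 0],
    "YouTube page":      [1, 1, 1,     0.5, 0, 1,     0.5, 0.5, 0,   0, 0.5, 0,   0, 0, 0],
    "Kursy section":     [1, 1, 1,     1, 1, 1,       1, 1, 1,       1, 1, 0.5,   0, 0, 0],
    "Consistent UI":     [1, 1, 1,     1, 0.5, 1,     0.5, 0, 1,     1, 1, 0,     0, 0, 0],
    "Responsive design": [0.5, 1, 0.5, 1, 0.5, 0.5,   1, 0, 1,       1, 1, 0,     0, 0, 0],
    "SEO Tags":          [1, 1, 1,     1, 1, 1,       1, 0, 0.5,     1, 1, 0.5,   0, 0, 0],
}


def attempt_totals(scores):
    """Sum each table column, giving one total per attempt."""
    return [sum(column) for column in zip(*scores.values())]


def family_averages(scores):
    """Return {model: (average total, percent of maximum)} per model family.

    Assumption: the maximum is one point per scored criterion.
    """
    totals = attempt_totals(scores)
    max_score = len(scores)
    result = {}
    for i, model in enumerate(MODELS):
        start = i * ATTEMPTS_PER_MODEL
        chunk = totals[start:start + ATTEMPTS_PER_MODEL]
        average = sum(chunk) / len(chunk)
        result[model] = (average, 100 * average / max_score)
    return result


if __name__ == "__main__":
    for model, (average, percent) in family_averages(SCORES).items():
        print(f"{model}: {average:.1f}/10.0 ({percent:.1f}%)")
```

One detail this makes visible: the percentages appear to be computed from the unrounded mean rather than from the rounded average, which is why Gemini 3 Pro shows 6.2/10.0 next to 61.7% (its attempt totals under this reading are 8.5, 3, and 7).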