Benchmark Results

Comparing different LLMs vibe coding the Przeprogramowani.pl website

Generated: 2/8/2026, 8:55:57 PM • Total attempts: 15

Model Family Rankings

Rank	Model family	Attempts	Average score	Percentage
1	GPT-5.3-Codex	3	9.5/10.0	95.0%
2	Claude Opus 4.6	3	8.5/10.0	85.0%
3	Gemini 3 Pro	3	6.2/10.0	61.7%
4	Kimi K2.5	3	5.7/10.0	56.7%
5	GLM-4.7	3	0.2/10.0	1.7%

Criterion	GPT-5.3-Codex Attempt 1	GPT-5.3-Codex Attempt 2	GPT-5.3-Codex Attempt 3	Claude Opus 4.6 Attempt 1	Claude Opus 4.6 Attempt 2	Claude Opus 4.6 Attempt 3	Gemini 3 Pro Attempt 1	Gemini 3 Pro Attempt 2	Gemini 3 Pro Attempt 3	Kimi K2.5 Attempt 1	Kimi K2.5 Attempt 2	Kimi K2.5 Attempt 3	GLM-4.7 Attempt 1	GLM-4.7 Attempt 2	GLM-4.7 Attempt 3
Task completion time	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
Local build	1	1	1	1	1	1	1	0	1	1	1	1	0	0	0
Manual testing	1	1	1	1	1	1	1	0	1	1	1	0	0	0	0
Tech stack	1	1	1	1	1	1	1	0	0.5	0.5	0.5	0	0	0	0.5
O nas (About us) page	1	1	0.5	1	0.5	1	1	1	0.5	0.5	1	0	0	0	0
Podcast page	1	1	1	0.5	0.5	1	0.5	0.5	0.5	0	0	0	0	0	0
YouTube page	1	1	1	0.5	0	1	0.5	0.5	0	0	0.5	0	0	0	0
Kursy (Courses) section	1	1	1	1	1	1	1	1	1	1	1	0.5	0	0	0
Consistent UI	1	1	1	1	0.5	1	0.5	0	1	1	1	0	0	0	0
Responsive design	0.5	1	0.5	1	0.5	0.5	1	0	1	1	1	0	0	0	0
SEO Tags	1	1	1	1	1	1	1	0	0.5	1	1	0.5	0	0	0

1.0 Full points - criterion met perfectly

0.5 Partial points - criterion mostly met

0.0 No points - criterion not met
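
The family averages can be cross-checked against the per-attempt scores. The sketch below assumes that an attempt's score is the plain sum of the ten scored criteria (Task completion time is N/A everywhere and is left out), and that a family's average is the mean of its three attempt totals. The benchmark's own scoring code is not shown here, so the function and constant names are hypothetical.

```python
# Cross-check of the family averages, under the assumptions stated above.
# Each inner list is one attempt, in table row order:
# Local build, Manual testing, Tech stack, O nas page, Podcast page,
# YouTube page, Kursy section, Consistent UI, Responsive design, SEO Tags.
SCORES = {
    "GPT-5.3-Codex": [
        [1, 1, 1, 1, 1, 1, 1, 1, 0.5, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0.5, 1, 1, 1, 1, 0.5, 1],
    ],
    "Claude Opus 4.6": [
        [1, 1, 1, 1, 0.5, 0.5, 1, 1, 1, 1],
        [1, 1, 1, 0.5, 0.5, 0, 1, 0.5, 0.5, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0.5, 1],
    ],
    "Gemini 3 Pro": [
        [1, 1, 1, 1, 0.5, 0.5, 1, 0.5, 1, 1],
        [0, 0, 0, 1, 0.5, 0.5, 1, 0, 0, 0],
        [1, 1, 0.5, 0.5, 0.5, 0, 1, 1, 1, 0.5],
    ],
    "Kimi K2.5": [
        [1, 1, 0.5, 0.5, 0, 0, 1, 1, 1, 1],
        [1, 1, 0.5, 1, 0, 0.5, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0, 0.5, 0, 0, 0.5],
    ],
    "GLM-4.7": [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0.5, 0, 0, 0, 0, 0, 0, 0],
    ],
}

MAX_SCORE = 10.0  # ten scored criteria, one point each


def attempt_totals(attempts):
    """Return the summed score of each attempt."""
    return [sum(scores) for scores in attempts]


def family_summary(attempts):
    """Return (average score, percentage), both rounded to one decimal."""
    totals = attempt_totals(attempts)
    average = sum(totals) / len(totals)
    return round(average, 1), round(average / MAX_SCORE * 100, 1)


if __name__ == "__main__":
    for model, attempts in SCORES.items():
        average, percent = family_summary(attempts)
        print(f"{model}: {average}/{MAX_SCORE} ({percent}%)")
```

Under these assumptions the computed averages match the five values listed under Model Family Rankings. They also show how much a single weak attempt moves a three-attempt mean: Gemini 3 Pro's second attempt and Kimi K2.5's third attempt total 3.0 and 2.0 points respectively.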