tldr:
- gpt-5.2 and gpt-5.1-codex-max have identical pass rates but solve different tasks:
  - 36 tasks are passed by both models
  - each model passes 12 tasks that the other fails
- gpt-5.2-pro consistently underperforms the other two models by ~7-9 percentage points
- gpt-5.2-pro hits far more timeouts (26 vs 7-8 for the other two)
- Extended timeouts recover additional passes: a 3x timeout multiplier recovers ~5-7 passes per model
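
The overlap numbers above come down to set arithmetic on each model's passing tasks: 36 shared plus 12 unique gives 48 passes per model, which is why the rates match even though the task sets differ. A minimal sketch of that comparison, assuming each model's results are available as a set of passing task IDs. The function name `compare_passes` and the task IDs are hypothetical and used only for illustration; they are not taken from the actual run data.

```python
def compare_passes(passed_a: set[str], passed_b: set[str]) -> dict[str, set[str]]:
    """Split two models' passing tasks into shared and model-specific sets."""
    return {
        "common": passed_a & passed_b,   # tasks both models pass
        "only_a": passed_a - passed_b,   # tasks only model A passes
        "only_b": passed_b - passed_a,   # tasks only model B passes
    }


# Toy, made-up task IDs (not real benchmark data).
passed_a = {"task-01", "task-02", "task-03"}
passed_b = {"task-02", "task-03", "task-04"}

overlap = compare_passes(passed_a, passed_b)

# Equal pass counts do not imply the same tasks were solved.
same_pass_count = len(passed_a) == len(passed_b)
same_tasks = passed_a == passed_b
```

With the toy data, both models pass three tasks, yet each has one task the other misses, which is the same pattern as the 36 common / 12 unique split at a smaller scale.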