What if instead of asking a GPT model to answer a question, you dropped it inside a game and watched what it actually did?

UX Research · OpenAI GPT Evaluation · 2026

Evaluating OpenAI's GPT Models Through a Game

A benchmark that drops 11 GPT models into the same concurrency puzzle game and watches how they solve it. Not what they know — what they do. The behavioral differences between model generations are hard to miss.

Method: Game-based evaluation

Models Tested: 11 OpenAI GPT models

Evaluation Runs: 10,001 simulations per submit

Role: Designer · Researcher · Engineer

GPT Models Evaluated: 11 · Turn Limit Per Run: 40 · Simulations to Verify: 10k+ · Interaction Metrics

OpenAI ships new model generations faster than anyone can benchmark them properly. So I stopped trying to measure what they know and started watching what they do. Each model gets dropped into the same concurrency puzzle, given a set of tools and 40 turns. How they use those turns turns out to be more telling than whether they solve it.

The research question: Do GPT models fail on multi-turn coordination tasks because they lack the underlying knowledge, or because they can't maintain a consistent internal model of a dynamic system across repeated tool calls? And does extended reasoning (the o-series architecture) actually address that, or does it just produce more confident wrong answers?

The Problem with How We Compare GPT Models

"Multiple-choice benchmarks don't show you how a GPT model thinks. They only show you whether it guesses correctly."

Standard benchmarks measure what models know, not what they do when things go sideways. MMLU asks trivia. HumanEval asks for a code snippet. Both are useful, but neither tells you much about how a model behaves when it's mid-task, getting feedback it didn't expect, and has to revise what it's already done. That's the situation real agentic workflows create, and it's the one most benchmarks skip entirely.

If you're running GPT in any kind of agentic context, this gap matters. The model isn't answering questions — it's taking actions, seeing what breaks, and deciding what to do next. A multiple-choice comparison between gpt-4o and gpt-5 tells you nothing about how they differ in that situation.

Traditional Benchmarks

One-shot question and answer
No environment interaction
Accuracy as the only signal
No feedback between attempts
Cannot reveal problem-solving strategy
Easily gameable through dataset memorization

ParallelOPM Approach

Multi-turn interaction loop (up to 40 turns)
9 callable tools for board manipulation
Rich behavioral metrics beyond pass/fail
Real-time test feedback after every action
Reveals iteration strategy and uncertainty
Deterministic verification (10,001 simulations)

The Approach: Evaluating GPT Models Through Play

This comes from a basic principle in HCI: if you want to understand how someone reasons, watch what they do, not what they say they'd do. Watching a GPT model answer a question about concurrency is not the same as watching it try to actually solve one.

The game is Parallel, a concurrency coordination puzzle. Threads move along colored tracks on a grid. The player places semaphores and signals to stop them from colliding at delivery points. The puzzle requires genuine reasoning about timing and mutual exclusion — there's nothing to look up, and brute force doesn't work. Every model ran the same board.

Why this works better than standard GPT benchmarks

The pass/fail condition is binary: 10,001 simulation runs, one failure and the solution doesn't count. But the useful data isn't the final score. It's everything in between — which tools the model called, how often it tested before committing, whether it re-read the board when it got stuck. That interaction trace separates gpt-4o from gpt-5 more clearly than any leaderboard number I've found.

Building the harness was also a UX problem in itself. The system prompt had to describe a complex, dynamic state space well enough that the model could reason about it accurately. Tool definitions had to be semantically identical across model variants, since even small schema differences would skew the comparison. Getting that wrong would invalidate everything downstream.

How Parallel Works

Level 8 ("Cherry") is an 11×10 grid with three threads running continuously on colored tracks. Each picks up packages and delivers them to a target cell. The problem is that two threads delivering at the same time causes one to fail. Schedules are randomized, so any solution has to hold across thousands of different timing scenarios.

[Interactive: Parallel game board simulation]

The model must place semaphores and signals to prevent threads from colliding at delivery points. Too many components waste turns; too few and threads collide.

The Tool Interface

Models interact through nine tools. No free-text answers, no descriptions of what the model would hypothetically do. It places a component or it doesn't. That commitment is what makes the interaction log diagnostic.

// The 9 tools available to every model (identical across model variants)
add_semaphore(row, col, initial_state)   // Place a binary lock
add_signal(row, col)                     // Place a trigger signal
link_signal(signal_id, target_id)        // Connect signal to semaphore
unlink_signal(signal_id)                 // Disconnect a signal
remove_element(id)                       // Delete a user-placed component
set_semaphore_state(id, state)           // Toggle initial state
get_board_state()                        // Retrieve current board (uncertainty proxy)
run_test()                               // Single simulation (test hypothesis)
submit()                                 // Verify across 10,001 simulations

get_board_state() turned out to be a useful signal. A model that calls it repeatedly is re-reading the board because it's lost track of what it placed. gpt-5-mini called it 25 times in one failed run. Models that solved cleanly called it once, or not at all.
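To make the idea of an interaction trace concrete, here is a minimal sketch of how behavioral metrics like these could be pulled from a log of tool calls. The `ToolCall` record, the `summarize` function, and the example trace are all hypothetical; the actual harness's log format isn't shown in this write-up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    """One logged tool call (hypothetical record shape)."""
    turn: int                      # 1-based turn in which the call happened
    name: str                      # one of the nine tool names
    passed: Optional[bool] = None  # outcome for run_test / submit, else None


def summarize(trace):
    """Reduce a trace to a few behavioral signals discussed in the text."""
    counts = Counter(call.name for call in trace)
    first_pass = next(
        (c.turn for c in trace if c.name == "run_test" and c.passed), None
    )
    return {
        "board_state_calls": counts["get_board_state"],  # uncertainty proxy
        "tests_run": counts["run_test"],
        "first_passing_test_turn": first_pass,
        "elements_removed": counts["remove_element"],    # churn
    }


# Made-up trace for illustration only, not data from the study.
example_trace = [
    ToolCall(1, "get_board_state"),
    ToolCall(1, "add_semaphore"),
    ToolCall(2, "run_test", passed=False),
    ToolCall(3, "remove_element"),
    ToolCall(3, "add_semaphore"),
    ToolCall(4, "run_test", passed=True),
    ToolCall(5, "submit", passed=True),
]
summary = summarize(example_trace)
print(summary)
```

Keeping the metrics as a pure function of the trace means the same summary can be recomputed later if the set of signals changes.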

Methodology: Fairness as a Design Constraint

The hardest part of building this wasn't the game engine. It was fairness. Different model generations handle tool schemas differently, interpret token budgets differently, and have different defaults for how much they reason before acting. Any of that could tilt the comparison without actually measuring reasoning ability. So everything possible was locked down: one prompt, one schema, one turn limit, one RNG seed for verification.

The budget constraint shaped everything

Running 11 models × 3 runs each, with verification at 10,001 simulations per submit call, isn't free. Reasoning models like o3-mini averaged over a million input tokens per run. The full dataset cost several hundred dollars in API calls. That's not a lot by enterprise research standards, but it was a hard ceiling that directly shaped the study design.

Three runs per model was the chosen tradeoff. It's enough to catch obvious inconsistency (a model that solves twice but fails once, or fails twice but solves once), but it's not enough to make strong statistical claims. The study also covers one level (Level 8) instead of the full six available in the game, with no retries on crashed runs: the three gpt-4o runs wiped out by infrastructure errors were simply lost. These constraints are worth naming because they affect how much confidence to put in the results.

What broke during development

The gated submission rule wasn't in the original design. Early runs showed models submitting immediately after placing a single element — before running any test. They were confident without evidence. The gate (you must pass at least one test before each submit) was added specifically to stop that pattern, because without it the submit logs were noise. o4-mini still found a way to break it: it ran one test, got a pass, then submitted 17 times in a row. The gate blocked 13 of them.

The system prompt went through several iterations too. The first version was around 80 lines. Models couldn't reliably track which elements they'd placed or understand what a semaphore state change actually did to thread movement. The final prompt is 280+ lines — not because the game is that complicated, but because getting a model to hold an accurate mental map of a dynamic grid state across 40 turns requires more scaffolding than expected. That iteration cost time and API budget before a single comparison run.

Identical system prompt

A single SYSTEM_PROMPT constant (280+ lines) is used for all models. No branching, no per-model modifications. The prompt describes the board, the tools, the failure conditions, and the evaluation criteria in structured natural language.

Semantically equivalent tools across model variants

Tool definitions are normalized across all model variants at runtime. gpt-4o, o-series, and gpt-5 all receive semantically identical schemas. The same action produces the same board change regardless of which model calls it.
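One way to keep tool definitions semantically identical is to generate every schema from a single canonical list. The sketch below assumes a JSON-Schema-style function-tool shape and guesses at parameter types (for example, that `initial_state` is a string); neither detail is confirmed by the harness, and only three of the nine tools are shown.

```python
import json

# Hypothetical canonical specs: (name, description, {parameter: JSON type}).
# Parameter types are assumptions made for illustration.
CANONICAL_TOOLS = [
    ("add_semaphore", "Place a binary lock",
     {"row": "integer", "col": "integer", "initial_state": "string"}),
    ("add_signal", "Place a trigger signal",
     {"row": "integer", "col": "integer"}),
    ("run_test", "Single simulation (test hypothesis)", {}),
]


def to_function_schema(name, description, params):
    """Expand one canonical spec into a function-tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in params.items()},
                "required": list(params),
                "additionalProperties": False,
            },
        },
    }


def build_tools():
    """Every model variant gets its tools from this one function."""
    return [to_function_schema(*spec) for spec in CANONICAL_TOOLS]


# Sanity check: the serialized schema is the same for each variant.
variants = ["gpt-4o", "o3-mini", "gpt-5"]
serialized = {m: json.dumps(build_tools(), sort_keys=True) for m in variants}
schemas_match = len(set(serialized.values())) == 1
print(schemas_match)
```

The check is trivial by construction here; it earns its keep once per-variant adapters exist and could drift apart.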

Fixed turn limit (40 turns)

Every model gets exactly 40 turns. Each turn is one model response, potentially containing multiple tool calls. A model that solves in 10 turns is more efficient than one that solves in 30 — not luckier.
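The turn accounting can be sketched as a simple loop: one model response per turn, any number of tool calls inside it, and a hard stop at 40. The `run_episode` function and the scripted stand-ins for the model and the game engine are hypothetical, as is the `"locked"` value for `initial_state`; the real harness calls the API and the game here.

```python
MAX_TURNS = 40  # turn limit stated in the text


def run_episode(model_step, execute_tool, max_turns=MAX_TURNS):
    """Run one episode.

    model_step(turn) returns a list of (tool_name, args) for that turn;
    execute_tool(name, args) returns a result dict. Both are stand-ins.
    """
    for turn in range(1, max_turns + 1):
        for name, args in model_step(turn):
            result = execute_tool(name, args)
            if name == "submit" and result.get("solved"):
                return {"solved": True, "turns_used": turn}
    # Budget exhausted without a successful submit.
    return {"solved": False, "turns_used": max_turns}


def scripted_model(turn):
    """Toy model: place one element, then test and submit on turn 2."""
    if turn == 1:
        return [("add_semaphore", {"row": 2, "col": 4, "initial_state": "locked"})]
    if turn == 2:
        return [("run_test", {}), ("submit", {})]
    return []


def idle_model(turn):
    """Toy model that only re-reads the board and never submits."""
    return [("get_board_state", {})]


def fake_tool(name, args):
    """Toy engine: every submit succeeds, nothing else does."""
    return {"solved": name == "submit"}


solved_run = run_episode(scripted_model, fake_tool)
capped_run = run_episode(idle_model, fake_tool)
print(solved_run, capped_run)
```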

Deterministic verification seed

Submit runs use a fixed RNG seed ('parallelRNGSeed01') to generate 10,000 randomized thread schedules. Every model is tested against the same 10,001 simulations. Same timing scenarios, no exceptions.
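The reason a fixed seed makes verification fair can be shown in a few lines. This sketch uses Python's `random.Random` rather than whatever generator the game actually uses, and the schedule representation (a list of thread indices, 20 steps long) is an assumption made purely for illustration.

```python
import random

SEED = "parallelRNGSeed01"  # seed string quoted in the text
NUM_THREADS = 3             # Level 8 has three threads
NUM_SCHEDULES = 10_000      # randomized schedules, per the text
STEPS = 20                  # hypothetical schedule length


def make_schedules(seed=SEED, count=NUM_SCHEDULES, steps=STEPS):
    """Generate thread-advance orders from a private, seeded generator."""
    rng = random.Random(seed)  # isolated from the global random state
    return [
        [rng.randrange(NUM_THREADS) for _ in range(steps)]
        for _ in range(count)
    ]


first = make_schedules()
second = make_schedules()
other = make_schedules(seed="a-different-seed")

# Same seed, same schedules: every model faces identical timing scenarios.
print(first == second, first == other)
```

Using a private `random.Random` instance instead of the module-level functions keeps the verification schedules unaffected by any other randomness in the process.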

Gated submission

Models cannot submit without at least one passing test. This prevents gaming via brute-force submits and ensures the submit result reflects a genuine solution attempt rather than noise.
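A minimal sketch of the gate, following the stricter "pass a test before each submit" wording used earlier in this write-up. The `SubmitGate` class is hypothetical, and details such as whether a later failed test should revoke an earlier pass are not specified in the text, so this version leaves a pass in place until it is spent.

```python
class SubmitGate:
    """Allow a submit only if a test has passed since the last submit."""

    def __init__(self):
        self.pass_since_last_submit = False
        self.blocked = 0  # number of submits refused so far

    def record_test(self, passed):
        """Call after every run_test with its outcome."""
        if passed:
            self.pass_since_last_submit = True

    def try_submit(self):
        """Return True if the submit may proceed to verification."""
        if not self.pass_since_last_submit:
            self.blocked += 1
            return False
        # A pass is spent by the submit it unlocks.
        self.pass_since_last_submit = False
        return True


gate = SubmitGate()
early = gate.try_submit()          # no test has passed yet
gate.record_test(True)
attempts = [gate.try_submit() for _ in range(3)]
print(early, attempts, gate.blocked)
```

Counting blocked submits alongside allowed ones keeps the refusals in the log, which is what makes a burst of repeated submits visible as a behavioral signal.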

Results: 11 OpenAI Models on the Same Puzzle

Eleven models, three runs each, same puzzle. The spread was wider than I expected. Some models never passed a single test across all 40 turns. Others solved it in under 12.

[Chart: Model Performance. 11 OpenAI GPT models · 3 runs each · Level 8 "Cherry" · 40-turn limit. Legend: all 3 runs solved; partial (1–2 of 3); zero runs solved.]

Average turns used per run. Models that solved in fewer turns demonstrate more targeted reasoning. Models hitting the cap (40) exhausted the budget without converging.

[Scatter plot: turns used vs. input tokens (000s).] Each dot is one run. Position shows turns used vs. input tokens consumed. Size encodes thinking tokens. Green = solved.

Where this diverges from standard benchmarks

The central claim is that this reveals things standard benchmarks miss. That claim only holds if the rankings actually diverge. Here's what the divergence looks like: on reasoning-heavy benchmarks (GPQA, MATH), the o-series models — particularly o3 and o3-mini — sit near the top. On coding benchmarks like SWE-bench, gpt-5.4-nano would rank well below its bigger siblings. The expected order going into this was roughly: o3 ≈ gpt-5 > o3-mini > gpt-5.4 > gpt-4o > gpt-4.1.

That's not what happened.

| Model | Expected (standard benchmarks) | Actual (this task) |
|---|---|---|
| gpt-5.4-nano | Lowest in gpt-5.4 family. Cheap, small, not a flagship. | 3/3, a 100% solve rate. Most efficient model in dataset. |
| o3-mini | High reasoning benchmark scores. OpenAI's extended thinking model. | 0/3, failed all runs. 202k avg thinking tokens. Never converged. |
| gpt-4o | Scores ~88% on MMLU. Strong general-purpose baseline. | 0/3, zero test passes across all runs. Indistinguishable from random placement. |
| gpt-5.4-mini | Mid-tier in gpt-5.4 family. Below gpt-5.4 flagship. | 3/3. Run 2 solved in 5 turns, 80k tokens. Most efficient solve in dataset. |
| o3 | Top-tier reasoning model. Expensive. Designed for hard problems. | 2/3. Efficient when it solved, but still dropped a run. Not the clear winner here. |

The most interesting divergence isn't o3-mini failing — it's gpt-5.4-nano winning. A model designed to be cheap and fast, not deep, outperformed every o-series model on a task those models were ostensibly built for. The standard benchmark hierarchy, where extended reasoning architectures sit above compact fast models, doesn't hold here.

One important caveat: this is one level, one puzzle topology, three runs. It's possible gpt-5.4-nano has seen training data particularly favorable to this type of grid coordination, or that this specific level's structure happens to suit its reasoning pattern. I can't rule that out with 3 runs. What I can say is that the ranking was unexpected, and the interaction logs show gpt-5.4-nano operating with genuine economy — few placements, minimal revisions, first-submit success. That behavioral profile doesn't look like luck.

Key Findings

The number that surprised me most wasn't in the solve rate. o3-mini, OpenAI's deep reasoning model, used over 180,000 thinking tokens per run and solved none of the three attempts. gpt-5.4 used 18,000 and solved all three. More thinking didn't mean better results. What actually separated the models was how efficiently they iterated.

GPT-4 vs GPT-5: a cliff, not a slope

GPT-4 generation models (gpt-4o, gpt-4.1, gpt-4o-mini) each scored 0/3. In the GPT-5 generation, gpt-5 and the gpt-5.4 variants each scored 3/3; gpt-5-mini, at 1/3, was the exception. It wasn't a gradual improvement. Standard benchmarks show these generations as meaningfully different; this showed them as categorically different.

Test-driven iteration predicts success

Every model that eventually solved the puzzle passed a test before turn 15. gpt-4o and gpt-4o-mini placed components across all 40 turns without passing a single test. They weren't making progress. They were iterating without any signal that anything was working.

o-series thinking tokens ≠ success

o3-mini averaged 202k thinking tokens per run and solved nothing. gpt-5.4 averaged 11k and solved everything. Spending more compute on reasoning didn't help here. If anything, the models that overthought it placed too many components and couldn't converge.

Solution efficiency varies significantly

Winning solutions used 4 to 9 components. gpt-5.4 averaged 7, placed with almost no revisions. Some messier runs that eventually passed had twice as many elements placed and removed along the way. Fewer components usually meant the model knew what it was doing.

Board state calls reveal uncertainty

In its failed run, gpt-5-mini called get_board_state() 25 times. It kept re-reading the board without getting any closer to a solution. Models that solved cleanly called it once or not at all. Knowing you're lost is useful. Re-reading the same state 25 times is not.

First-turn latency tells a story

Every model that solved the puzzle placed its first component on turn 1. Models that spent two or three turns thinking before doing anything never recovered. Planning without testing is just guessing at a slower pace.

Tracing a Winning Run: gpt-5 on Level 8

This is a successful gpt-5 run, turn by turn. Bar height is the number of tool calls in that turn. Color shows what kind of action.

[Interactive chart: Turn-by-turn replay · gpt-5 · Level 8 · Run 1 · Solved in 11 turns]

Legend: placement action · link / config · test (passed) · test (failed) · submit

What this study can't tell you

Any honest research project has a list of things it didn't get to do, couldn't control for, or chose not to pursue. Here's mine.

3 runs per model is not statistically robust

This is the most significant limitation. gpt-5-mini solved 1/3 runs — but whether that's a 33% solve rate or a fluke is genuinely unclear. o4-mini solved 2/3, which looks like partial success, but a fourth run could have gone either way. The budget allowed 3 runs per model and no more. Results should be read as directional, not definitive. The models that solved 0/3 or 3/3 are cleaner findings; anything in between is uncertain.
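To put a number on "directional, not definitive", a confidence interval for a solve rate estimated from three runs can be computed directly. The sketch below uses the Wilson score interval at roughly 95% confidence; that choice of interval is mine, not something the study used, and it treats runs as independent trials.

```python
import math


def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


for solved in range(4):
    low, high = wilson_interval(solved, 3)
    print(f"{solved}/3 solved: plausible solve rate {low:.2f} to {high:.2f}")
```

Under these assumptions a 1/3 result is compatible with true solve rates from roughly 6% to 79%, and even a clean 3/3 only bounds the rate above roughly 44%, which is why the in-between results in particular should be read cautiously.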

One level, one puzzle topology

Every finding here is from Level 8, "Cherry" — an 11×10 grid with 3 threads. The game has six levels with different grid sizes, thread counts, and coordination mechanics. It's possible gpt-5.4-nano is particularly suited to this specific topology. It's possible o3-mini would perform better on a simpler 2-thread level, or catastrophically worse on a 4-thread one. Without running multiple levels, the generalizability of the capability ranking is unknown.

Uncontrolled confounds

Context window sizes differ between models. Training data overlap with concurrency or grid-based puzzles is unknown. API rate limiting affected wall-clock time for some runs. The o-series models (o3, o3-mini) are also built and tuned for extended reasoning in ways that differ from gpt-5.4, and since those internals aren't public, that difference alone might explain some behavioral differences that have nothing to do with reasoning quality on this task.

Task specificity — this is one domain

Concurrency coordination is a narrow domain. Whether performance here predicts performance in other agentic contexts — customer support automation, code generation pipelines, research assistants — is an open question. A model that fails at semaphore placement might be excellent at multi-step document analysis. The findings describe how these models handle this specific type of problem, not agentic capability in general.

The cognitive strategy taxonomy is one person's interpretation

The four patterns — Systematic Planners, Adaptive Explorers, Overcomputers, Random Walkers — came from reading thinking logs across 34 runs. No second coder validated the categories. No inter-rater reliability was calculated. The taxonomy is useful for structuring what I observed, but it shouldn't be treated as an established classification. Different readers might carve the data differently.

Lost data from infrastructure errors

Three gpt-4o runs crashed with an "initialContent not defined" error and produced no usable data. Those runs were not retried due to cost. The gpt-4o results here are from 3 valid runs, not the 6 that were attempted. If the crashed runs had behaved differently, the picture for gpt-4o might look different — though given 0 test passes across all valid runs, it would have to behave very differently to change the overall finding.

None of these limitations make the findings wrong. They define the confidence radius around the findings. "GPT-5 generation models solved this; GPT-4 generation models didn't" is robust even under 3 runs, because the results are 0/9 vs 9/9. "o3-mini underperforms its benchmark ranking" is directionally credible but would need more runs and more levels to claim firmly. That's the honest accounting.

How Models Think

Reasoning models expose their internal monologue through thinking tokens — text the model generates before producing a response. This gave me something closer to a window into strategy than a typical benchmark allows. Reading through the logs, I noticed four rough groupings in how models approached the problem. These aren't validated categories — they're patterns from 34 runs across 8 models, read by one person. With more runs and multiple levels you'd want to test whether they hold. But they're consistent enough within this dataset to be worth describing.

Four patterns in how models approached the problem

Systematic Planners — o3, gpt-5.4-mini

These models trace all thread paths before touching the board. o3's first-turn thinking reads like a graph traversal: "Color 1 starts at [7,2] and moves east... the thread eventually goes west to [0,0], picking up a cell, then continues south down column 0 to [7,0] for an exchange." No placement until the full route is understood. Once placed, almost nothing is removed.

Adaptive Explorers — gpt-5, gpt-5.4, gpt-5.4-nano

These models plan, then test frequently to confirm. gpt-5.4's thinking: "Let's break down the paths... they share the center column and switch at [0,4] and [7,4]." High test-call counts but they converge. The risk is over-fitting to single-threaded timing, which caught gpt-5 in run 1.

Overcomputers — o3-mini, o4-mini

Maximum thinking volume, minimum convergence. o3-mini generated 202,000 thinking tokens per run on average — 97% of its total output — and solved nothing. Its reasoning correctly identifies constraints ("Only T1 can pick up the item") but never synthesizes them into a solution that holds under nondeterministic scheduling. Thinking at volume is not the same as thinking well.

Random Walkers — gpt-4o, gpt-4.1, gpt-4o-mini

No thinking tokens at all. These models produce no internal reasoning trace — they respond directly. The behavioral result is exactly what you'd expect: place a component, test, fail, remove, repeat. gpt-4.1 made 37 combined test calls across three runs and passed zero. Not because it placed wrong things, but because it had no causal model of how placement affects thread timing.

What the thinking excerpts actually say

Clicking between models in the turn replay is informative. But reading the first-turn thinking — before any tool is called — is more revealing. This is the moment where strategy is formed. Below are verbatim excerpts from each reasoning model's initial response.

Three failure modes worth naming

Timing hallucination — gpt-5 run 1

gpt-5 passed 34 of 36 single-threaded tests in run 1 and submitted. The solution failed. What happened: the model reasoned about a specific tick-by-tick sequence and built a solution that worked for that sequence. It didn't account for nondeterministic thread interleaving, where any thread can advance at any relative rate. A 94% single-thread pass rate gave it false confidence. The verification suite saw through it in milliseconds.
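A quick calculation shows why a 94% pass rate on individual tests says almost nothing about an all-or-nothing check. The sketch assumes each of the 10,001 verification schedules passes independently with the same probability, which is a simplification (real schedules are not independent draws against a fixed solution), and `chance_all_pass` is a hypothetical helper.

```python
def chance_all_pass(per_schedule_pass_rate, schedules=10_001):
    """Probability that every schedule passes, assuming independence."""
    return per_schedule_pass_rate ** schedules


observed = 34 / 36  # single-test pass rate from gpt-5 run 1
for rate in (observed, 0.999, 0.9999):
    print(f"{rate:.4f} per schedule -> {chance_all_pass(rate):.3g} for all 10,001")
```

Under that assumption, even a solution that passes 99.99% of schedules clears full verification only about a third of the time, so a solution that is right "most of the time" effectively never passes.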

Premature confidence — o4-mini run 1

After placing 4 elements and running one test, o4-mini submitted 17 times in a row. 13 of those submits were blocked by the gating rule (you must pass a test before each submit). The model either didn't read or didn't internalize the constraint. Its thinking describes the solution confidently — "I'll use semaphores to ensure arrow1 arrives before arrow2" — but it never verifies whether the placement actually produces that behavior.

Causal blindness — gpt-4.1

gpt-4.1 generates detailed output about which semaphore to place and where. The reasoning reads coherently. But it never passes a test — not in run 1, run 2, or run 3. 37 combined test calls, zero passes. The model can describe a semaphore's purpose without being able to simulate how placing one at a specific coordinate actually changes thread timing. It's the difference between knowing what a lock does and knowing where to put it.

The nano surprise

gpt-5.4-nano was the most surprising result in the dataset. It's the smallest and cheapest model in the gpt-5.4 family. It averaged 293,000 input tokens per run — about a fifth of what o3-mini consumed — and solved all three runs. Its thinking excerpts are short and direct: "I need to make a tool call right away, so I'll start by inspecting the board... I want to ensure that I get a clear understanding of the current situation before proceeding." Then it acts.

One run took 5 turns and 86,000 input tokens total. That's the second most efficient solve in the dataset. In run 3 it placed 6 elements, removed none, and submitted first try. The thinking per turn averaged 3,400 characters — compared to gpt-5.4-mini's 7,000+ — but the solutions were equally correct.

What nano demonstrates is that thinking at length doesn't make a solution more robust. The 10,001-simulation verification is indifferent to how much a model deliberated. Either the semaphores are in the right places or they're not. nano found the right places faster than almost every other model, with a fraction of the compute.

Thinking volume vs thinking quality

o3-mini: most thinking, zero solves

Avg thinking tokens / run: 202,000
Thinking as % of output: 97%
Avg elements removed / run: 6 (high churn)
Avg wall-clock time / run: 28 minutes
Solve rate: 0 / 3

gpt-5.4-nano: least thinking, 100% solve rate

Avg thinking tokens / run: 19,000
Thinking as % of output: 85%
Avg elements removed / run: 2 (low churn)
Avg wall-clock time / run: 1.5 minutes
Solve rate: 3 / 3

The most efficient single solve in the entire dataset was gpt-5.4-mini run 2: 5 turns, 80,000 input tokens, first-submit success. That run had the longest average thinking per turn of any model (close to 10,000 characters per turn), but it thought deeply for two turns and then stopped. It didn't keep thinking. It acted, tested once, and submitted. The thinking was dense, not voluminous.

o3-mini's problem wasn't that it thought too long in any one turn — its per-turn thinking averaged around 2,000 characters, similar to other models. The problem was that it thought for 38 or 39 turns out of 40, every run, without the thinking leading to better placements. It was generating analysis without updating its strategy.

What This Tells Us About OpenAI's Model Lineup

The gpt-5 generation doesn't just score better on this puzzle. It works differently. It acts faster, tests more cleanly, and places fewer components to get there. That's hard to see in a leaderboard score. It's hard to miss in a 40-turn interaction log.

If you're picking a model for an agentic pipeline

Don't assume o-series means better. On coordination tasks, gpt-5.4-nano outperformed o3-mini in both solve rate and cost. o3-mini consumed roughly five times the input tokens and about ten times the thinking tokens, and solved nothing. Run your own task-specific evaluation before paying for extended reasoning, because it may not be doing what you think it's doing.

If you're evaluating models for product decisions

MMLU and benchmark scores tell you roughly where a model sits in the hierarchy, but they won't predict tool-use behavior under feedback. gpt-4o scores ~88% on MMLU and passed zero tests across its three valid runs here. The gap between "knows the answer" and "executes a multi-step solution correctly" is real, and standard benchmarks don't measure it.

If you're building evaluation frameworks

A closed environment, deterministic success criteria, and full interaction logging is a replicable pattern. This isn't specific to concurrency games — any domain where a model takes sequential actions, gets feedback, and must iterate to a verifiable goal works. The method costs money per run, which is a real constraint, but the behavioral data you get back justifies it for high-stakes model selection.

The broader finding

Thinking token volume is not a reliable proxy for reasoning quality on coordination tasks. The model that thought the least per run won the most. The model that thought the most lost every time. This doesn't mean extended reasoning is useless — o3 solved 2/3 efficiently — but it does mean the relationship between token spend and outcome quality is task-dependent in ways that aren't obvious from benchmark scores alone.