
Give any solver enough evaluations and the question of which one you picked stops mattering. The interesting regime is the opposite one — where the budget is tight enough that the method, not the math, decides the outcome.
Combinatorial optimization benchmarks usually report the best solution found given ample evaluations. That is the right axis for a solver that runs offline against a cheap objective. It is the wrong axis for a loop where each evaluation is a radar dwell, a shot on shared quantum hardware, a destructive test, or a power-flow solve — and where an answer is due on a deadline that does not grow just because the problem did. There, you do not get ample evaluations. You get a starved budget.
Simulated annealing is the right tool across a huge range of problems, and on equal-measurement footing it is hard to beat. But it is sequential: it proposes a move, evaluates it, accepts or rejects it, and repeats. Under a budget of roughly one evaluation per variable, that loop cannot accumulate enough accepted moves to climb. Its solution quality goes flat with dimension, stuck near a floor, because the budget never buys more than about one proposed move per variable, however large the problem gets. It is not mistuned. It has run out of clock.
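To make the budget arithmetic concrete, here is a minimal sketch of the propose, evaluate, accept-or-reject loop with a hard cap on evaluations. Everything named here is an assumption for illustration: `anneal_with_budget`, the toy `ring_energy` objective, the single-flip move, and the geometric cooling schedule are not the benchmark's actual setup. The point is only that a budget of `n` evaluations allows at most `n - 1` proposed flips.

```python
import math
import random


def anneal_with_budget(energy, n, budget, t_start=2.0, t_end=0.05, seed=0):
    """Single-flip simulated annealing that stops after `budget` evaluations.

    `energy` is a hypothetical callable mapping a list of +/-1 spins to a
    number (lower is better). Every call to it counts against the budget.
    """
    rng = random.Random(seed)
    state = [rng.choice((-1, 1)) for _ in range(n)]
    current = energy(state)              # the starting point costs one evaluation
    used = 1
    best_state, best = list(state), current
    accepted = 0

    while used < budget:
        # Geometric cooling, indexed by the fraction of budget consumed (assumed schedule).
        frac = used / max(budget - 1, 1)
        temp = t_start * (t_end / t_start) ** frac

        i = rng.randrange(n)             # propose: flip one variable
        state[i] = -state[i]
        candidate = energy(state)        # evaluate: costs one unit of budget
        used += 1

        delta = candidate - current
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate          # accept the move
            accepted += 1
            if current < best:
                best_state, best = list(state), current
        else:
            state[i] = -state[i]         # reject: undo the flip

    return best_state, best, used, accepted


def ring_energy(spins):
    """Toy coupled objective (assumption): nearest-neighbour couplings on a ring."""
    return -sum(spins[i] * spins[(i + 1) % len(spins)] for i in range(len(spins)))


n = 512
_, best, used, accepted = anneal_with_budget(ring_energy, n, budget=n)
print(f"n={n} evaluations={used} accepted={accepted} best_energy={best}")
```

With `budget=n`, the loop proposes `n - 1` flips in total, so on average each variable is touched less than once, and only a fraction of those proposals are accepted.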
The runtime spends the same starved budget differently: as many cheap parallel-sampling rounds, each updating the whole configuration at once. Its quality climbs with dimension rather than flattening. So as the problem grows, the two curves cross. In our benchmarks the crossover sits around 256 to 512 variables on coupled problems, and the margin widens past it: the gap is statistically significant, grows with size, and is measured on fresh seeds with bootstrap confidence intervals.
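For readers who want to check a margin like this on their own runs, a percentile bootstrap over per-seed score differences is one common way to get such an interval. The sketch below is an assumption about how that could be done, not a description of the benchmark's actual procedure. The function name, the pairing of scores by seed, and the defaults are all hypothetical.

```python
import random
import statistics


def bootstrap_ci_mean_diff(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of (scores_a[k] - scores_b[k]).

    Assumes paired scores: index k in both lists comes from the same seed
    and problem instance. Returns (mean_difference, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")

    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return statistics.fmean(diffs), lower, upper
```

Run once per problem size, with one method's per-seed scores as `scores_a` and the other's as `scores_b`. An interval that excludes zero at a given size is evidence of a real gap there, and a crossover shows up as the interval moving from one side of zero to the other as size grows.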
This is a regime result, and the regime is narrow on purpose. The advantage holds when the budget is genuinely starved: one to two evaluations per variable. Give the annealer five per variable and it wins again; it is the right tool there, and we will tell you so. The absolute quality under starvation is modest for everyone, because one evaluation per variable is hard, so the claim is graceful degradation, not that the problem is solved. And the advantage is confined to the large-n corner: below the crossover, sequential search wins.
What makes it credible is exactly that it is bounded. A method that claimed to beat annealing everywhere would be making a claim no honest benchmark supports. This one beats it in a specific, stateable place — the place a lot of real instruments actually operate — and concedes the rest.