From 2020 to 2024, the dominant story in AI was scaling laws: bigger models trained on more data with more compute produced predictably better results. That story shifted in late 2024 when OpenAI released o1, a model that performed dramatically better on math, code, and reasoning benchmarks not because it was bigger, but because it generated extensive chain-of-thought reasoning before producing answers.

This was test-time compute scaling: spending more compute per query to get better outputs from the same underlying model weights. The o3 and o3-pro models extended this further, sometimes generating thousands or tens of thousands of internal reasoning tokens before responding. DeepSeek R1, released with open weights in early 2025, demonstrated that the technique generalized beyond OpenAI. Anthropic's Claude with extended thinking and Google's Gemini reasoning modes followed similar patterns.

The mechanism is reinforcement learning on reasoning traces: models are trained to generate reasoning that leads to verified-correct answers, with rewards shaping not just the final answer but the structure of the reasoning chain. Test-time compute has its own emerging scaling laws: performance on hard reasoning tasks improves with compute spent thinking, and the returns continue further than those of dense pretraining scaling. This has reshaped the economic structure of AI. Instead of one massive training run followed by cheap inference, providers now spend significant compute per query, particularly for premium reasoning queries.

The implications go beyond benchmarks. For tasks where correctness can be verified — math, code, formal logic — reasoning models approach or exceed expert human performance. For tasks where correctness is harder to define, reasoning models help less. Understanding when reasoning models add value, and when they're over-engineered for the task, is now a core skill for AI practitioners.
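One intuition for why verifiability matters so much can be shown with a toy simulation — this is not any provider's actual method, just a best-of-N sketch under the assumption that each independent attempt succeeds with some fixed probability and a verifier can check answers. The names `solve_once` and `best_of_n` are illustrative, not from any library:

```python
import random

def solve_once(p_correct: float) -> bool:
    """Simulate one model attempt: verified correct with probability p_correct."""
    return random.random() < p_correct

def best_of_n(p_correct: float, n: int, trials: int = 10_000) -> float:
    """Estimate P(at least one of n attempts passes the verifier)."""
    hits = 0
    for _ in range(trials):
        if any(solve_once(p_correct) for _ in range(n)):
            hits += 1
    return hits / trials

random.seed(0)
# Spending 64x the per-query compute turns a 10% solver
# into a near-certain one -- but only if answers can be verified.
for n in (1, 4, 16, 64):
    print(f"n={n:3d}  success rate ~ {best_of_n(0.1, n):.2f}")
```

Analytically the success rate is 1 − (1 − p)^n, which is why returns on extra attempts keep compounding on hard-but-checkable tasks and vanish on tasks with no verifier to pick the winner.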
Advanced · AI & ML · Learning · Knowledge
Why Test-Time Compute Is the New Scaling Frontier
For years, AI capability scaled with model size and training data. In 2024 those returns started slowing. The new scaling axis is test-time compute: letting models think longer at inference time. Reasoning models like o1, o3, and DeepSeek R1 show that thinking time can substitute for raw model size on hard problems.
test-time-compute · reasoning-models · scaling-laws · innovation