Four signals, one week
Last week I wrote about the supply side of AI's physical reality — Google couldn't sell Meta all the compute it wanted to buy. This week, the demand side showed its hand, four times in seven days:
- The 20VC × SaaStr crew reported that companies quintupled their token spend in the first half of the year — and named the resulting squeeze what it is: a token-ROI crisis.
- Anthropic entered talks with Samsung to build a custom AI chip — about a week after OpenAI announced its own custom silicon partnership with Broadcom. Both frontier labs, same move, same month.
- A cheap Chinese model from Zhipu landed fourth on one of the industry's most closely watched intelligence rankings.
- Anthropic marketed its newest Sonnet release not on benchmark supremacy but as “better at the tasks that are running up enterprise bills.”
Any one of these is a Tuesday. Together they're a regime change.
The axis flipped
For three years, the AI race had one scoreboard: capability. Whoever shipped the smartest model won the week, the funding round, and the enterprise deal. Cost was a footnote — something you'd optimize later, once the magic was proven.
“Later” arrived. When your token spend quintuples in six months, the question in the budget meeting stops being “which model is smartest?” and becomes “what does a unit of useful work cost, and why is ours so expensive?” Every player in the stack is now answering that question. The labs are going vertical into silicon — the same playbook hyperscalers ran with TPUs and Graviton — because owning the chip is the only durable way to cut the cost floor. Challengers are attacking from below with models that are 80% as capable at a fraction of the price. And model marketing now leads with your cloud bill.
The scoreboard changed: not the smartest model — the cheapest useful unit of work.
Why this was inevitable (and why it's healthy)
Two scissors blades cut here. Last week's essay covered the first: physical supply — chips, power, cooling — can't expand fast enough, so capacity is rationed even at the top of the market. The second blade is this week's: demand grew into real production workloads with real invoices. Agents that run for hours burn tokens in a way chatbots never did.
When supply is capped and demand is exploding, price discovery gets honest. That's not a crisis for the industry; it's maturity. Every foundational technology went through this: mainframes to commodity servers, proprietary Unix to Linux, on-prem to cloud and then partially back once the bills arrived. The capability race built the magic. The cost race is how the magic becomes infrastructure.
What to actually do about it
If you build with AI, three moves follow directly — and they compound with the three from last week:
1. Make cost-per-task a first-class metric. Not tokens per month, but cost per completed unit of user value. If you can't answer “what does one resolved support ticket / generated report / closed task cost us in inference,” you're flying blind into a repricing.
2. Design for portability, not loyalty. The price-performance leader will change repeatedly over the next 18 months, with labs cutting costs through custom silicon and challengers undercutting from below. Route through an abstraction, keep evals model-agnostic, and make switching a config change, not a rewrite.
3. Right-size relentlessly. The frontier model should be your escalation path, not your default. Most production tasks clear the bar on models that cost a tenth as much; the fourth-place model at a fraction of the price is precisely the point of that ranking.
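To make those three moves concrete, here is a minimal Python sketch, not a reference implementation. Everything in it is an assumption for illustration: the tier names, the per-1k-token prices, and the `fake_call_model` / `fake_passes_eval` stubs are hypothetical stand-ins for whatever provider client and eval you actually run. It tries the cheapest tier first, escalates only when the eval fails, and reports cost per completed task rather than tokens per month.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ModelTier:
    name: str                 # hypothetical label, not a real model ID
    usd_per_1k_tokens: float  # illustrative blended price, not a real quote


@dataclass
class Attempt:
    tier: ModelTier
    tokens: int
    passed: bool

    @property
    def cost_usd(self) -> float:
        return self.tokens / 1000 * self.tier.usd_per_1k_tokens


def run_with_escalation(
    task: str,
    tiers: List[ModelTier],  # ordered cheapest first; swapping models is a config change
    call_model: Callable[[ModelTier, str], Tuple[str, int]],  # returns (output, tokens used)
    passes_eval: Callable[[str, str], bool],  # model-agnostic check on (task, output)
) -> List[Attempt]:
    """Try each tier in order and stop at the first output that passes the eval."""
    attempts: List[Attempt] = []
    for tier in tiers:
        output, tokens = call_model(tier, task)
        ok = passes_eval(task, output)
        attempts.append(Attempt(tier, tokens, ok))
        if ok:
            break
    return attempts


def cost_per_completed_task(runs: List[List[Attempt]]) -> Optional[float]:
    """Total spend, failed attempts included, divided by tasks that ended in a pass."""
    total = sum(a.cost_usd for attempts in runs for a in attempts)
    completed = sum(1 for attempts in runs if attempts and attempts[-1].passed)
    return total / completed if completed else None


# --- Stubs so the sketch runs without any provider SDK (assumptions only) ---
def fake_call_model(tier: ModelTier, task: str) -> Tuple[str, int]:
    # Pretend the small tier handles "easy" tasks and only the frontier handles the rest.
    solved = task == "easy" or tier.name == "frontier"
    return ("ok" if solved else "bad", 2000)


def fake_passes_eval(task: str, output: str) -> bool:
    return output == "ok"


TIERS = [ModelTier("small", 0.001), ModelTier("frontier", 0.01)]
TASKS = ["easy", "easy", "easy", "hard"]

routed = [run_with_escalation(t, TIERS, fake_call_model, fake_passes_eval) for t in TASKS]
frontier_only = [run_with_escalation(t, TIERS[1:], fake_call_model, fake_passes_eval) for t in TASKS]

print("routed cost per completed task:", cost_per_completed_task(routed))
print("frontier-only cost per completed task:", cost_per_completed_task(frontier_only))
```

The design choice worth noting is that failed cheap attempts still count toward total spend, so the metric stays honest about what escalation costs. If most tasks start failing on the small tier, the number will show that routing has stopped paying for itself.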
The bottom line
The compute crunch capped the supply of intelligence; the token-ROI crisis is repricing the demand for it. In between sits every product team building on AI. The winners of the capability era were whoever demoed the most magic. The winners of the cost era will be whoever delivers the most useful work per dollar — and can prove it.
Signal over noise: stop asking which model is smartest. Start asking what your unit of work costs — because your competitors just did.