The FinOps of thinking models: why Gemini Pro is actually 10x the price of Opus
Token prices are falling, yet AI budgets are exploding. The culprit? "Thinking tokens." Our FinOps audit shows how Gemini 3.1 Pro can cost 676x more than Ministral for the same result. Stop paying for a model's internal monologue and learn to build smarter, cheaper architectures.
In the world of LLMs, token prices are no longer a simple race to the bottom. They vary wildly, and if you aren’t paying attention to the fine print, your API bill can easily balloon by a factor of 1,000 between the leanest and the most bloated models. It’s a common misconception that higher prices always equate to a "smarter" model; in reality, cost is often a byproduct of architectural choices—specifically the rise of "reasoning" or "thinking" tokens—that may or may not add value to your specific use case.
The traditional pricing formula used to be simple: Price = Input Tokens + Output Tokens. But with the advent of thinking models, that second half has bifurcated. Now, Output Tokens = Thinking Tokens + Reply Tokens. You aren’t just paying for the answer the model gives you; you’re paying for the internal monologue it generates to get there.
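The billing arithmetic can be sketched as follows (the per-million-token prices are illustrative assumptions, not any provider's actual rate card):

```python
def request_cost(input_tokens, thinking_tokens, reply_tokens,
                 input_price_per_m, output_price_per_m):
    """Cost of one request. Thinking tokens are hidden from the user
    but billed at the same rate as reply tokens."""
    output_tokens = thinking_tokens + reply_tokens  # both billed as output
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 200-token reply preceded by 5,000 thinking tokens bills 26x more
# output tokens than the reply alone (5,200 vs. 200).
cost = request_cost(500, 5_000, 200, 1.0, 10.0)
```

Notice that the input side barely moves; the thinking tokens dominate the bill even though the user never sees them.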
The "Gollum Voice" Economy
What exactly is "thinking" in an LLM? Imagine a model talking to itself in a Gollum-like internal dialogue, shifting points of view and debating its own logic before committing to a final response. The trend exploded because it delivered a marginal increase in benchmark scores, allowing providers to claim "SOTA" (state-of-the-art) status on paper. This internal chatter is hidden from the user but fully billable, turning a simple query into a recursive, token-heavy marathon.
The industry is already showing signs of "reasoning fatigue." Many major players are beginning to treat extended thinking as a bug rather than a feature for general use. We see this in the "Answer Now" button recently added to Gemini Pro, the choice between thinking and non-thinking variants of Qwen 3.5, and the sophisticated routers in GPT-5 that attempt to bypass deep reasoning for simple tasks. These features exist because models serve two contradictory masters: they must win benchmarks, yet the extended reasoning that wins benchmarks often destroys real-life utility by introducing massive latency and cost.
Measuring the "Vibe" vs. the Wallet
To move beyond theory, we ran a controlled experiment using an n8n-based application. We used OpenRouter as our unified LLM gateway and Sentry for detailed telemetry, allowing us to log traces and extract the exact cost per request.
The test case was a domain-name evaluation tool. It took a list of domain names and target personas to determine if "Persona X" would likely visit "Domain Y." The results were eye-opening:
| Model | TPS (tok/s) | Time | Reasoning Tokens | Output Cost | Cost vs. Cheapest |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 12.01 | 2.88s | 0 | $0.000962 | 44x |
| Gemini 3.1 Flash Lite | 41.45 | 1.41s | 0 | $0.000092 | 4x |
| Gemini 3.1 Pro (Preview) | 71.47 | 16.78s | 1.1K | $0.014800 | 676x |
| Qwen 3.5 27B | 91.43 | 75.0s | 5.9K | $0.009710 | 443x |
| GPT-oss 120B | 98.77 | 9.29s | 522 | $0.000354 | 16x |
| Ministral 14B-2512 | 102.33 | 0.92s | 0 | $0.000022 | 1x |
A "vibe eval" (aka this author’s non-scientific opinion) of the results showed that the outputs were virtually indistinguishable in quality across the board. Yet, Gemini 3.1 Pro was 676 times more expensive than Ministral 14B for the exact same result. Even the flagship Claude Opus 4.6 was significantly cheaper than Gemini Pro because it didn't feel the need to "think" for 16 seconds about a simple classification task.
Alternatives to Expensive Thinking
Why is internal "thinking" often unnecessary? Because a well-architected application already includes the "thinking pattern" in its algorithm. By breaking a complex problem into a reliable run of hundreds of small, independent prompts, you achieve a higher degree of accuracy than one big, clunky prompt that relies on a model’s "thinking" mode to save its own skin.
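That fan-out can be sketched generically; here `ask_model` is a hypothetical callable standing in for any LLM client (OpenRouter, a vendor SDK, etc.), and the prompt wording is only illustrative:

```python
def evaluate_pairs(ask_model, personas, domains):
    """Fan a classification task out into many small, independent
    prompts instead of one large prompt that leans on 'thinking' mode."""
    results = {}
    for persona in personas:
        for domain in domains:
            # Each prompt is tiny, cheap, and independently retryable.
            prompt = (f"Would the persona '{persona}' likely visit "
                      f"the domain '{domain}'? Answer yes or no.")
            results[(persona, domain)] = ask_model(prompt)
    return results
```

Because every call is independent, the batch can run in parallel, be retried per item, and be routed to the cheapest model that passes your eval.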
One effective strategy to prevent hallucinations without the thinking-token tax is "pre-canned thinking." This is a prompt-engineering trick that simulates a self-correction loop without the recursive overhead:
- User: [Question]
- Model: [Answer]
- User: Are you sure?
- Model (Hard-coded by you): No.
- User: Then provide the correct, verified answer.
- Model: [The actual, improved answer]
By forcing the model into a "self-correction" state via hard-coded interjections, you often get a much higher-quality result for a fraction of the cost of a native reasoning model.
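A minimal sketch of that message scaffold, assuming a generic chat-completions-style message list (the `user`/`assistant` role names follow the common convention; adapt to your client library):

```python
def precanned_thinking_messages(question, first_answer):
    """Build a conversation that hard-codes a self-correction turn,
    so the model revises its answer without native 'thinking' tokens."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": first_answer},  # model's first pass
        {"role": "user", "content": "Are you sure?"},
        {"role": "assistant", "content": "No."},  # hard-coded by you
        {"role": "user", "content": "Then provide the correct, verified answer."},
    ]
```

You send this list back to the model, and its next completion is the improved answer: the self-correction happens at ordinary output-token rates, with no hidden reasoning billed on top.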
This is the same logic behind the success of modern coding agents. They don't just "think" about code; they use live feedback loops—running linters, executing commands, and checking test results in real time. This kind of grounded, external feedback is also the premise of Yann LeCun's new startup, which aims to build the missing pieces of AI's "brain" from real-world data.
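The loop those agents run can be sketched generically; the `generate` and `check` callables below are placeholders for an LLM call and an external check such as a linter or test run:

```python
def feedback_loop(generate, check, max_attempts=3):
    """Generate a candidate, run an external check (linter, tests),
    and feed the failure back in, rather than paying for an internal
    'thinking' monologue."""
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)   # feedback is None on first pass
        ok, feedback = check(candidate)  # real-world signal, not a monologue
        if ok:
            return candidate
    return None  # give up after max_attempts
```

The accuracy comes from the environment (the linter, the test suite) rather than from the model second-guessing itself, so a cheap non-thinking model often suffices.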
Conclusion
The lesson for the FinOps-minded developer is clear: measure the right metrics and pick the model that fits the task, not the one with the loudest marketing. If a model feels too slow, it is almost certainly too expensive.