Every team running AI in production eventually hits the same question: which model do we use for this? The fast, efficient one — or the heavyweight that thinks for thirty seconds before answering?
The standard answer is to route by task complexity. Simple task, fast model. Hard task, thinking model. It sounds reasonable, and it's how most teams operate today. It's also wrong — or at least, it's only half the picture.
I've spent the past two years building an AI-native company and embedding AI into every marketing, sales, and product workflow we run — ours and our clients'. The pattern that emerged surprised me: the tasks that needed the expensive thinking models weren't reliably the hard ones. They were the underspecified ones.
You're not paying for task difficulty. You're paying for unwritten context.
The Two Axes
Here's the rule of thumb I've landed on. The more detailed and direct your prompt, the more you can use a fast model — it will just go and execute. The broader and more nuanced your prompt, even when it's intentionally open-ended, the more a high-thinking model earns its cost.
What's really happening is a question of where the reasoning lives. A detailed prompt means you've already done the decomposition: you've decided the approach, the constraints, what "good" looks like. All the model has to do is execute — and execution is what fast models are built for. A broad prompt means the model has to do that decomposition itself: figure out what you meant, which tradeoffs matter, what the success criteria are. That interpretive work is exactly what extended thinking exists to do.
So there are two axes, not one. The first is the specification gap — how much interpretation you've left to the model. The second is intrinsic difficulty — how hard the task is even when fully specified. A precisely specified, genuinely gnarly problem still deserves a thinking model. But most teams dramatically underrate the first axis. They pick models based on how hard the task feels, rather than on how much of the thinking they've already done themselves.
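To make the two axes concrete, here is a minimal sketch of a router that looks at both. Everything in it is an assumption for illustration: the 0-to-1 scores, the thresholds, and the model labels are hypothetical, and in practice the scores would come from your own judgment or a rubric, not from a formula.

```python
from dataclasses import dataclass

FAST = "fast-model"          # hypothetical label, not a real model ID
THINKING = "thinking-model"  # hypothetical label, not a real model ID


@dataclass
class Task:
    spec_gap: float    # 0.0 = fully specified, 1.0 = nearly everything left to interpret
    difficulty: float  # 0.0 = trivial once specified, 1.0 = hard even when fully specified


def choose_model(task: Task, gap_threshold: float = 0.5,
                 difficulty_threshold: float = 0.7) -> str:
    """Return the thinking model if either axis crosses its (assumed) threshold."""
    if task.spec_gap >= gap_threshold or task.difficulty >= difficulty_threshold:
        return THINKING
    return FAST
```

The point of the sketch is the `or`: a tight spec does not rescue a genuinely hard problem, and an easy task still needs the heavier model if you left most of the interpretation unwritten.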
The Unwritten-Context Tax
Once you see model selection this way, the economics flip. Every time you hand a vague brief to a frontier model, you're renting expensive reasoning to reconstruct context that already exists somewhere in your company — in someone's head, in a doc nobody linked, in the way your best rep explains the product. The model burns tokens inferring your strategy, your voice, your customer, your constraints. Then the conversation ends, and tomorrow you pay the same tax again.
The expensive model is, in a sense, a tax on unwritten context. And unlike most taxes, this one is optional.
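A rough way to see why the tax is optional is to compute a break-even point. The sketch below assumes three numbers you would have to estimate yourself: a one-time cost to write the context down, the per-run cost of the thinking model on a vague brief, and the per-run cost of a fast model on a specified one. The function name and the cost units are hypothetical.

```python
def break_even_runs(capture_cost: int, thinking_cost_per_run: int,
                    fast_cost_per_run: int) -> int:
    """Smallest number of runs at which capturing context once is cheaper in total.

    All costs are whole numbers in the same arbitrary unit (for example, cents).
    """
    saving_per_run = thinking_cost_per_run - fast_cost_per_run
    if saving_per_run <= 0:
        raise ValueError("the fast path must be cheaper per run for capture to pay off")
    # We need capture_cost + n * fast < n * thinking, i.e. n > capture_cost / saving.
    return capture_cost // saving_per_run + 1
```

With illustrative inputs of 2000 units to capture, 50 per thinking run, and 10 per fast run, the function returns 51: from the 51st run onward, the written-down version is strictly cheaper. These figures are made up to show the shape of the calculation, not measured costs.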
The teams getting the most out of AI right now aren't the ones with the biggest model budgets. They're the ones doing what Anthropic's engineers call context engineering: treating the model's input as the primary lever, and curating the smallest set of high-signal context that lets the model do the job. Most teams optimize the output side instead: ad hoc prompt tweaks on the day, bigger models, more retries. The compounding returns are on the input side.
Three Moves That Shift the Lever
1. Write down what you already know
The cheapest upgrade available to any team is specification. We run an agent that enriches our development backlog — reads raw tickets, explores the codebase, writes structured requirements. We launched it entirely on a fast model. Not because the work is simple, but because the ticket template carries the judgment: what a good ticket contains, how to classify it, what to check. The only piece we earmarked for a heavier model was the genuinely ambiguous part — deciding which tickets belong together. Difficulty earned the upgrade there. Everywhere else, the spec had already done the thinking.
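As a sketch of what "the template carries the judgment" can look like, the snippet below pairs a written-down template with a per-step model choice. The template wording, step names, and model labels are all hypothetical stand-ins, not our production setup.

```python
# Hypothetical template: the headings mirror the three things a spec should
# settle in advance (what a good ticket contains, how to classify it, what to check).
TICKET_TEMPLATE = """You are enriching a development ticket.

A good ticket contains: a summary, acceptance criteria, and affected components.
Classify it as exactly one of: bug, feature, chore.
Before answering, check: is it reproducible, and is it a duplicate?

Raw ticket:
{raw_ticket}
"""

# Assumed routing: only the ambiguous grouping step is earmarked for the heavier model.
STEP_MODELS = {
    "enrich": "fast-model",      # hypothetical label
    "group": "thinking-model",   # hypothetical label
}


def build_enrichment_prompt(raw_ticket: str) -> str:
    """Fill the template so the fast model only has to execute."""
    return TICKET_TEMPLATE.format(raw_ticket=raw_ticket.strip())


def model_for_step(step: str) -> str:
    """Look up which model tier a pipeline step runs on."""
    return STEP_MODELS[step]
```

The model choice lives in a lookup table, not in a per-ticket judgment call, which is the practical consequence of doing the thinking in the spec.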
2. Build Claude skills once — then run the daily work on a fast model
This is where Claude skills change the math. A skill is a structured document that encodes voice, strategy, decision rules, and process — context the model loads when it's relevant. Building a good one is genuinely hard work: it requires synthesizing how you think, what you'd never do, what good looks like. That's a job for a high-thinking model, and it's worth every token.
But here's the part most teams miss: once the skill exists, the daily work it governs no longer needs that model. For marketing teams, this is the difference between re-prompting Claude from scratch every time and building a skill that encodes your positioning, your voice, and your customers' actual language once. The Claude Sonnet vs Opus question stops being a per-task judgment call: Opus is the architect that builds the skill; Sonnet is the operator that runs it. We now use the thinking models for three things at skill-creation time: writing the skill itself, assessing which model the skill should run on day-to-day, and flagging where the output deserves more than text, such as a diagram, a visual, or a structured artifact. The expensive model stops being the hourly contractor.
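The architect/operator split can be sketched as a data structure where the model tier is decided once and stored with the skill. This is an illustration only: the `Skill` class, its fields, and the model label are hypothetical, and it does not reproduce the actual file format Claude skills use.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    voice: str
    decision_rules: list[str]
    run_on: str  # model tier chosen once, at skill-creation time

    def render(self, task: str) -> str:
        """Combine the encoded context with today's task into one prompt."""
        rules = "\n".join(f"- {rule}" for rule in self.decision_rules)
        return (
            f"Skill: {self.name}\n"
            f"Voice: {self.voice}\n"
            f"Rules:\n{rules}\n\n"
            f"Task: {task}"
        )


def plan_call(skill: Skill, task: str) -> tuple[str, str]:
    """Return (model, prompt) for a daily task; no per-task model decision is made here."""
    return skill.run_on, skill.render(task)
```

Daily callers never pick a model. They pass a task, and the tier the architect chose comes back alongside the prompt.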
3. Structure context where the model can pull it
Skills handle judgment and process. But the deepest context a company has isn't process — it's knowledge. What your customers actually said. Why your best deals closed. The exact language a buyer used when they described the problem your product team has never named. No prompt template carries that, and pasting transcripts into a chat window doesn't scale.
This is the bet we've made with Proofmap. The proof of record is stakeholder intelligence captured on the record — intentional interviews, on video, dual-consent approved — and structured so AI tools can query it in real time over MCP. The result is that work as nuanced as product positioning — the work content marketers and product marketers are actually judged on — runs on teed-up context instead of demanding minutes of frontier-model inference that still comes back generic. The model isn't guessing how your stakeholders see the world. It's reading what they said. Fewer tokens in, and output you can trace back to a real person.
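To show what "output you can trace back to a real person" implies structurally, here is a small in-memory sketch. It is a stand-in, not an MCP implementation and not Proofmap's schema: the `Statement` fields, the keyword match, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    speaker: str       # who said it
    interview_id: str  # where it was said, so the quote stays traceable
    text: str          # what they actually said


def find_statements(records: list[Statement], topic: str) -> list[Statement]:
    """Naive keyword lookup; a real system would use a proper query layer."""
    needle = topic.lower()
    return [s for s in records if needle in s.text.lower()]


def as_context(statements: list[Statement]) -> str:
    """Format matches as quoted lines with attribution, ready to hand to a model."""
    return "\n".join(
        f'"{s.text}" ({s.speaker}, {s.interview_id})' for s in statements
    )
```

What matters in the sketch is that attribution travels with every quote, so whatever the model writes from this context can be checked against its source.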
This is the same architectural distinction we've written about as the difference between generic AI and Proof-Native AI. It's also the real answer to a question every content marketer has typed into a search bar: why does AI content sound so generic? Not because the model is weak — because it was given nothing of yours to reason from. Generic AI is proof-optional: it runs on the internet's general knowledge plus whatever you remembered to paste in, and it produces output indistinguishable from every competitor using the same tools. Proof-Native AI is built on the proof of record. Generic AI accelerates noise. Proof-Native AI accelerates what's real.
Input-Side Leverage
Call the whole pattern input-side leverage. Every hour of structure you build upstream — a template, a skill, a structured proof of record — is reasoning you never pay for again. And unlike a model upgrade, it compounds: the context layer gets richer with every interview captured, every decision rule encoded, every workflow specified.
There's a strategic kicker, too. Model capabilities are converging — every company will have access to roughly the same intelligence on the output side. Your prompts can be copied. Your model choice can be matched. The input layer is the part competitors can't replicate, because it's made of your context: your stakeholders, your decisions, your proof. The moat was never the model.
The gap is rarely in the model's intelligence. It's in what you haven't written down, and that gap gets more expensive with every single prompt.
So before you reach for the bigger model, audit your inputs. Where is your team renting reasoning to reconstruct context that already exists? What would it take to capture it once — structured, on the record, where every model you'll ever use can find it?
That's not a model-selection question anymore. That's an architecture decision. And it's the one that compounds.

