
Where the Reasoning Lives: The Model-Selection Question Everyone Is Asking Wrong

Teams pick AI models based on how hard the task feels. That's the wrong axis — and it's costing them on every single prompt.

Updated June 14, 2026 · 7 min read · By Andy Stauffer, Founder & CEO, Proofmap
AI · Context Engineering · Proof-Native AI

Every team running AI in production eventually hits the same question: which model do we use for this? The fast, efficient one — or the heavyweight that thinks for thirty seconds before answering?

The standard answer is to route by task complexity. Simple task, fast model. Hard task, thinking model. It sounds reasonable, and it's how most teams operate today. It's also wrong — or at least, it's only half the picture.

I've spent the past two years building an AI-native company and embedding AI into every marketing, sales, and product workflow we run — ours and our clients'. The pattern that emerged surprised me: the tasks that needed the expensive thinking models weren't reliably the hard ones. They were the underspecified ones.

You're not paying for task difficulty. You're paying for unwritten context.

The Two Axes

Here's the rule of thumb I've landed on. The more detailed and direct your prompt, the more safely you can hand it to a fast model, which will simply execute. The broader and more nuanced your prompt, even when it's intentionally open-ended, the more a high-thinking model earns its cost.

What's really happening is a question of where the reasoning lives. A detailed prompt means you've already done the decomposition: you've decided the approach, the constraints, what "good" looks like. All the model has to do is execute — and execution is what fast models are built for. A broad prompt means the model has to do that decomposition itself: figure out what you meant, which tradeoffs matter, what the success criteria are. That interpretive work is exactly what extended thinking exists to do.

So there are two axes, not one. The first is the specification gap — how much interpretation you've left to the model. The second is intrinsic difficulty — how hard the task is even when fully specified. A precisely specified, genuinely gnarly problem still deserves a thinking model. But most teams dramatically underrate the first axis. They pick models based on how hard the task feels, rather than on how much of the thinking they've already done themselves.

[Figure 1: a two-by-two grid. Horizontal axis: specification gap (how much is left to interpret). Vertical axis: intrinsic difficulty. Small gap, low difficulty: fast model; specified and routine, so it just executes. Large gap, low difficulty: thinking model; it has to figure out what you meant before it starts. Small gap, high difficulty: thinking model; difficulty earns it, even with a perfect spec. Large gap, high difficulty: frontier thinking; hard problem, open brief, pay for every token of it. Annotation: structure moves work left.]
Figure 1: Two axes, not one. Most teams route on the vertical axis. The horizontal one is the lever you actually control.
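
As a minimal sketch, the two-axis rule can be written as a small routing function. The 0-to-1 scores, the 0.5 threshold, and the tier names are illustrative assumptions, not measured values; how you score a task on each axis is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A task scored on the two axes (hypothetical scores, 0.0 to 1.0)."""
    spec_gap: float    # how much interpretation is left to the model
    difficulty: float  # how hard the task is even when fully specified


def choose_tier(task: Task, threshold: float = 0.5) -> str:
    """Map a task to a model tier using the two-axis rule of thumb."""
    open_brief = task.spec_gap >= threshold
    hard = task.difficulty >= threshold
    if open_brief and hard:
        return "frontier-thinking"  # hard problem, open brief
    if open_brief or hard:
        return "thinking"           # either axis alone earns the upgrade
    return "fast"                   # specified and routine: just execute
```

In this sketch, `spec_gap` is the only input the team writing the prompt controls: tightening the brief lowers it, while `difficulty` stays where it is.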

The Unwritten-Context Tax

Once you see model selection this way, the economics flip. Every time you hand a vague brief to a frontier model, you're renting expensive reasoning to reconstruct context that already exists somewhere in your company — in someone's head, in a doc nobody linked, in the way your best rep explains the product. The model burns tokens inferring your strategy, your voice, your customer, your constraints. Then the conversation ends, and tomorrow you pay the same tax again.

The expensive model is, in a sense, a tax on unwritten context. And unlike most taxes, this one is optional.

The teams getting the most out of AI right now aren't the ones with the biggest model budgets. They're the ones doing what Anthropic's engineers call context engineering — treating the model's input as the primary lever, and curating the smallest set of high-signal context that lets the model do the job. Most teams optimize the output side: better prompts on the day, bigger models, more retries. The compounding returns are on the input side.

Three Moves That Shift the Lever

1. Write down what you already know

The cheapest upgrade available to any team is specification. We run an agent that enriches our development backlog — reads raw tickets, explores the codebase, writes structured requirements. We launched it entirely on a fast model. Not because the work is simple, but because the ticket template carries the judgment: what a good ticket contains, how to classify it, what to check. The only piece we earmarked for a heavier model was the genuinely ambiguous part — deciding which tickets belong together. Difficulty earned the upgrade there. Everywhere else, the spec had already done the thinking.
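
To make "the template carries the judgment" concrete, here is a minimal sketch of a ticket spec rendered into an explicit brief. The section names, categories, and checks are hypothetical placeholders, not the actual template described above.

```python
from dataclasses import dataclass, field


@dataclass
class TicketSpec:
    """Hypothetical ticket template: the judgment lives here, not in the model."""
    required_sections: list[str] = field(default_factory=lambda: [
        "Summary", "Affected code paths", "Acceptance criteria",
    ])
    categories: list[str] = field(default_factory=lambda: [
        "bug", "feature", "chore",
    ])
    checks: list[str] = field(default_factory=lambda: [
        "Link every claim to a file path in the codebase",
        "Flag missing reproduction steps instead of inventing them",
    ])


def build_prompt(raw_ticket: str, spec: TicketSpec) -> str:
    """Render an explicit brief so a fast model only has to execute it."""
    lines = ["Rewrite the ticket below as structured requirements.", ""]
    lines.append("Include these sections, in order:")
    lines += [f"- {section}" for section in spec.required_sections]
    lines.append("")
    lines.append("Classify as exactly one of: " + ", ".join(spec.categories))
    lines.append("")
    lines.append("Before answering, check:")
    lines += [f"- {check}" for check in spec.checks]
    lines += ["", "Ticket:", raw_ticket.strip()]
    return "\n".join(lines)
```

Every decision a vague brief would leave to the model (what to include, how to classify, what to verify) is stated up front, which is what lets the fast model skip the interpretive step.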

2. Build Claude skills once — then run the daily work on a fast model

This is where Claude skills change the math. A skill is a structured document that encodes voice, strategy, decision rules, and process — context the model loads when it's relevant. Building a good one is genuinely hard work: it requires synthesizing how you think, what you'd never do, what good looks like. That's a job for a high-thinking model, and it's worth every token.

But here's the part most teams miss: once the skill exists, the daily work it governs no longer needs that model. For marketing teams, this is the difference between re-prompting Claude from scratch every time and building a skill that encodes your positioning, your voice, and your customers' actual language once. The Claude Sonnet vs Opus question stops being a per-task judgment call: Opus is the architect that builds the skill, and Sonnet is the operator that runs it. We now use the thinking models for three things at skill-creation time: writing the skill itself, assessing which model the skill should run on day-to-day, and flagging where the output deserves more than text, such as a diagram, a visual, or a structured artifact. The expensive model stops being the hourly contractor.

[Figure 2: two pipelines compared. Renting reasoning (every prompt, forever): vague brief ("write our positioning") → thinking model re-infers your strategy, voice, and customer from scratch → output (plausible, generic). Token meter: high, and the reasoning evaporates after each run. Owning the input layer (structure once, reuse on every prompt): structured context layer (skills, templates, the proof of record, queryable over MCP) → fast model executes against teed-up context → output (grounded in what's real). Token meter: low; the thinking was paid for once and compounds.]
Figure 2: Rent the reasoning on every prompt, or build the input layer once and let it compound.
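
The rent-versus-own tradeoff in Figure 2 comes down to break-even arithmetic. The sketch below uses made-up costs purely for illustration; substitute your own measured token spend.

```python
import math


def break_even_runs(build_cost: float, rented_cost_per_run: float,
                    owned_cost_per_run: float) -> int:
    """Smallest number of runs at which building once is strictly cheaper.

    All costs share one arbitrary unit, e.g. dollars of token spend.
    """
    saving_per_run = rented_cost_per_run - owned_cost_per_run
    if saving_per_run <= 0:
        raise ValueError("owning must be cheaper per run to ever break even")
    # Smallest n with: build_cost + n * owned < n * rented
    return math.floor(build_cost / saving_per_run) + 1


# Hypothetical numbers: a one-time 20.00 skill build, 0.50 per vague-brief run
# on a thinking model, 0.05 per run on a fast model with the skill loaded.
runs = break_even_runs(build_cost=20.00, rented_cost_per_run=0.50,
                       owned_cost_per_run=0.05)
print(runs)  # 45
```

Under these assumed numbers the one-time build pays for itself on the 45th run. After that, every run keeps the per-run saving.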

3. Structure context where the model can pull it

Skills handle judgment and process. But the deepest context a company has isn't process — it's knowledge. What your customers actually said. Why your best deals closed. The exact language a buyer used when they described the problem your product team has never named. No prompt template carries that, and pasting transcripts into a chat window doesn't scale.

This is the bet we've made with Proofmap. The proof of record is stakeholder intelligence captured on the record — intentional interviews, on video, dual-consent approved — and structured so AI tools can query it in real time over MCP. The result is that work as nuanced as product positioning — the work content marketers and product marketers are actually judged on — runs on teed-up context instead of demanding minutes of frontier-model inference that still comes back generic. The model isn't guessing how your stakeholders see the world. It's reading what they said. Fewer tokens in, and output you can trace back to a real person.

This is the same architectural distinction we've written about as the difference between generic AI and Proof-Native AI. It's also the real answer to a question every content marketer has typed into a search bar: why does AI content sound so generic? Not because the model is weak — because it was given nothing of yours to reason from. Generic AI is proof-optional: it runs on the internet's general knowledge plus whatever you remembered to paste in, and it produces output indistinguishable from every competitor using the same tools. Proof-Native AI is built on the proof of record. Generic AI accelerates noise. Proof-Native AI accelerates what's real.

Input-Side Leverage

Call the whole pattern input-side leverage. Every hour of structure you build upstream — a template, a skill, a structured proof of record — is reasoning you never pay for again. And unlike a model upgrade, it compounds: the context layer gets richer with every interview captured, every decision rule encoded, every workflow specified.

There's a strategic kicker, too. Model capabilities are converging — every company will have access to roughly the same intelligence on the output side. Your prompts can be copied. Your model choice can be matched. The input layer is the part competitors can't replicate, because it's made of your context: your stakeholders, your decisions, your proof. The moat was never the model.

The gap is rarely in the model's intelligence. It's in what you haven't written down, and that gap gets more expensive with every single prompt.

So before you reach for the bigger model, audit your inputs. Where is your team renting reasoning to reconstruct context that already exists? What would it take to capture it once — structured, on the record, where every model you'll ever use can find it?

That's not a model-selection question anymore. That's an architecture decision. And it's the one that compounds.

Quick Answers

Should I use Claude Sonnet or Opus for marketing content?
Route by how specified the task is, not how hard it feels. When your brief already carries the judgment — voice, constraints, and what good looks like — a fast model like Sonnet can simply execute it. Reserve a thinking model like Opus for genuinely open-ended or ambiguous work, such as building the skill that the fast model will then run every day.
Why does AI-generated marketing content sound so generic?
Not because the model is weak, but because it was given nothing of yours to reason from. Generic AI runs on the internet's general knowledge plus whatever you remembered to paste in, so it produces output indistinguishable from competitors using the same tools. Grounding the model in your own customer language and proof is what makes the output specific.
What is a Claude skill, and how does it lower AI costs?
A Claude skill is a structured document that encodes your voice, strategy, decision rules, and process, which the model loads when it is relevant. Building a good one is hard work worth a thinking model, but once it exists the daily work it governs can run on a cheaper, faster model — so you pay for the reasoning once instead of on every prompt.
What is the proof of record?
The proof of record is stakeholder intelligence captured intentionally and on the record — interviews, on video, dual-consent approved — and structured so AI tools can query it in real time over MCP. It lets a model read what your stakeholders actually said instead of guessing.
What is context engineering?
Context engineering is the practice of treating a model's input as the primary lever — curating the smallest set of high-signal context that lets the model do the job well. Most teams optimize the output side with better prompts and bigger models; the compounding returns are on the input side.
