Your Claude Limit Burns In 90 Minutes Because Of One ChatGPT Habit.
The next generation of frontier AI models — Claude Mythos, whatever ChatGPT drops next, the next Gemini — will be significantly more expensive to run because they’re trained on far more expensive hardware like Nvidia’s GB300 series chips. Meanwhile, ambient compute from cheaper models will become essentially free. But if you want cutting-edge intelligence, you need to stop burning tokens and blaming the model. Token efficiency is rapidly becoming one of the most valuable skills for anyone working with AI, because the models are not expensive — your habits are what cost a lot. Jensen Huang gave a real number in a real interview: $250,000 per year is what he expects an individual engineer to spend on tokens. You don’t want to be spending that kind of money on tokens you didn’t need to spend.
Nate illustrates this with a real-world example: a production AI pipeline he personally reviewed that ingests multiple long-form conversations per user, runs analysis across dozens of dimensions, and generates a fully personalized output — all on the most expensive models available. The cost per user? Less than 25 cents. Frontier AI can be absurdly cheap when you know what you’re doing. This video lays out the specific habits that waste tokens across every skill level, the financial math behind sloppy versus clean prompting, a six-question diagnostic framework, and five commandments for agent builders.
The Rookie Mistake: Document Ingestion
The single most common token-wasting habit among new users is dragging raw PDFs into a conversation. A new Claude Desktop user might drag in three PDFs that contain 1,500 words each — just 4,500 words of actual text — and say “Summarize these.” Claude then processes the raw PDFs with all the formatting overhead: headers, footers, embedded fonts, layout metadata, and the entire binary structure gets encoded as tokens. Those 4,500 words of content can balloon to over 100,000 tokens.
The fix is dead simple: convert to Markdown first. Ask Claude to do it, or use any of a number of free web services. It takes 10 seconds. The result is a clean set of content between 4,000 and 6,000 tokens — a 20x savings. And this waste compounds, because once those 100,000 tokens are in your conversation history, they bounce back and forth with every turn.
So many file formats are designed to be human-readable, not AI-readable. If you want to reason about the visual style of a PDF, fine, keep it. But 99% of the time, all you care about is the text. Think about the token efficiency of every file format you feed in. Screenshots are another offender — terribly inefficient compared to just copying and pasting the text.
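To make the compounding concrete, here is a minimal sketch of how much input a single document costs over a conversation. It assumes the full history is resent on every turn and reuses the article's illustrative figures (100,000 tokens raw, about 5,000 as Markdown); `resent_document_tokens` is a hypothetical helper name, not part of any tool.

```python
def resent_document_tokens(document_tokens: int, turns: int) -> int:
    """Total input tokens spent on the document alone across a conversation.

    Assumes the whole history, document included, is sent with every turn.
    """
    if document_tokens < 0 or turns < 0:
        raise ValueError("document_tokens and turns must be non-negative")
    return document_tokens * turns


# Illustrative figures from the article, over an assumed 10-turn chat.
raw_pdf = resent_document_tokens(100_000, turns=10)
markdown = resent_document_tokens(5_000, turns=10)
print(raw_pdf, markdown, raw_pdf // markdown)
```

The ratio stays at 20x, but the absolute gap grows with every turn, which is why the conversion step is worth doing before the first message.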
Conversation Sprawl
If you're doing 20, 30, or 40 turns in a single conversation, you're asking for something no AI was reinforcement-learned, trained, or designed to handle. Every added turn shrinks the share of the conversation that your original instructions occupy. Yes, the models are getting better at anchoring on those original instructions even through compression, but why make them suffer? Why fill up the context window with cruft? Why waste tokens?
The solution is to separate your work into two distinct modes:
- Gathering mode: Think with AI, research, explore. This might span multiple models and multiple conversations. Nate describes his own workflow: he'll go to Grok for X/social sentiment, pipe earnings reports through ChatGPT's thinking mode, run deep searches through Perplexity, and do targeted web search with Claude Opus 4.6. None of these are intended to produce a single answer. They're all evolving research conversations.
- Execution mode: Once you have all the context you need from gathering, pull it together into a fresh, highly structured prompt. Your objective should be so clear that the AI just goes and gets the work done. If a conversation must evolve, conclude it by asking for a summary, then start a brand-new chat with that summary.
Do not mix these two modes. That is how you burn tokens and confuse the AI.
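A rough sketch of why restarting with a summary pays off. It models a chat where each turn resends the full history; the 1,000 tokens per exchange and the 500-token summary are assumptions chosen for illustration, and `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int, carried_in: int = 0) -> int:
    """Input tokens billed over a chat where each turn resends the full history.

    tokens_per_turn: assumed average size of one exchange added to the history.
    carried_in: tokens already present before the first turn (e.g. a pasted summary).
    """
    total = 0
    history = carried_in
    for _ in range(turns):
        history += tokens_per_turn  # the new exchange joins the history
        total += history            # the whole history is sent as input
    return total


# One 30-turn chat versus three 10-turn chats linked by a 500-token summary.
one_long_chat = cumulative_input_tokens(30, 1_000)
three_fresh_chats = (
    cumulative_input_tokens(10, 1_000)
    + 2 * cumulative_input_tokens(10, 1_000, carried_in=500)
)
print(one_long_chat, three_fresh_chats)
```

Under these assumptions the long chat costs 465,000 input tokens and the split version 175,000, because input grows roughly with the square of the turn count.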
The Plugin Tax
Intermediate users who’ve added lots of plugins and connectors to ChatGPT or Claude are paying a silent tax on every single conversation. Those tools load into the context window before you type your first word. Nate knows someone who is over 50,000 tokens into a context window before they’ve typed anything because of how many plugins and connectors they’ve loaded.
It’s like walking into a fully stocked workshop and the first thing you do is pull every single tool off the wall and lay them out on the workbench before deciding what to build. You probably need five tools, not 200. So many people hear about a new plugin, add it because it seemed cool on launch day, and forget it’s there. It’s like a barnacle on a ship — it slows you down, burns tokens, and confuses the model about which tools to use. Audit your plugins. It matters.
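The audit itself can be as simple as listing what loads and how often it is used. The sketch below is illustrative only: the tool names, token sizes, and usage counts are invented, and in practice you would read them from whatever your client reports.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    definition_tokens: int   # tokens the tool's definition occupies in context
    calls_last_30_days: int  # how often it was actually invoked


def audit_tools(tools: list[Tool]) -> tuple[list[str], int]:
    """Return the names of never-used tools and the tokens freed by removing them."""
    unused = [t for t in tools if t.calls_last_30_days == 0]
    return [t.name for t in unused], sum(t.definition_tokens for t in unused)


# Hypothetical inventory; every value here is made up for the example.
tools = [
    Tool("drive_connector", 6_000, 0),
    Tool("web_search", 2_500, 41),
    Tool("calendar", 3_200, 0),
]
unused_names, reclaimed = audit_tools(tools)
print(unused_names, reclaimed)
```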
Stale Context: The Advanced User’s Expensive Mistake
Advanced users — the people comfortable with GitHub repos, local installs, API gateways — have the most leverage but make the most expensive mistakes. When they screw up, it’s at the level of hundreds of thousands or millions of tokens.
If you’re the person responsible for the system prompt on an agent and you haven’t pruned it in the last couple of weeks, you’re being irresponsible. If a hundred lines in your prompt have been there since GPT-3.5 and you’ve never questioned whether they’re still needed, that’s dead weight. If you’re loading an entire repo into the context window because “it seemed to work two generations ago” and you never re-tested, that’s architectural laziness.
The larger trend in AI is clear: in 2025, we needed to frontload a lot of context because the models were dumber. Now in 2026, as models get more intelligent, you can lean out the context window and trust the model to retrieve better. Allow the gains in model intelligence to lean out your context. This is practical preparation for Claude Mythos. And for technical users, these are million-token decisions, especially when an agent runs repeatedly.
The Financial Math: Sloppy vs. Clean
A concrete comparison for a 5-hour work session:
The Sloppy Workflow
- Feed raw PDFs into context (100,000 tokens vs. 5,000)
- 30-turn conversation sprawl
- Use Opus 4.6 for everything including formatting and proofreading
Result:
- Input tokens: ~800,000 to 1,000,000
- Output tokens: ~150,000 to 200,000 (including thinking)
- Cost: ~$8 to $10 (at $5 in / $25 out per million)
The Clean Workflow
- Convert documents to Markdown first
- Start fresh conversations every 10-15 turns
- Use Opus for reasoning, Sonnet for execution, Haiku for polish
- Scope context to what’s actually needed
Result:
- Input tokens: ~100,000 to 150,000
- Output tokens: ~50,000 to 80,000
- Cost: ~$1
That’s an 8x to 10x reduction for the same output. Scaled out:
- Sloppy user: $40-50/week in compute
- Clean user: $5-7/week
- Across a 10-person team on API: roughly $1,600 to $2,000/month versus $250/month for the exact same result
- For subscription users: the difference between hitting your limit daily and forgetting limits exist
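The sloppy-session figure can be checked with a few lines of arithmetic. This sketch assumes prices of $5 per million input tokens and $25 per million output tokens for the top-tier model; `session_cost` is a hypothetical helper. The clean-session cost is not recomputed here because it depends on the mix of cheaper models, which the article does not break down.

```python
def session_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Dollar cost of a session given token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_in + output_tokens * usd_per_m_out) / 1_000_000


# Assumed prices: $5 in / $25 out per million tokens.
sloppy_low = session_cost(800_000, 150_000, 5.0, 25.0)
sloppy_high = session_cost(1_000_000, 200_000, 5.0, 25.0)
print(f"${sloppy_low:.2f} to ${sloppy_high:.2f}")
```

At those prices the sloppy session lands between $7.75 and $10.00, with input and output contributing roughly equal halves.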
What Happens When Mythos Ships
Mythos is rumored to be by far Anthropic's most expensive model. Nate expects a new pricing class well above the current $25-per-million output price, possibly as high as $250 out per million tokens (he frames this as a thought exercise, not a confirmed price). Even at a more conservative $50 out per million, the same point holds: your inefficiency mistakes scale linearly with the price of intelligence. The habits that were tolerable at today's pricing become genuinely expensive at Mythos pricing.
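To see the linear scaling, here is a small sketch that prices the upper-end output volumes from the two workflows (200,000 versus 80,000 tokens) at three output prices. The $50 and $250 figures are the thought-exercise numbers, not confirmed prices, and `output_cost` is a hypothetical helper.

```python
def output_cost(output_tokens: int, usd_per_m_out: float) -> float:
    """Dollar cost of output tokens at a given per-million-token price."""
    return output_tokens * usd_per_m_out / 1_000_000


# Dollar gap between sloppy and clean output volume at each price point.
gaps = {}
for price in (25.0, 50.0, 250.0):
    sloppy = output_cost(200_000, price)
    clean = output_cost(80_000, price)
    gaps[price] = sloppy - clean
    print(f"${price:.0f}/M out: sloppy ${sloppy:.2f}, clean ${clean:.2f}")
```

The ratio between the two workflows never changes, but the dollar gap per session grows from $3 to $30 as the price rises tenfold.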
The “Stupid Button”: 6 Diagnostic Questions
Nate built a diagnostic tool — he calls it the “stupid button” — to help users audit their token usage. It’s built around six questions:
- Are you feeding raw PDFs, images, or screenshots when all you need is text? Convert to Markdown. Always. Claude can do it in seconds.
- When was the last time you started a fresh conversation? Every turn in a conversation sends the entire history back to the model. This applies to Claude, ChatGPT, Gemini, Llama, and Qwen; it's how all LLMs work. Long-running conversations also correlate with "LLM psychosis," where models drift from their instructions.
- Are you using the most expensive model for everything? Simple formatting tasks don't need Opus or GPT-5.4 Pro mode. Don't bring a Ferrari to the grocery store.
- Do you know what's loading in context before you type? In Claude Code, you can run /context to check. Look at how many connectors and plugins are active. If you enabled Google Drive months ago and never use it, drop it.
- Are you caching stable context? (API builders) Prompt caching gives a 90% discount on repeated content. Cache hits on Opus cost $0.50 per million tokens versus $5 per million standard. If your system prompt, tool definitions, and reference documents aren't cached, you're pouring money down the drain.
- How are you handling web search? Native web search in heavy LLMs tends to be token-inefficient. Perplexity via MCP burns roughly 10,000 to 50,000 fewer tokens per search, runs about 5x faster, and returns structured citations. The broader point: use dedicated, token-efficient services for search rather than letting your most expensive model handle it natively.
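The third question, matching the model to the task, can be sketched as a simple lookup. The mapping follows the Opus for reasoning, Sonnet for execution, Haiku for polish split from the clean workflow; the tier strings are plain labels rather than real API model identifiers, and defaulting unknown tasks to the middle tier is an assumption of this example.

```python
# Tier labels only; these are not actual API model IDs.
TIER_FOR_TASK = {
    "reasoning": "opus",
    "execution": "sonnet",
    "polish": "haiku",
}


def pick_model(task_type: str) -> str:
    """Map a task type to a model tier; unknown tasks fall back to the middle tier."""
    return TIER_FOR_TASK.get(task_type, "sonnet")


print(pick_model("polish"), pick_model("reasoning"), pick_model("translation"))
```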
What the Stupid Button Actually Contains
The tool ships as three components:
- A prompt — Run it against your recent conversations. It identifies which documents you’re feeding raw, flags conversation sprawl, spots model misuse, and checks for redundant context loading. Works on any plan, no setup required.
- A skill — An invocable audit that measures per-session token overhead, flags system prompt load, checks plugin and skill loading, and gives you a before-and-after comparison. Like a gas gauge for your tokens.
- Guardrails — Infrastructure that sits on your knowledge store (Open Brain). Automatic Markdown conversion for documents hitting the store, index-first retrieval instead of dump-and-search, and context scoping that enforces minimum viable context per query. This is where token management stops being personal discipline and becomes self-maintaining infrastructure.
The 5 Commandments for AI Agents
Agents can burn hundreds of millions of tokens. Context management for agentic systems requires its own discipline:
1. Index Your References
If an agent is getting raw documents instead of relevant chunks, you’ve already failed. The entire point of retrieval is to scope what the model sees to what it needs. Dumping a full document set into the window on every agent call is wildly irresponsible. Don’t make the agent do work it doesn’t need to do.
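As a minimal sketch of index-first retrieval, the example below builds a keyword index over pre-split chunks and returns only the chunks that match a query. Plain keyword overlap is a deliberately simple stand-in for whatever retrieval method a real system uses, and the chunk texts are invented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks: list[str]) -> dict[str, set[int]]:
    """Map each word to the set of chunk positions that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for i, chunk in enumerate(chunks):
        for word in tokenize(chunk):
            index[word].add(i)
    return index


def retrieve(query: str, chunks: list[str], index: dict[str, set[int]],
             top_k: int = 2) -> list[str]:
    """Return up to top_k chunks ranked by how many query words they share."""
    scores: dict[int, int] = defaultdict(int)
    for word in tokenize(query):
        for i in index.get(word, ()):
            scores[i] += 1
    best = sorted(scores, key=lambda i: (-scores[i], i))[:top_k]
    return [chunks[i] for i in best]


# Invented reference chunks for illustration.
chunks = [
    "Refund policy: refunds are issued within 14 days.",
    "Shipping takes 3 to 5 business days.",
    "Refunds require the original receipt.",
]
index = build_index(chunks)
hits = retrieve("how long does shipping take", chunks, index)
print(hits)
```

Only the shipping chunk reaches the model; the other two never enter the context window.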
2. Prepare Your Context for Consumption
Pre-process, pre-summarize, pre-chunk. A reference document should arrive in an agent’s context ready to be used — not ready to be read or processed. If the model’s first several thousand tokens of reasoning are spent dealing with crappy pre-processing, you’re not being a responsible agent builder.
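One piece of that preparation, pre-chunking, might look like the sketch below. It groups paragraphs into chunks under a word budget; the 200-word default is an arbitrary assumption, and word count is used as a rough proxy for tokens.

```python
def chunk_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Group paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Close the current chunk if this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
print(chunk_paragraphs(sample, max_words=5))
```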
3. Cache Your Stable Context
System prompts, tool definitions, persona instructions, reference material — anything stable should be cached at a 90% discount on cache hits. This is the lowest-effort, highest-impact optimization available. If you’re making thousands of agent calls a day without caching, you’re just pouring money down the drain.
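A back-of-the-envelope sketch of what caching saves on input. It assumes a 90% discount on the stable prefix and, as a simplification, treats every call as a cache hit, ignoring any cache-write premium or expiry. The call volume, token sizes, and $5-per-million input price are illustrative assumptions, and `input_cost_usd` is a hypothetical helper.

```python
def input_cost_usd(calls: int, stable_tokens: int, variable_tokens: int,
                   usd_per_m_in: float, cached: bool,
                   cache_discount: float = 0.90) -> float:
    """Input cost for repeated agent calls that share a stable prefix.

    Simplification: when cached is True, every call is billed as a cache hit.
    """
    stable_rate = usd_per_m_in * (1 - cache_discount) if cached else usd_per_m_in
    per_call = stable_tokens * stable_rate + variable_tokens * usd_per_m_in
    return calls * per_call / 1_000_000


# Assumed workload: 1,000 calls, 20,000 stable tokens, 2,000 variable tokens each.
uncached = input_cost_usd(1_000, 20_000, 2_000, 5.0, cached=False)
cached = input_cost_usd(1_000, 20_000, 2_000, 5.0, cached=True)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
```

With these numbers the input bill drops from $110 to about $20, since the stable prefix dominates each call.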
4. Scope Context to the Minimum Viable Need
A planning agent does not need your full codebase. An editing agent doesn’t need your project roadmap. Passing everything to every agent is architectural laziness with real costs — both in tokens burned and in degraded agent performance. Models perform worse when drowning in irrelevant context.
And if you’re thinking “won’t smarter agents just find what they need?” — yes, but only efficiently if you give them a searchable repo that is pre-processed so they can retrieve the relevant slice. Take the time to do it right.
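Scoping can be enforced in code rather than left to habit. In this sketch each agent role has an allowlist of context items, and anything outside the list is never passed along. The role names and item names are hypothetical.

```python
# Hypothetical allowlists: which context items each agent role may receive.
CONTEXT_BY_ROLE = {
    "planner": {"roadmap", "architecture_summary"},
    "editor": {"style_guide", "draft"},
}


def scope_context(role: str, available: dict[str, str]) -> dict[str, str]:
    """Return only the context items the given role is allowed to see."""
    allowed = CONTEXT_BY_ROLE.get(role, set())
    return {name: text for name, text in available.items() if name in allowed}


available = {
    "roadmap": "Q3 goals...",
    "architecture_summary": "Services and their owners...",
    "full_codebase": "(very large)",
    "style_guide": "Tone and formatting rules...",
    "draft": "Current article text...",
}
print(sorted(scope_context("planner", available)))
print(sorted(scope_context("editor", available)))
```

An unknown role receives nothing by default, which fails safe on tokens rather than defaulting to everything.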
5. Measure What You Burn
If you don’t know your per-call token cost, you’re optimizing blind. Instrument your agent calls. Track:
- Input tokens per call
- Output tokens per call
- Overall model mix
- Cost ratios
You cannot improve what you do not measure. Most teams building agentic systems think a lot about semantic correctness — whether the output reads well — but not about functional correctness or model cost. Today, the $12-per-run cost might not make or break the project. But plan for a world where models are more expensive and you have to scale.
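Instrumentation does not need to be elaborate. The sketch below records input and output tokens per call and rolls them up into cost per model and average tokens per call. The model labels and prices are placeholders; in a real system the token counts would come from the usage data your provider returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageLog:
    # model label -> (USD per million input tokens, USD per million output tokens)
    prices: dict[str, tuple[float, float]]
    calls: list[tuple[str, int, int]] = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Store one call's token counts."""
        self.calls.append((model, input_tokens, output_tokens))

    def cost_by_model(self) -> dict[str, float]:
        """Total dollar cost per model across all recorded calls."""
        totals: dict[str, float] = defaultdict(float)
        for model, tokens_in, tokens_out in self.calls:
            price_in, price_out = self.prices[model]
            totals[model] += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        return dict(totals)

    def average_tokens_per_call(self) -> tuple[float, float]:
        """Mean (input, output) tokens per call; (0.0, 0.0) if nothing is recorded."""
        if not self.calls:
            return (0.0, 0.0)
        n = len(self.calls)
        return (sum(c[1] for c in self.calls) / n, sum(c[2] for c in self.calls) / n)


# Placeholder labels and prices, chosen only for the example.
log = UsageLog(prices={"large": (5.0, 25.0), "small": (1.0, 5.0)})
log.record("large", 10_000, 2_000)
log.record("small", 4_000, 1_000)
print(log.cost_by_model(), log.average_tokens_per_call())
```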
At some point in the last few months, burning tokens became a badge of honor. And there is a degree to which you need to be burning tokens to do meaningful work in the age of AI. None of this is an ask to stop using tokens — it’s an ask to use them efficiently. When Jensen says $250,000 in token costs per developer, the question isn’t the dollar amount. It’s whether those were smart tokens. So yes, max out your Claude. Be bold and audacious about what you aim these models at. But know what you’re spending on, don’t spend it on silly stuff like unconverted PDFs, and actually direct those tokens toward meaningful work. If we can be more efficient, we can do a whole lot more cool and creative stuff with the tokens we have.
Meta
Added: 2026-04-02