How to reduce your token footprint
Newer, more capable models use more tokens, and users paste whole books, code bases, and videos into chats. We all knew where this was going: monthly AI subscriptions are subsidized and don’t cover the cost of compute.
A token is a unit of text, roughly a word or part of a word. AI companies bill you for input tokens and for the tokens the model generates, the output. For Opus-4.7, a million tokens in costs $5, a million tokens out costs $25.
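At those rates, estimating what a request costs is back-of-the-envelope arithmetic. A minimal sketch using the Opus prices quoted above (the function and its defaults are illustrative, not any provider’s SDK):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float = 5.0, out_price: float = 25.0) -> float:
    """Estimate the dollar cost of one API call.

    Prices are per million tokens (the Opus figures quoted above).
    """
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# A long prompt (50k tokens in) with a short answer (1k tokens out):
print(f"${request_cost(50_000, 1_000):.3f}")  # → $0.275
```

Note that the input side dominates here: output tokens are five times pricier per token, but you usually send far more than you get back.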
Customers using OpenAI’s or Anthropic’s API are already familiar with pay-as-you-go billing. While some startups and tech companies brag about tokenmaxxing, the rest of us are figuring out which AI tasks can be delegated to smaller, token-efficient models: reasonable model, not reasoning model.
If you’re not coding, AI companies usually hide the token counts. A million tokens is like ten novels of text. That sounds like a lot, but it really isn’t, especially not if you’re pushing code or working with media. So what do you do if you have to watch your token use?
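Since the chat UI won’t show you a count, a rough rule of thumb is about four characters of English per token. A quick heuristic sketch, assuming you don’t have a real tokenizer at hand:

```python
def rough_token_estimate(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English prose.

    Real tokenizers give exact counts; this is only a sanity check
    before pasting something huge into a chat.
    """
    return max(1, len(text) // 4)

# A ~200-page document at ~2,500 characters per page:
document = "x" * (200 * 2_500)
print(rough_token_estimate(document))  # → 125000, an eighth of a million tokens
```

The ratio varies by language and content: code, non-English text, and anything with lots of punctuation tokenize less efficiently than plain English.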
- Simple tasks (spell-checking, translation, web search): a smaller model will often get you there just fine. ChatGPT comes with mini and nano versions. Claude Opus has smaller siblings called Haiku and Sonnet. Working with German texts, Gemini-3-Flash ranks higher than GPT-5.4 at a fraction of the cost, and in creative writing it beats Sonnet-4.5. (LLM Leaderboard)
- Start small, go bigger: If the small models don’t deliver, switch to the mid-tier: GPT-5.3, Sonnet-4.6, Gemini-3.1. Try dialing down the reasoning effort. Only then bring out the premium models.
- Fresh start. Long conversations accumulate context, and the model processes the entire conversation with every new message. New topic, new chat.
- Watch big documents. Pasting in large texts, uploading big files, or referencing them significantly increases the tokens processed per message. Where possible, pull in specific sections rather than the whole thing.
- Tell AI to shut up. Sometimes it’s nice to have a freewheeling conversation, but if you know what you want, tell the machine to cut it: “Be concise. Give only the answer. No intro, no outro, no summary. Use bullets only if necessary.” While it might not reduce hidden reasoning effort, it helps keep the precious output tokens in check.
- Brace for impact: If you’re building a whole new web app for what could have been a short Python script, or generating a thirteen-page research report based on 65 web sources to tell you where to go for dinner, that’s gonna cost you. Plan accordingly.
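The start-small-go-bigger advice can also be wired into a script. A minimal sketch of an escalation loop: the tier names, the `ask` callable, and the quality check are all placeholders you’d replace with your provider’s SDK and your own validation logic:

```python
from typing import Callable

# Cheapest first; escalate only when the answer fails your own check.
MODEL_TIERS = ["small-model", "mid-model", "premium-model"]

def ask_with_escalation(prompt: str,
                        ask: Callable[[str, str], str],
                        good_enough: Callable[[str], bool]) -> str:
    """Try models from cheapest to priciest, returning the first
    answer that passes the caller's quality check."""
    answer = ""
    for model in MODEL_TIERS:
        answer = ask(model, prompt)
        if good_enough(answer):
            return answer
    return answer  # fall back to the premium model's last answer

# Stubbed usage: pretend only the mid-tier model gives a long-enough answer.
fake_replies = {"small-model": "no",
                "mid-model": "42 because ...",
                "premium-model": "42, at length"}
result = ask_with_escalation("What is 6*7?",
                             ask=lambda model, p: fake_replies[model],
                             good_enough=lambda a: len(a) > 5)
print(result)  # → 42 because ...
```

The catch is the quality check: if you can’t tell a bad answer from a good one programmatically, you end up paying for the small model’s attempt on top of the big model’s, so reserve this pattern for tasks where failure is cheap to detect.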