By Sleepy.md
In the telegram era, every word cost money, so words were treated like gold. People grew accustomed to condensing long messages into terse phrases, where a simple "safe return" could replace a long letter and "safety first" was the most-repeated reminder.
Later, when the telephone entered households, long-distance calls were billed by the second. Parents' calls were always brief and to the point: once the main topic was covered, they would hurry to hang up, and if the conversation ran even slightly long, the thought of the expensive bill would cut the small talk short.
Later still, broadband entered homes, with internet access billed by the hour. People would stare at the timer on the screen, closing web pages as soon as they had read them, and only dared to download videos, because streaming was a luxury back then. At the end of every download progress bar lay people's longing to "connect with the world" and their fear of an "insufficient balance."
The unit of billing kept changing, but the instinct to save money remained timeless.
Today, Tokens have become the currency of the AI era. However, most people have yet to learn how to budget in this era because we have not yet grasped how to calculate gains and losses within invisible algorithms.
When ChatGPT emerged in 2022, hardly anyone cared about what Tokens were. It was the era of AI feasts, where you could chat as much as you wanted for $20 a month.
But since the recent rise of AI Agents, Token expenses have become something that everyone using an AI Agent must pay attention to.
Unlike simple Q&A conversations, behind a task flow are hundreds or thousands of API calls. The independent thinking of an Agent comes at a cost. Every self-correction, every tool invocation corresponds to fluctuations in the bill. Suddenly, you find that the money you deposited is no longer sufficient, and you have no idea what the Agent has been up to.
In real life, everyone knows how to save money. When buying groceries at the market, we know to clean off the mud and wilted leaves before weighing them. Taking a taxi to the airport, experienced drivers know to avoid the elevated roads during rush hour.
The logic of saving money in the digital world is similar, except the billing unit has changed from "kilograms" and "kilometers" to Tokens.

In the past, saving was due to scarcity; in the AI era, saving is for precision.
Through this article, we hope to help you outline a methodology for saving money in the AI era so that you can spend every penny wisely.
In the AI era, the value of information is no longer determined by its breadth but by its purity.
The billing logic of AI is based on the number of words it reads. Whether you feed it profound insights or meaningless jargon, as long as it reads it, you have to pay.
Therefore, the first mindset to save Tokens is to engrave "Signal-to-Noise Ratio" into your subconscious.
Every word, every image, every line of code you feed AI has a cost. So before handing anything over to AI, remember to ask yourself: how much of this is truly needed by AI? How much is muddy and rotten?
For example, verbose opening greetings like "Hello, please help me with...", background introductions repeated from earlier turns, and code comments that should have been deleted are all mud and wilted leaves.
Furthermore, the most common waste is to directly feed AI a PDF or a webpage screenshot. While this may save you effort, in the AI era, "saving effort" often means "costing more."
A well-formatted PDF contains not only the main content but also headers, footers, chart labels, hidden watermarks, and a large amount of typesetting markup. None of these elements helps AI understand your question, yet you are charged for all of them.
Next time, remember to convert the PDF to clean Markdown text before feeding it to AI. When you turn a 10MB PDF into a 10KB clean text, you not only save 99% of the cost but also significantly speed up AI's processing.
Images are another money-eating beast.
In the logic of visual models, AI doesn't care if your photo is beautiful; it only cares about how much pixel area you occupy.
Using Claude's official calculation logic: Image Token Consumption = Width Pixels × Height Pixels ÷ 750.
For a 1000×1000 pixel image, it consumes about 1334 Tokens, which, according to Claude Sonnet 4.6 pricing, is approximately $0.004 per image;
However, if the same image is compressed to 200×200 pixels, it only consumes 54 Tokens, reducing the cost to $0.00016, a difference of a full 25 times.
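That arithmetic is simple enough to check yourself before uploading anything. A small sketch in Python: the ÷750 formula and the $3-per-million input rate are the figures quoted above, and actual billing may differ slightly by model and version.

```python
import math

# Sketch of the image-token arithmetic described above. The /750 formula is
# Claude's published approximation; the $3-per-million rate is the quoted
# Sonnet-class input price (an assumption; check current pricing).

INPUT_PRICE_PER_TOKEN = 3 / 1_000_000  # USD per input token (assumed rate)

def image_tokens(width: int, height: int) -> int:
    """Approximate Token count for an image of the given pixel dimensions."""
    return math.ceil(width * height / 750)

def image_cost(width: int, height: int) -> float:
    """Approximate input cost in USD for a single image."""
    return image_tokens(width, height) * INPUT_PRICE_PER_TOKEN

print(image_tokens(1000, 1000))  # 1334
print(image_tokens(200, 200))    # 54
```

Running the numbers before sending an image makes the 25-fold gap tangible: shrinking the picture to the minimum usable resolution is almost always the single cheapest optimization.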
Many people directly feed AI high-resolution photos taken with their phones or 4K screenshots, unknowingly consuming Tokens that could be enough for AI to read more than half of a novella. If the task is only to recognize the text in the image or perform simple visual judgments, such as having AI recognize the amount on an invoice, read text in an instruction manual, or determine if there is a traffic light in the image, then 4K resolution is simply a waste. Compressing the image to the minimum usable resolution is sufficient.
However, the easiest way to waste Tokens at the input end is actually not the file format but the inefficient way of speaking.
Many people treat AI like a human neighbor, accustomed to communicating in a social, chatty manner: opening with a sentence like "help me write a webpage," waiting for AI to spit out a half-finished product, then adding details and going back and forth repeatedly. This toothpaste-squeezing style of conversation makes AI regenerate content again and again, with each round of revision adding to the Token bill.
Engineers at Tencent Cloud have found in practice that for the same requirement, a toothpaste-squeezing multi-round conversation often consumes Tokens that are 3 to 5 times what could be explained in one go.
The real way to save money is to abandon this inefficient social probing, clearly state the requirements, boundary conditions, and reference examples in one go. Spend less effort explaining "what not to do" because negations often consume more understanding costs than affirmations; tell it directly "how to do it" and provide a clear, correct demonstration.
Also, if you know where the target is, tell AI directly, don't let AI play detective.
When you command AI to "find some user-related code," it must conduct large-scale scanning, analysis, and guesswork in the background; whereas when you directly tell it to "look at the src/services/user.ts file," the difference in Token consumption is like night and day. In the digital world, information symmetry is the greatest efficiency.
There's an unspoken rule in large model billing that many people aren't aware of: output Tokens are usually 3 to 5 times more expensive than input Tokens.
In other words, what AI says costs far more than what you say to it. Taking Claude Sonnet 4.6's pricing as an example, input costs only $3 per million Tokens, while output jumps to $15, a full five-fold difference.
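That asymmetry is easy to sanity-check before running a batch job. A rough per-call estimator, hardcoding the Sonnet-class rates quoted above (an illustrative sketch; consult the current pricing page for exact numbers):

```python
# Per-call cost estimator using the rates quoted above:
# $3 per million input Tokens, $15 per million output Tokens (assumed).

INPUT_RATE = 3 / 1_000_000    # USD per input token
OUTPUT_RATE = 15 / 1_000_000  # USD per output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one API call."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Equal token counts in and out, yet output accounts for 5/6 of the bill.
print(call_cost(2000, 2000))
```

With equal traffic in both directions, five sixths of the spend is output, which is why trimming what the model says pays off faster than trimming what you say.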
All those "Alright, I fully understand your requirements and will now begin to answer them..." polite opening lines and those "Hope the above information is helpful to you" polite endings are social etiquette in human communication, but on an API bill, these formalities with zero informational value will also cost you money.
The most effective way to curb waste at the output end is to set rules for AI. Use system prompts to tell it explicitly: no small talk, no explanations, no restating the request; just give the answer.
These rules only need to be set once and will take effect in every conversation, truly embodying the principle of "one-time input, perpetual benefit" in finance. However, when establishing these rules, many people fall into another trap: issuing verbose natural language instructions.
Engineer-tested data shows that the efficacy of instructions lies not in word count, but in density. By compressing a 500-word system prompt to 180 words, removing meaningless pleasantries, consolidating repeated instructions, and restructuring paragraphs into a concise itemized list, the quality of AI output remains almost unchanged, yet token consumption per call can plummet by 64%.
Another, more proactive means of control is limiting output length. Many people never set an output cap, allowing AI free rein, which often leads to extreme cost escalation. You may only need a brief, straightforward sentence, but AI, in an effort to showcase a certain "intellectual sincerity," unreservedly generates an 800-word essay.
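Most APIs expose this cap directly as a request parameter. A sketch of what such a request might look like, borrowing the Anthropic Messages API's field names; the model id, cap, and wording here are illustrative assumptions, not a live call:

```python
# Illustrative request shape: terse system rules plus a hard output cap.
# Field names follow the Anthropic Messages API; the model id and values
# are placeholders to adapt to your own client.

request = {
    "model": "claude-sonnet-4-6",  # placeholder model id
    "max_tokens": 150,             # hard ceiling on billable output
    "system": ("Answer directly. No greetings, no restating the question, "
               "no closing pleasantries."),
    "messages": [
        {"role": "user", "content": "Which HTTP methods are idempotent?"}
    ],
}

print(request["max_tokens"])  # 150
```

The `max_tokens` ceiling is a worst-case guarantee: even if the model tries to deliver an 800-word essay, generation stops, and billing stops, at the cap.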
If you seek pure data, you should compel AI to return results in a structured format rather than lengthy natural language descriptions. Given an equivalent amount of information, JSON format incurs much lower token consumption compared to prose. This is because structured data eliminates all redundant conjunctions, particles, and explanatory modifiers, retaining only a high concentration of logical core. In the AI era, you should be acutely aware that what is worth paying for is the value of the outcome, not that meaningless self-explanation from AI.
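The difference is easy to see side by side. A small comparison (character counts are only a rough proxy for Tokens, since tokenization varies by model):

```python
import json

# The same facts as prose vs. compact JSON: the structured form drops the
# connectives and modifiers and keeps only the data.

prose = ("The user's name is Alice, she is 34 years old, and her account "
         "is currently active.")
structured = json.dumps({"name": "Alice", "age": 34, "active": True},
                        separators=(",", ":"))

print(structured)
print(len(prose), len(structured))  # the JSON form is less than half the length
```

The structured form is also trivially machine-parseable, so downstream steps need no second AI call to extract the fields.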
Furthermore, AI's "overthinking" is also voraciously depleting your account balance.
Some advanced models have an "extended reasoning" mode that conducts massive internal reasoning before responding. This reasoning process also incurs charges based on the price of the output, which can be quite expensive.
This mode is essentially designed for "complex tasks requiring deep logical support." However, most people also choose this mode when asking simple questions. For tasks that do not require deep reasoning, explicitly instructing AI to "skip explanations and provide the answer directly" or manually turning off extended reasoning can save you a considerable amount of money.
Large models have no true memory; they simply reread the past, over and over.
This is an underlying mechanism that many people are unaware of. Every time you send a new message in a conversation window, AI does not start understanding from that sentence; instead, it rereads all your past interactions, including every round of dialogue, every piece of code, and every referenced document, before responding to you.
In the billing of Tokens, this "learning from the past" is by no means free. As the rounds of conversation stack up, even if you're just asking about a simple word, the cost of AI rereading the entire old account grows exponentially. This mechanism determines that the heavier the conversation history, the more expensive each of your questions becomes.
Someone tracked 496 real dialogues containing over 20 messages each and found that on the 1st message the AI read an average of 14,000 Tokens, costing about 3.6 cents per message; by the 50th message, the average reading was 79,000 Tokens, costing about 4.5 cents per message, roughly 25% more expensive. Moreover, by the 50th message, the context the AI has to reprocess is already 5.6 times that of the 1st message.
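A toy simulation shows the mechanic. The base-context and per-turn sizes below are invented assumptions; only the reread-everything behavior mirrors the billing model described above.

```python
# Toy model: each new message resends the entire history, so the n-th
# message's input cost grows linearly with n (and total spend quadratically).
# Sizes are invented; the $3-per-million input rate is the one quoted earlier.

INPUT_RATE = 3 / 1_000_000  # USD per input token (assumed)
BASE_CONTEXT = 2_000        # assumed system prompt + first message, in tokens
TOKENS_PER_TURN = 500       # assumed tokens added per question-answer round

def input_cost_of_message(n: int) -> float:
    """Input cost of the n-th message: base context plus all prior turns reread."""
    context = BASE_CONTEXT + (n - 1) * TOKENS_PER_TURN
    return context * INPUT_RATE

print(input_cost_of_message(1))   # early messages are cheap
print(input_cost_of_message(50))  # the same short question, 50 turns in
```

In this toy setup the 50th message costs over thirteen times the 1st, even though the question you typed may be a single word; the history, not the question, is what you are paying for.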
To address this issue, the simplest habit is: one task, one dialogue box.
When a topic is discussed, promptly start a new dialogue; do not treat AI as an always-on chat window. This habit sounds simple, but many people just can't do it, always thinking, "What if I need to refer back to the previous content?" In reality, most of the time, those "what-ifs" you worry about never occur, and for that "what-if," you end up paying multiple times more for every new message.
When a conversation does need to continue but the context has become lengthy, we can use some tools' compression functions. Claude Code has a /compact command that can condense the long dialogue history into a short summary, helping you practice cyber decluttering.
There's another money-saving mechanism called Prompt Caching. If you repeatedly use the same system prompt, or need to reference the same document in every conversation, the provider can cache this content. On subsequent calls it charges only a minimal cache-read fee rather than the full price each time.
Anthropic's official pricing displays that the Token price for cached hits is 1/10 of the regular price. OpenAI's Prompt Caching similarly reduces input costs by approximately 50%. A paper published in January 2026 on arXiv examined long tasks across multiple AI platforms and found that prompt caching could reduce API costs by 45% to 80%.
In other words, for the same content, the first time you feed it to AI, you pay the full price, but on subsequent calls, you only pay 1/10. For users who need to repeatedly use the same set of specification documents or system prompts every day, this feature can save a significant amount of Tokens.
However, Prompt Caching has a prerequisite: the wording of your system prompt and the content and order of your reference documents must stay exactly the same and sit at the beginning of the request. Change the content in any way and the cache is invalidated, with full-price billing applying again. So if you have a set of fixed work norms, hardcode them and avoid casual modifications.
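With Anthropic's API, for instance, the stable prefix is marked cacheable with a `cache_control` block. The sketch below follows the shape described in Anthropic's prompt-caching documentation, but the model id and prompt text are placeholders; verify field names against the current docs before relying on it.

```python
# Request sketch with a cacheable system prompt. Everything up to the block
# carrying `cache_control` can be cached; later calls that resend it verbatim
# pay the discounted cache-read rate instead of the full input price.

STYLE_GUIDE = "Your fixed, rarely-changing work norms go here."  # placeholder

request = {
    "model": "claude-sonnet-4-6",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": STYLE_GUIDE,
            "cache_control": {"type": "ephemeral"},  # mark prefix as cacheable
        }
    ],
    "messages": [{"role": "user", "content": "Review this paragraph for tone."}],
}

print(request["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```

Note that the cacheable block sits at the front of the request and the variable user message comes after it, which is exactly the "stable prefix first" discipline the cache requires.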
The final context management technique is on-demand loading. Many people like to cram all specifications, documents, and notes into the system prompts, just in case.
However, the cost of doing this is that when you are simply performing a straightforward task, you are forced to load thousands of words of rules, wasting a bunch of tokens for no reason. Claude Code's official documentation suggests keeping CLAUDE.md to under 200 lines, breaking down specialized rules for different scenarios into separate skill files, and loading the rules only for the scenario in use. Maintaining absolute purity of context is the highest form of respect for computational power.
Various AI models have a significant price difference.
Claude Opus 4.6 costs $5 per million Tokens of input and $25 for output, while Claude Haiku 3.5 needs only $0.8 for input and $4 for output, more than a six-fold difference. Having the top-tier model do the grunt work of gathering and formatting information is not only slow but also very expensive.

The smart approach is to apply the common human societal concept of "division of labor" to the AI community, assigning tasks of different difficulties to models at different price points.
Just as in the real world when you hire someone for a job, you wouldn't specifically hire a bricklaying expert with a million-dollar salary to do manual labor on a construction site. AI works the same way. Claude Code's official documentation also explicitly recommends: use Sonnet for most programming tasks, reserve Opus for complex architectural decisions and multi-step reasoning, and designate Haiku for simple subtasks.
A more specific practical solution is to build a "two-stage workflow." In the first stage, use free or inexpensive basic models to do preliminary dirty work, such as data collection, format cleaning, initial draft generation, simple classification, and summarization. Then, in the second stage, feed the refined essence to top-tier models for core decision-making and deep refinement.
For example, if you need to analyze a 100-page industry report, you can first use Gemini Flash to extract key data and conclusions from the report, condense it into a 10-page summary, and then pass this summary to Claude Opus for in-depth analysis and judgment. This two-stage workflow can significantly reduce costs while ensuring quality.
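A minimal routing sketch of that division of labor. The model names and task categories below are invented stand-ins; the point is only that preprocessing and deep analysis resolve to different price tiers:

```python
# Route cheap preprocessing to a budget model and reserve the premium model
# for the distilled, high-value step. Names below are illustrative only.

CHEAP_MODEL = "gemini-flash"   # stage 1: extract, clean, summarize, classify
PREMIUM_MODEL = "claude-opus"  # stage 2: analysis on the distilled summary

SIMPLE_TASKS = {"extract", "clean", "summarize", "classify"}

def pick_model(task: str) -> str:
    """Return the cheapest model adequate for the task category."""
    return CHEAP_MODEL if task in SIMPLE_TASKS else PREMIUM_MODEL

print(pick_model("summarize"))  # gemini-flash
print(pick_model("analyze"))    # claude-opus
```

In practice the routing table grows with your workload, but even this two-line rule captures the core economics: never pay Opus prices for Haiku work.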
Going beyond simple paragraphing, a more advanced approach is task-based deep work division. A complex engineering task can be broken down into several independent sub-tasks, each matched with the most suitable model.
For example, for a coding task, a cost-effective model can first write the framework and boilerplate code, and then only assign the implementation of the core logic to a more expensive model. Each sub-task has a clean, focused context, resulting in more accurate outcomes and lower costs.
All the previous discussions fundamentally address tactical issues of "how to save money," but many people have overlooked a more foundational logical proposition: Does this action really require spending tokens?
The most extreme form of saving is not algorithm optimization but rather the act of decluttering decision-making. We have grown accustomed to seeking universal answers from AI, forgetting that in many scenarios, invoking an expensive large model is akin to using a cannon to kill a mosquito.
For instance, letting AI automatically handle emails leads to each email being interpreted, categorized, and replied to as an independent task, resulting in significant token consumption. However, if you first spend 30 seconds scanning your inbox, manually filtering out emails that clearly do not need AI processing, and then hand over the rest to AI, the cost immediately reduces to a fraction of the original. Human judgment here is not a hindrance but the best filtering tool.
People from the telegram era knew how much extra it would cost to send an additional word, so they would consider it, displaying an intuitive sense of resource usage. The AI era is no different. When you truly understand how much it costs for AI to say one more sentence, you naturally weigh whether it's worth having AI do it, whether the task requires a top-tier model or a cost-effective one, and if the context is still relevant.
This kind of consideration is the most cost-effective ability. In an era where computational power is becoming more expensive, the smartest usage is not to let AI replace humans but to let AI and humans each do what they excel at. When this sensitivity to tokens becomes a reflexive action, you truly transition from being a subordinate to computation to being its master.
Welcome to join the official BlockBeats community:
Telegram Subscription Group: https://t.me/theblockbeats
Telegram Discussion Group: https://t.me/BlockBeats_App
Official Twitter Account: https://twitter.com/BlockBeatsAsia