OpenClaw Burns 21.5 Million Tokens in a Day? Three Optimization Strategies Drastically Reduce Costs
Original Author: MOSHIII
Original Compilation: Peggy, BlockBeats
Editor’s Note: With the rapid adoption of Agent applications, many teams have encountered a seemingly paradoxical phenomenon: the system runs smoothly, but token costs continue to rise unnoticed. This article, through a dissection of a real OpenClaw workload, finds that the cause of cost explosions often does not stem from user input or model output, but rather from the overlooked cached prefix replay. The model repeatedly reads massive historical context in each round of calls, resulting in enormous token consumption.
The article, using specific session data, demonstrates how large intermediate outputs such as tool outputs, browser snapshots, and JSON logs are continuously written into the historical context and repeatedly read during agent loops.
Through this case study, the author proposes a clear optimization framework: from context structure design and tool output management to compaction mechanism configuration. For developers building Agent systems, this is not only a technical troubleshooting record but also a practical guide to saving real money.
The following is the original text:
I analyzed a real OpenClaw workload and identified a pattern I believe many Agent users will recognize:
Token usage looks “active”
Replies also seem normal
But token consumption suddenly explodes
Below is the structural breakdown, root cause, and practical fix path from this analysis.
Summary
The biggest cost driver is not overly long user messages. It’s the massive cached prefix being repeatedly replayed.
From the session data:
Total tokens: 21,543,714
cacheRead: 17,105,970 (79.40%)
input: 4,345,264 (20.17%)
output: 92,480 (0.43%)
In other words: The cost of most calls is not actually processing new user intent, but repeatedly reading a huge historical context.
The “Wait, How Did This Happen?” Moment
I initially thought high token usage came from: very long user prompts, massive output generation, or expensive tool calls.
But the dominant pattern was actually:
input: a few hundred to a few thousand tokens
cacheRead: 170k to 180k tokens per call
Meaning, the model was repeatedly reading the same massive, stable prefix every single round.
Data Scope
I analyzed data at two levels:
1. Runtime logs
2. Session transcripts
It should be noted:
Runtime logs are mainly for observing behavioral signals (e.g., restarts, errors, configuration issues)
Precise token statistics come from the usage field in session JSONL files
Scripts used:
scripts/session_token_breakdown.py
scripts/session_duplicate_waste_analysis.py
Generated analysis files:
tmp/session_token_stats_v2.txt
tmp/session_token_stats_v2.json
tmp/session_duplicate_waste.txt
tmp/session_duplicate_waste.json
tmp/session_duplicate_waste.png
Where Are Tokens Actually Being Consumed?
1) Session Concentration
One session’s consumption was far higher than others:
570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens
Then a clear cliff-like drop:
ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038
ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584
2) Behavior Concentration
Tokens mainly came from:
toolUse: 16,372,294
stop: 5,171,420
This indicates the problem lies primarily in tool call chain loops, not regular chat.
3) Time Concentration
Token peaks weren’t random; they were concentrated in a few hourly blocks:
2026-03-08 16:00: 4,105,105
2026-03-08 09:00: 4,036,070
2026-03-08 07:00: 2,793,648
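Hourly concentration like this falls out of bucketing per-call totals by timestamp. A sketch, assuming each record carries an ISO-8601 `timestamp` and the same hypothetical `usage` dict as above:

```python
from collections import defaultdict
from datetime import datetime

def tokens_by_hour(records: list[dict]) -> dict[str, int]:
    """Aggregate total tokens per call into 'YYYY-MM-DD HH:00' buckets."""
    buckets: dict[str, int] = defaultdict(int)
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        hour_key = ts.strftime("%Y-%m-%d %H:00")
        usage = rec.get("usage", {})
        buckets[hour_key] += sum(usage.get(k, 0) for k in ("input", "output", "cacheRead"))
    return dict(buckets)
```

Sorting the buckets by value immediately surfaces the peak hours.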
What’s Actually in That Huge Cached Prefix?
Not conversation content, but mainly large intermediate artifacts:
Massive toolResult data blocks
Long reasoning / thinking traces
Large JSON snapshots
File lists
Browser scraped data
Sub-Agent conversation transcripts
In the largest session, the character count was roughly:
toolResult:text: 366,469 characters
assistant:thinking: 331,494 characters
assistant:toolCall: 53,039 characters
Once this content is retained in the historical context, every subsequent call potentially re-reads it via the cache prefix.
Specific Examples (from session files)
Massive context blocks repeatedly appeared at these locations:
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70
Large gateway JSON log (~37k characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134
Browser snapshot + security wrapper (~29k characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219
Huge file list output (~41k characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311
session/status snapshot + large prompt structure (~30k characters)
“Duplicate Content Waste” vs. “Cache Replay Burden”
I also measured the proportion of duplicate content within a single call:
Duplicate ratio: ~1.72%
It exists, but it’s not the main issue.
The real problem is: the absolute volume of the cached prefix is too large.
The structure is: huge historical context, re-read every call, with only a small amount of new input layered on top.
Therefore, the optimization focus is not deduplication, but context structure design.
Why Are Agent Loops Particularly Prone to This?
Three mechanisms compound each other:
1. Large amounts of tool output are written into historical context.
2. Tool loops generate many short-interval calls.
3. The prefix changes very little → the cache re-reads it every time.
If context compaction doesn’t trigger reliably, the problem amplifies quickly.
Most Important Fix Strategies (Sorted by Impact)
P0—Don’t Stuff Huge Tool Outputs into Long-Term Context
For oversized tool outputs:
- Keep a summary + reference path / ID
- Write the original payload to a file artifact
- Do not keep the full original text in chat history
Prioritize limiting these categories:
- Large JSON
- Long directory listings
- Full browser snapshots
- Complete sub-Agent transcripts
P1—Ensure the Compaction Mechanism Actually Works
In this data, configuration compatibility issues appeared multiple times: compaction key invalid.
This silently disables the optimization mechanism.
The correct approach: Only use version-compatible configurations.
Then verify:
openclaw doctor --fix
And check startup logs to confirm compaction is accepted.
P1—Reduce Reasoning Text Persistence
Avoid long reasoning text being repeatedly replayed.
In production: Save a brief summary, not the full reasoning.
P3—Improve Prompt Caching Design
The goal is not to maximize cacheRead. The goal is to use cache on a compact, stable, high-value prefix.
Suggestions:
- Put stable rules in the system prompt.
- Don’t put unstable data into the stable prefix.
- Avoid injecting large amounts of debug data every round.
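The ordering rule can be made mechanical: stable content first, volatile content last, so the cacheable prefix stays byte-identical across calls. A sketch of message assembly under that rule, using a generic chat-message shape rather than OpenClaw's actual API:

```python
def build_messages(system_rules: str, history: list[dict],
                   volatile_context: str, user_input: str) -> list[dict]:
    """Order messages so the cache-friendly prefix (rules + history)
    never interleaves with per-call volatile data."""
    messages = [{"role": "system", "content": system_rules}]  # stable, cacheable prefix
    messages.extend(history)                                  # append-only history
    if volatile_context:
        # Debug/status data goes last, outside the cached prefix.
        messages.append({"role": "user",
                         "content": f"[context]\n{volatile_context}"})
    messages.append({"role": "user", "content": user_input})
    return messages
```

Because the prefix is identical between calls with different volatile data, a prompt cache can reuse it instead of invalidating on every round.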
Practical Stop-Loss Plan (If I Had to Handle This Tomorrow)
1. Identify the session with the highest cacheRead percentage.
2. Execute /compact on runaway sessions.
3. Add truncation + artifactization to tool outputs.
4. Re-run token statistics after each change.
Focus on tracking four KPIs:
cacheRead / totalTokens
toolUse avgTotal/call
Number of calls with >=100k tokens
Largest session percentage
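All four KPIs can be computed from the same per-call usage records. A sketch, assuming each record carries the hypothetical `usage` token counts plus `sessionId` and `type` fields:

```python
def compute_kpis(records: list[dict]) -> dict:
    """Four stop-loss KPIs over per-call usage records."""
    def total(rec: dict) -> int:
        u = rec.get("usage", {})
        return u.get("input", 0) + u.get("output", 0) + u.get("cacheRead", 0)

    totals = [total(r) for r in records]
    grand = sum(totals) or 1  # avoid division by zero on empty input
    cache_read = sum(r.get("usage", {}).get("cacheRead", 0) for r in records)

    tool_calls = [total(r) for r in records if r.get("type") == "toolUse"]
    per_session: dict[str, int] = {}
    for r in records:
        sid = r.get("sessionId", "?")
        per_session[sid] = per_session.get(sid, 0) + total(r)

    return {
        "cacheRead_ratio": cache_read / grand,
        "toolUse_avg_total_per_call":
            sum(tool_calls) / len(tool_calls) if tool_calls else 0.0,
        "calls_over_100k": sum(1 for t in totals if t >= 100_000),
        "largest_session_share": max(per_session.values(), default=0) / grand,
    }
```

Re-running this after each change gives a before/after comparison on exactly the four numbers that matter.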
Signals of Success
If optimization works, you should see:
Significant reduction in 100k+ token calls
Decrease in cacheRead percentage
Decrease in toolUse call weight
Reduced dominance of a single session
If these metrics don’t change, your context strategy is still too permissive.
Reproduction Experiment Commands
python3 scripts/session_token_breakdown.py 'sessions' \
  --include-deleted \
  --top 20 \
  --outlier-threshold 120000 \
  --json-out tmp/session_token_stats_v2.json \
  > tmp/session_token_stats_v2.txt
python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
  --include-deleted \
  --top 20 \
  --png-out tmp/session_duplicate_waste.png \
  --json-out tmp/session_duplicate_waste.json \
  > tmp/session_duplicate_waste.txt
Conclusion
If your Agent system seems to be running fine, but costs keep rising, first check one thing: Are you paying for new reasoning, or for massively replaying old context?
In my case, the vast majority of the cost came from context replay.
Once you realize this, the solution is clear: Strictly control what data enters the long-term context.