OpenClaw Burns 21.5 Million Tokens in a Day? Three Optimization Strategies Drastically Reduce Costs
Original Author: MOSHIII
Original Compilation: Peggy, BlockBeats
Editor’s Note: With the rapid adoption of Agent applications, many teams have encountered a seemingly paradoxical phenomenon: the system runs smoothly, but token costs continue to rise unnoticed. This article, through a dissection of a real OpenClaw workload, finds that the cause of cost explosions often does not stem from user input or model output, but rather from the overlooked cached prefix replay. The model repeatedly reads massive historical context in each round of calls, resulting in enormous token consumption.
The article, using specific session data, demonstrates how large intermediate outputs such as tool outputs, browser snapshots, and JSON logs are continuously written into the historical context and repeatedly read during agent loops.
Through this case study, the author proposes a clear optimization framework: from context structure design and tool output management to compaction mechanism configuration. For developers building Agent systems, this is not only a technical troubleshooting record but also a practical guide to saving real money.
The following is the original text:
I analyzed a real OpenClaw workload and identified a pattern I believe many Agent users will recognize:
Token usage looks “active”
Replies also seem normal
But token consumption suddenly explodes
Below is the structural breakdown, root cause, and practical fix path from this analysis.
Summary
The biggest cost driver is not overly long user messages. It’s the massive cached prefix being repeatedly replayed.
From the session data:
Total tokens: 21,543,714
cacheRead: 17,105,970 (79.40%)
input: 4,345,264 (20.17%)
output: 92,480 (0.43%)
In other words: The cost of most calls is not actually processing new user intent, but repeatedly reading a huge historical context.
The “Wait, How Did This Happen?” Moment
I initially thought high token usage came from: very long user prompts, massive output generation, or expensive tool calls.
But the dominant pattern was actually:
input: a few hundred to a few thousand tokens
cacheRead: 170k to 180k tokens per call
Meaning, the model was repeatedly reading the same massive, stable prefix every single round.
Data Scope
I analyzed data at two levels:
1. Runtime logs
2. Session transcripts
It should be noted:
Runtime logs are mainly for observing behavioral signals (e.g., restarts, errors, configuration issues)
Precise token statistics come from the usage field in session JSONL files
Scripts used:
scripts/session_token_breakdown.py
scripts/session_duplicate_waste_analysis.py
Generated analysis files:
tmp/session_token_stats_v2.txt
tmp/session_token_stats_v2.json
tmp/session_duplicate_waste.txt
tmp/session_duplicate_waste.json
tmp/session_duplicate_waste.png
Where Are Tokens Actually Being Consumed?
1) Session Concentration
One session’s consumption was far higher than others:
570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens
Then a clear cliff-like drop:
ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038
ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584
2) Behavior Concentration
Tokens mainly came from:
toolUse: 16,372,294
stop: 5,171,420
This indicates the problem lies primarily in tool call chain loops, not regular chat.
3) Time Concentration
Token peaks weren’t random; they were concentrated in a few hourly blocks:
2026-03-08 16:00: 4,105,105
2026-03-08 09:00: 4,036,070
2026-03-08 07:00: 2,793,648
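Hourly concentration like this falls out of bucketing per-call totals by timestamp. A sketch, assuming each record carries an ISO-8601 `timestamp` and the same hypothetical `usage` dict as above:

```python
from collections import defaultdict
from datetime import datetime

def tokens_by_hour(records: list[dict]) -> dict[str, int]:
    """Aggregate total tokens per call into 'YYYY-MM-DD HH:00' buckets."""
    buckets: dict[str, int] = defaultdict(int)
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        hour_key = ts.strftime("%Y-%m-%d %H:00")
        usage = rec.get("usage", {})
        buckets[hour_key] += sum(usage.get(k, 0) for k in ("input", "output", "cacheRead"))
    return dict(buckets)
```

Sorting the buckets by value immediately surfaces the peak hours.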
What’s Actually in That Huge Cached Prefix?
Not conversation content, but mainly large intermediate artifacts:
Massive toolResult data blocks
Long reasoning / thinking traces
Large JSON snapshots
File lists
Browser scraped data
Sub-Agent conversation transcripts
In the largest session, the character count was roughly:
toolResult:text: 366,469 characters
assistant:thinking: 331,494 characters
assistant:toolCall: 53,039 characters
Once this content is retained in the historical context, every subsequent call potentially re-reads it via the cache prefix.
Specific Examples (from session files)
Massive context blocks repeatedly appeared at these locations:
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70
Large gateway JSON log (~37k characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134
Browser snapshot + security wrapper (~29k characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219
Huge file list output (~41k characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311
session/status snapshot + large prompt structure (~30k characters)
“Duplicate Content Waste” vs. “Cache Replay Burden”
I also measured the proportion of duplicate content within a single call:
Duplicate ratio: ~1.72%
It exists, but it’s not the main issue.
The real problem is: the absolute volume of the cached prefix is too large.
The structure is: huge historical context, re-read every call, with only a small amount of new input layered on top.
Therefore, the optimization focus is not deduplication, but context structure design.
Why Are Agent Loops Particularly Prone to This?
Three mechanisms compound each other:
1. Large amounts of tool output are written into historical context.
2. Tool loops generate many short-interval calls.
3. The prefix changes very little → the cache re-reads it every time.
If context compaction doesn’t trigger reliably, the problem amplifies quickly.
Most Important Fix Strategies (Sorted by Impact)
P0—Don’t Stuff Huge Tool Outputs into Long-Term Context
For oversized tool outputs:
- Keep a summary + reference path / ID
- Write the original payload to a file artifact
- Do not keep the full original text in chat history
Prioritize limiting these categories:
- Large JSON
- Long directory listings
- Full browser snapshots
- Complete sub-Agent transcripts
P1—Ensure the Compaction Mechanism Actually Works
In this data, configuration compatibility issues appeared multiple times: compaction key invalid.
This silently disables the optimization mechanism.
The correct approach: Only use version-compatible configurations.
Then verify:
openclaw doctor --fix
And check startup logs to confirm compaction is accepted.
P1—Reduce Reasoning Text Persistence
Avoid long reasoning text being repeatedly replayed.
In production: Save a brief summary, not the full reasoning.
P3—Improve Prompt Caching Design
The goal is not to maximize cacheRead. The goal is to use cache on a compact, stable, high-value prefix.
Suggestions:
- Put stable rules in the system prompt.
- Don’t put unstable data into the stable prefix.
- Avoid injecting large amounts of debug data every round.
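The ordering rule can be made mechanical: stable content first, volatile content last, so the cacheable prefix stays byte-identical across calls. A sketch of message assembly under that rule, using a generic chat-message shape rather than OpenClaw's actual API:

```python
def build_messages(system_rules: str, history: list[dict],
                   volatile_context: str, user_input: str) -> list[dict]:
    """Order messages so the cache-friendly prefix (rules + history)
    never interleaves with per-call volatile data."""
    messages = [{"role": "system", "content": system_rules}]  # stable, cacheable prefix
    messages.extend(history)                                  # append-only history
    if volatile_context:
        # Debug/status data goes last, outside the cached prefix.
        messages.append({"role": "user",
                         "content": f"[context]\n{volatile_context}"})
    messages.append({"role": "user", "content": user_input})
    return messages
```

Because the prefix is identical between calls with different volatile data, a prompt cache can reuse it instead of invalidating on every round.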
Practical Stop-Loss Plan (If I Had to Handle This Tomorrow)
1. Identify the session with the highest cacheRead percentage.
2. Execute /compact on runaway sessions.
3. Add truncation + artifactization to tool outputs.
4. Re-run token statistics after each change.
Focus on tracking four KPIs:
cacheRead / totalTokens
toolUse avgTotal/call
Number of calls with >=100k tokens
Largest session percentage
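All four KPIs can be computed from the same per-call usage records. A sketch, assuming each record carries the hypothetical `usage` token counts plus `sessionId` and `type` fields:

```python
def compute_kpis(records: list[dict]) -> dict:
    """Four stop-loss KPIs over per-call usage records."""
    def total(rec: dict) -> int:
        u = rec.get("usage", {})
        return u.get("input", 0) + u.get("output", 0) + u.get("cacheRead", 0)

    totals = [total(r) for r in records]
    grand = sum(totals) or 1  # avoid division by zero on empty input
    cache_read = sum(r.get("usage", {}).get("cacheRead", 0) for r in records)

    tool_calls = [total(r) for r in records if r.get("type") == "toolUse"]
    per_session: dict[str, int] = {}
    for r in records:
        sid = r.get("sessionId", "?")
        per_session[sid] = per_session.get(sid, 0) + total(r)

    return {
        "cacheRead_ratio": cache_read / grand,
        "toolUse_avg_total_per_call":
            sum(tool_calls) / len(tool_calls) if tool_calls else 0.0,
        "calls_over_100k": sum(1 for t in totals if t >= 100_000),
        "largest_session_share": max(per_session.values(), default=0) / grand,
    }
```

Re-running this after each change gives a before/after comparison on exactly the four numbers that matter.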
Signals of Success
If optimization works, you should see:
Significant reduction in 100k+ token calls
Decrease in cacheRead percentage
Decrease in toolUse call weight
Reduced dominance of a single session
If these metrics don’t change, your context strategy is still too permissive.
Reproduction Experiment Commands
python3 scripts/session_token_breakdown.py 'sessions' \
  --include-deleted \
  --top 20 \
  --outlier-threshold 120000 \
  --json-out tmp/session_token_stats_v2.json \
  > tmp/session_token_stats_v2.txt
python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
  --include-deleted \
  --top 20 \
  --png-out tmp/session_duplicate_waste.png \
  --json-out tmp/session_duplicate_waste.json \
  > tmp/session_duplicate_waste.txt
Conclusion
If your Agent system seems to be running fine, but costs keep rising, first check one thing: Are you paying for new reasoning, or for massively replaying old context?
In my case, the vast majority of the cost came from context replay.
Once you realize this, the solution is clear: Strictly control what data enters the long-term context.