400 Million Tokens Burned Overnight — A Multi-Agent AI Postmortem

What happens when an AI agent onboarding loop burns 400M tokens in 15 hours — and why your provider choice is a $1,200 question.

5,080 API requests. Zero crashes. Zero alerts. Everything looked normal. And $1,200+ worth of tokens disappeared into a bridge with no deduplication layer.

The problem

graph TD
    A["Agent A comes online (unstable connection)"] --> B["Orchestrator sends WELCOME + onboarding docs"]
    B --> C["Agent A disconnects / reconnects"]
    C --> D["Orchestrator: 'New agent detected!' — WELCOME again"]
    D --> E["NATS bridge forwards every message blindly"]
    E --> F["Every message spawns a NEW Hermes session"]
    F --> G["Every session loads FULL context: HARNESS, constitution, skills, tools"]
    G --> H["Session processes ping, responds, exits"]
    H --> C
    F --> I["5,080 sessions x ~80K context tokens each"]
    I --> J["400 MILLION input tokens in 15 hours"]

    style A fill:#1a1a2e,stroke:#e94560,color:#fff
    style J fill:#1a1a2e,stroke:#e74c3c,color:#fff
    style F fill:#1a1a2e,stroke:#f39c12,color:#fff

Multi-agent AI systems fail differently than traditional software. A single missing deduplication check — not a crash, not a bug, not a logic error — created a positive feedback loop that burned 400 million tokens across 5,080 API requests. The system worked perfectly. It just silently set money on fire.

The difference between a $23 mistake and a $1,245 mistake? Which AI provider you picked.

What happened

Sunday, May 24, 2026. An orchestrator agent on a Mac Mini M4 discovered a new agent on the network — a secondary agent coming online on a Linux machine with an RTX 3090 GPU.

Following standard onboarding protocol, the orchestrator sent a welcome message through NATS along with onboarding documentation. That message was correct.

It just never stopped sending it.

Every 60-90 seconds, the orchestrator re-sent the same onboarding payload. The bridge forwarded every message faithfully. Each message spawned a fresh agent session loading the full startup context — HARNESS, system prompts, constitution, memory, tool registry, skill manifests, runtime instructions. Thousands of tokens per session. 5,080 times.

The numbers

Metric	Value
Input tokens consumed	~400 million
Output tokens	~3 million
API requests	5,080
Runtime	~15 hours
Sessions spawned	5,080

Why nobody noticed for 15 hours

graph LR
    subgraph VISIBLE["What we saw"]
        V1["Agents responding ✓"]
        V2["Tasks completing ✓"]
        V3["Health checks passing ✓"]
    end

    subgraph INVISIBLE["What was happening"]
        I1["5,080 sessions spawning"]
        I2["400M context tokens loading"]
        I3["Bridge forwarding duplicates"]
    end

    VISIBLE -.->|"LOOKED NORMAL"| INVISIBLE

    style VISIBLE fill:#16213e,stroke:#2ecc71,color:#fff
    style INVISIBLE fill:#16213e,stroke:#e74c3c,color:#fff

The system was technically working — messages flowed correctly, agents replied, tasks completed
Agent startup is deceptively expensive — most burn came from context loading, not model output. A tiny ping triggered tens of thousands of input tokens
Per-session budgets were useless — each session stayed within limits, but the loop spawned infinite new sessions
Rate limiting didn't help — even throttled, every request still consumed context. A slow infinite loop is still infinite
Daily dashboard checks lagged — by the time usage was checked, the loop had been running all night
Killing the process restarted it — the bridge was managed by launchd. Killing it spawned a new instance. The daemon had to be fully unloaded

Root cause

The cascade came from a three-way interaction:

Network discovery → onboarding retries → bridge with zero deduplication

The secondary agent had unstable connectivity during onboarding. It repeatedly appeared and disappeared from the network. Each rediscovery triggered another welcome event. The bridge forwarded every event. The agent processed each as brand-new.

The fix (3 changes, <50 lines)

graph TD
    A["Before: every message = new session"] --> B["FIX 1: Message deduplication"]
    B --> C["FIX 2: Session spawn protection"]
    C --> D["FIX 3: Real-time token alerts"]

    B --> B1["Hash onboarding payloads<br/>Ignore duplicates in cooldown window"]
    C --> C1["Collapse repeated events<br/>into single active session"]
    D --> D1["Monitor token velocity<br/>Alert on abnormal spikes"]

    style A fill:#1a1a2e,stroke:#e74c3c,color:#fff
    style B fill:#16213e,stroke:#2ecc71,color:#fff
    style C fill:#16213e,stroke:#2ecc71,color:#fff
    style D fill:#16213e,stroke:#2ecc71,color:#fff

Message deduplication — the bridge hashes incoming onboarding payloads and ignores duplicates within a cooldown window
Session spawn protection — repeated onboarding events from the same agent are collapsed into a single session
Real-time token monitoring — if token velocity spikes abnormally, the bridge alerts immediately

Implementation: nats-agent-state-sharing/bridge

The cost — provider comparison

Same bug. Same 5,080 requests. Same 400M input tokens. Only the API provider changed.

Provider	Estimated Cost	% of Monthly Budget
DeepSeek	$22.97	~5% of a $450 plan
Moonshot Kimi	~$392	~39% of a $1,000 plan
Anthropic Claude Sonnet	~$1,245	~125% of a $1,000 plan
OpenAI GPT-5-class	~$2,090	~209% of a $1,000 plan

That's the difference between "well... that was horrifying" and "we need to explain this to accounting."

Provider pricing isn't just about cost optimization — it's insurance against infrastructure bugs that silently burn resources.

Why not just [alternative fix approach]?

Alternative	Problem	Why it fails
Rate limiting alone	Every request still consumes context tokens	A slow infinite loop is still infinite. Rate limiting slows the burn, doesn't stop it
Per-session token budgets	Each session stays within limits	The bug spawns NEW sessions. Budgets per session are irrelevant
Kill the process	launchd restarts it automatically	You have to know to UNLOAD the daemon, not just kill the process
Daily dashboard monitoring	15-hour lag between check and detection	Real-time alerts are the only reliable defense for token-burn bugs
Skip onboarding retries	Legitimate retries are needed for unstable connections	The fix is deduplication, not removing retries entirely

Stats and context

400 million input tokens — equivalent to processing the entire Harry Potter series ~400 times
5,080 API requests in 15 hours — roughly one every 10 seconds
$1,200+ potential cost difference between cheapest and most expensive provider for the identical bug
3 lines of code stopped the loop: a hash check, a cooldown window, and a session collapse
0 visible failures during the entire incident — the scariest part

Quick start

This repo is a postmortem article. There's no software to install — the fix is documented for your own agent systems.

# Read the postmortem
cat README.md

# The fix implementation lives here:
git clone https://github.com/nerudek/nats-agent-state-sharing

What you can take from this:

If you run multi-agent systems, add message deduplication to your bridge TODAY
If you use launchd/systemd for agent daemons, know the difference between kill and unload
If you check token usage daily, add a real-time velocity alert
If you pick an AI provider, factor in "bug insurance" — not just base pricing

Installation

No installation. This is a postmortem.

To implement the fix for your own system, see the bridge deduplication code in the companion repo.

Repository structure

hermes-token-loop-postmortem/
├── README.md              # THIS FILE — the full postmortem
├── SKILL.md               # Agent skill file for reference
├── banner.png             # Cover image
├── cover.png              # Social preview
├── usage-may24-yesterday.jpg  # Dashboard screenshot (262M tokens)
└── usage-may25-today.jpg      # Dashboard screenshot (134M tokens)

Known problems

Problem	Status	Notes
Deduplication adds latency (~5ms per message)	By design	Hash computation is negligible. Acceptable trade-off for loop prevention
Cooldown window can mask legitimate re-onboarding	Mitigated	Window is configurable per agent. Default: 5 minutes
Token velocity alerts can false-positive during batch operations	Open	Tune threshold per workload. PRs welcome

Contributing

Have you survived a similar loop? Open an issue or PR with your story. Multi-agent failure modes are under-documented — the more postmortems we share, the fewer people rediscover the same bugs.

Similar incidents: Open an issue with the incident-report label
Fix improvements: PR to the nats-agent-state-sharing repo
Cost comparison data: PR with additional provider pricing

License

MIT — see LICENSE.

Built by nerudek — May 2026

☕ Support: PayPal.me/nerudek | Dev.to

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

400 Million Tokens Burned Overnight — A Multi-Agent AI Postmortem

The problem

What happened

The numbers

Why nobody noticed for 15 hours

Root cause

The fix (3 changes, <50 lines)

The cost — provider comparison

Why not just [alternative fix approach]?

Stats and context

Quick start

Installation

Repository structure

Known problems

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
README.md		README.md
SKILL.md		SKILL.md
banner.png		banner.png
cover.png		cover.png
usage-may24-yesterday.jpg		usage-may24-yesterday.jpg
usage-may25-today.jpg		usage-may25-today.jpg

Folders and files

Latest commit

History

Repository files navigation

400 Million Tokens Burned Overnight — A Multi-Agent AI Postmortem

The problem

What happened

The numbers

Why nobody noticed for 15 hours

Root cause

The fix (3 changes, <50 lines)

The cost — provider comparison

Why not just [alternative fix approach]?

Stats and context

Quick start

Installation

Repository structure

Known problems

Contributing

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages