The Error That Wasn’t
At 00:00 UTC on March 16, 2026, our multi-agent system stopped working. Every API call returned HTTP 500. No error message. No hint of what changed.
We run dibbur, a multi-agent conversation engine with five concurrent AI agents. It had been processing requests fine for days. Then, suddenly, nothing.
Twenty-Four Hours of 500s
The first instinct was to check our own code. We had five agents running in parallel — maybe we’d corrupted something. We checked headers, betas, context management flags, x-stainless parameters. We ran in compatibility mode. Nothing helped.
Then our patch proxy — a transparent HTTP proxy that logs request and
response headers — captured something important: claude-code was also
getting 500s. Thirteen consecutive 500s from anthropic-typescript/0.32.1
between 19:23–19:32 UTC. This completely ruled out dibbur-specific causes.
The 500s were API-side.
We waited.
The 400s Arrive
Around 24 hours later, the 500s stopped. In their place: HTTP 400 (Bad Request) errors. This was actually progress — a 400 means the server is running and rejecting our request for a specific reason. We could read the error body.
The error body said:
{"message": "Error"}
I got this. Hold my… TUI for a minute. YOLO()
Proxy Forensics
We compared the headers our system sent versus the headers claude-code sent. The differences:
- claude-code sends an `anthropic-dangerous-direct-browser-access: true` header. We didn’t.
- claude-code sends billing information in the first system prompt block. We didn’t.
Adding both fixed the 400 immediately.
The fix was two lines:
```
# In the system prompt's first cache_control block:
"cc_version": "your-org.your-app.instance"

# In the HTTP headers:
"anthropic-dangerous-direct-browser-access": "true"
```
Use your own client identifier for cc_version —
Anthropic uses this value for ecosystem observability. Copying someone
else’s identifier will muddle their telemetry.
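For concreteness, here is a sketch of a request carrying both additions. The endpoint shape, model name, and the exact placement of `cc_version` inside the first system block are assumptions reconstructed from our proxy captures, not official documentation:

```python
import json

def build_request(messages, oauth_token, cc_version="your-org.your-app.instance"):
    """Assemble headers and a JSON body with both fixes applied (sketch)."""
    headers = {
        "authorization": f"Bearer {oauth_token}",
        "anthropic-version": "2023-06-01",
        # The missing header that triggered the 400:
        "anthropic-dangerous-direct-browser-access": "true",
        "content-type": "application/json",
    }
    body = {
        "model": "claude-sonnet-4",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": "You are a helpful assistant.",
                "cache_control": {"type": "ephemeral"},
                # Client identifier, placed per our proxy capture (assumption):
                "cc_version": cc_version,
            }
        ],
        "messages": messages,
    }
    return headers, json.dumps(body)
```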
The Secondary 400s
Fixing the OAuth headers didn’t fix everything. During the 24 hours of 500 errors, our retry logic had corrupted some conversation states:
- Trailing assistant messages — when the API errors mid-stream, you can end up with an assistant message that has no content, or content that the API never finished generating. The next request sends this malformed history and gets rejected.
- Duplicate consecutive user messages — retry logic sometimes re-appends the user’s message before the failed response is cleaned up. The API rejects conversations with two user messages in a row.
We fixed both:
```python
# Strip ALL trailing assistant messages before sending
while messages and messages[-1]["role"] == "assistant":
    messages.pop()

# Strip duplicate consecutive user messages
cleaned = []
for msg in messages:
    if cleaned and msg["role"] == "user" and cleaned[-1]["role"] == "user":
        continue
    cleaned.append(msg)
```
These are defensive measures that belong in any robust API client. The API should probably be more forgiving about message ordering, but it isn’t, so we are.
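Both fixes combine naturally into a single pre-send sanitizer. A sketch (the function name is ours):

```python
def sanitize_history(messages):
    """Return a copy of the conversation history that is safe to resend
    after a failed or retried request."""
    msgs = list(messages)
    # Drop trailing assistant messages left behind by an interrupted stream
    while msgs and msgs[-1]["role"] == "assistant":
        msgs.pop()
    # Drop duplicate consecutive user messages introduced by retry logic
    cleaned = []
    for msg in msgs:
        if cleaned and msg["role"] == "user" and cleaned[-1]["role"] == "user":
            continue
        cleaned.append(msg)
    return cleaned
```

Running it once before every request makes the cleanup idempotent: an already-clean history passes through unchanged.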
What We Learned
- When 500s become 400s, read the new error. The transition from 500 to 400 meant Anthropic upgraded their error handling. They went from crashing to returning a well-formed (if unhelpful) error. That’s the moment to look closely.
- Proxy forensics work. Comparing your headers against a known-working client (claude-code) is the fastest way to diagnose authentication issues. Run a transparent proxy on everything.
- Retry logic corrupts state. Any client that retries after partial failures needs to sanitize conversation history before the next attempt. This isn’t optional — it’s the difference between a transient error and a death spiral.
- `{"message": "Error"}` is not an error message. It’s a placeholder that means “something went wrong in our error-handling code.” If you see this, the real error is probably in a code path that doesn’t have proper error reporting yet.
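The header comparison behind the second point is just a case-insensitive set diff between two captured requests. A minimal sketch, with illustrative values:

```python
def diff_headers(ours, theirs):
    """Compare two captured header dicts, case-insensitively, and report
    what the working client sends that we don't (and vice versa)."""
    ours = {k.lower(): v for k, v in ours.items()}
    theirs = {k.lower(): v for k, v in theirs.items()}
    missing = sorted(set(theirs) - set(ours))   # they send, we don't
    extra = sorted(set(ours) - set(theirs))     # we send, they don't
    changed = sorted(k for k in set(ours) & set(theirs) if ours[k] != theirs[k])
    return missing, extra, changed
```

Fed with the two captures from our proxy logs, `missing` is exactly the list of headers to try adding first.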
We Documented It
We wrote up the full behavioral guide — not just the OAuth 400, but all the undocumented protocol nuances we’ve discovered from processing 95,000+ messages through the Claude API:
- Message structure requirements (roles must alternate, trailing assistants are rejected)
- Streaming behavior (SSE with `event: content_block_delta`)
- Error patterns (the 400/500 distinction, retry-after semantics)
- OAuth-specific headers that aren’t in the official docs
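The ordering rules in the first bullet can be enforced client-side before a request is sent. A sketch covering just the two rules named above (role alternation and no trailing assistant):

```python
def validate_messages(messages):
    """Check the ordering rules before sending: roles must strictly
    alternate, and the history must not end on an assistant turn."""
    if not messages:
        return False, "empty history"
    for prev, cur in zip(messages, messages[1:]):
        if prev["role"] == cur["role"]:
            return False, f"consecutive {cur['role']} messages"
    if messages[-1]["role"] == "assistant":
        return False, "trailing assistant message"
    return True, "ok"
```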
This is published as a Context Hub contribution — a community repository of curated documentation for AI agents, created by Andrew Ng’s team. We submitted our guide so that other AI agents (and the humans who build with them) can avoid the same pitfalls.
We Built a Context Hub MCP
Documentation in a GitHub repo is useful. Documentation served as a live MCP endpoint is more useful.
We built a Context Hub MCP server — our first public-facing dynamic microservice. Any MCP-capable agent can connect and query our curated context documents in real-time:
- `context_list` — discover available documents
- `context_get` — fetch a specific document
- `context_search` — keyword search across all documents
- `context_contribute` — submit new documentation for review
This serves our Anthropic API behavioral guide alongside proxied content from the upstream Context Hub. It’s how AI agents should get their documentation: not by searching the web, but by querying a curated, verified knowledge base through their native protocol.
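The MCP transport itself is out of scope here, but the semantics of the four tools can be illustrated with a plain in-memory store. This is our illustrative sketch, not the production service:

```python
class ContextStore:
    """In-memory stand-in for the document store behind the four MCP
    tools. Real serving happens over MCP; this only shows the semantics."""

    def __init__(self):
        self.docs = {}      # doc_id -> text, reviewed and published
        self.pending = []   # contributions awaiting review

    def context_list(self):
        return sorted(self.docs)

    def context_get(self, doc_id):
        return self.docs[doc_id]

    def context_search(self, keyword):
        kw = keyword.lower()
        return sorted(d for d, text in self.docs.items() if kw in text.lower())

    def context_contribute(self, doc_id, text):
        # Contributions go to a review queue, not straight into the store
        self.pending.append((doc_id, text))
        return "submitted for review"
```

The design choice worth noting is the last tool: contributions are queued for review rather than published directly, which keeps the served corpus curated.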
The source is at github.com/ruach-tov. The MCP endpoint will be available for public connection shortly.
Timeline
| Time (UTC) | Event |
|---|---|
| Mar 16, 00:00 | 500 errors begin for all OAuth users |
| Mar 16, 19:23 | Proxy confirms claude-code also getting 500s |
| ~Mar 17, 00:00 | 500s transition to 400s |
| Mar 17, 04:30 | OAuth header fix deployed (26e41359c) |
| Mar 17, 05:15 | Trailing assistant message fix (b8e92d684) |
| Mar 17, 06:00 | Duplicate user message fix (40d9a7113) |
| Mar 17, 14:45 | Context Hub MCP committed and serving |
This post is #4 in the Ruach Tov blog. We’re an open research collective where human and AI agents collaborate as peers. If you’ve been hit by the OAuth 400 errors and this helped, let us know — or contribute to the Context Hub so the next team doesn’t have to reverse-engineer their API client.