@kala ・ Oct 08, 2025
Anthropic resolves infrastructure bugs affecting Claude AI performance, revises processes to prevent future disruptions across AWS, NVIDIA, and Google platforms.
Anthropic identified three infrastructure bugs affecting the performance of its Claude AI models, related to routing errors, API misconfigurations, and compiler issues across different hardware platforms.
Anthropic has resolved these issues and is revising its processes to prevent future disruptions, including adding detection tests for unexpected character outputs to its deployment process.
The postmortem highlights the complexity of maintaining consistent AI performance across multiple hardware platforms, such as AWS Trainium, NVIDIA GPUs, and Google TPUs.
The infrastructure bugs were not related to heavy load or demand but were instead due to technical misconfigurations and errors.
The overlapping nature of these bugs made diagnosis particularly challenging, affecting a small percentage of requests initially but increasing due to a load balancing change.
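Anthropic has not published the detection tests mentioned above; as a rough sketch of the idea (ranges, names, and thresholds here are assumptions), a deployment gate might scan sampled model outputs for characters outside the scripts expected in an English response:

```python
# Hypothetical deployment-time check (illustrative only): flag outputs that
# contain characters from scripts not expected in an English-language reply.
UNEXPECTED_RANGES = [
    (0x0E00, 0x0E7F),  # Thai
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x0400, 0x04FF),  # Cyrillic
]

def unexpected_chars(text: str) -> list[str]:
    """Return characters that fall in ranges we don't expect in English output."""
    return [ch for ch in text
            if any(lo <= ord(ch) <= hi for lo, hi in UNEXPECTED_RANGES)]

def release_gate(samples: list[str], max_bad_fraction: float = 0.0) -> bool:
    """Pass only if the share of samples with unexpected characters is low enough."""
    bad = sum(1 for s in samples if unexpected_chars(s))
    return bad / max(len(samples), 1) <= max_bad_fraction
```

A gate like this would have caught corruption that injects, say, Thai or CJK characters into an otherwise-English completion, without needing a full quality evaluation on every deploy.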
A context window routing error, introduced on August 5, initially affected a small fraction of Sonnet 4 requests.
Misrouted traffic on Amazon Bedrock peaked on August 12.
Incorrect routing affected a small fraction of requests on Google Cloud's Vertex AI between August 27 and September 16.
The error's impact on Sonnet 4 peaked on August 31, after a load balancing change widened its reach.
The organizations and components involved:
Anthropic conducted the postmortem analysis of the three infrastructure bugs affecting its Claude AI models.
AWS provides the Trainium hardware platform used to deploy Claude models, which requires platform-specific optimizations.
NVIDIA supplies the GPUs that serve as another hardware platform for Claude models, each needing tailored optimizations.
Google offers the TPUs on which Claude models also run, adding to the complexity of maintaining consistent performance.
The XLA:TPU compiler was involved in a miscompilation issue affecting Claude Haiku 3.5, highlighting the role compilers play in infrastructure bugs.
How the incidents unfolded:
A context window routing error was introduced, affecting approximately 0.8% of requests made to Sonnet 4.
A misconfiguration was deployed to the Claude API TPU servers, causing output corruption during token generation; the corruption affected requests to Opus 4.1, Opus 4, and Sonnet 4.
Code deployed to improve token selection inadvertently triggered a latent bug in the XLA:TPU compiler.
A load balancing change increased the share of affected traffic, worsening the routing error's impact; at the worst impacted hour, 16% of Sonnet 4 requests were affected.
The misconfiguration causing output corruption was rolled back, as was the approximate top-k XLA:TPU miscompilation affecting Haiku 3.5; the same rollback was later applied to Opus 3.
The routing logic was fixed, with fixes completed on both Google Cloud's Vertex AI and AWS Bedrock.
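The compiler bug involved approximate top-k, an optimization used when shortlisting candidate tokens before sampling. Anthropic's actual kernels are not public, but the general mechanism it feeds into, top-p (nucleus) sampling, can be sketched as follows (all names and values are illustrative):

```python
import numpy as np

def top_p_filter(logits: np.ndarray, top_p: float = 0.99) -> np.ndarray:
    """Zero out tokens outside the smallest set whose cumulative probability
    reaches top_p, then renormalize. A sketch of nucleus sampling's filter step."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # highest probability first
    cum = np.cumsum(probs[order])
    keep = cum - probs[order] < top_p      # keep tokens until threshold crossed
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[keep]] = True
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()
```

Because the nucleus cutoff depends on an exact ranking of token probabilities, an approximate top-k that returns a slightly wrong shortlist shifts which tokens survive this filter, which is how a compiler-level miscompilation can silently change generated text.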
Anthropic recently conducted a postmortem analysis of three infrastructure bugs that intermittently degraded the performance of its Claude AI models. These issues, unrelated to demand or server load, were traced to routing errors, API misconfigurations, and compiler issues across different hardware platforms. The first bug was a context window routing error that sent requests to the wrong servers, affecting 16% of Sonnet 4 requests at its peak. The second was output corruption caused by a misconfiguration on the Claude API TPU servers, leading to incorrect token generation. The third was an approximate top-k XLA:TPU miscompilation, a latent compiler bug that affected token selection during text generation.
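The postmortem does not spell out the routing error's mechanics, but the failure mode, requests landing on servers configured for a different context window, can be sketched roughly like this (pool names and limits are invented for illustration; Anthropic's routing is not public):

```python
# Hypothetical server pools keyed by the maximum context window they serve.
POOLS = {
    200_000: "standard-context-pool",
    1_000_000: "long-context-pool",
}

def route(request_tokens: int) -> str:
    """Send a request to the smallest pool whose context window fits it."""
    for limit in sorted(POOLS):
        if request_tokens <= limit:
            return POOLS[limit]
    raise ValueError("request exceeds every pool's context window")
```

A correct router keeps short requests on the standard pool; a routing bug of the kind described would instead send some fraction of them to a pool running a different server configuration, degrading those responses without any visible error.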
Anthropic has resolved these issues by fixing the routing logic, rolling back problematic changes, and working with the XLA:TPU team to address the compiler bug. The company is also enhancing its evaluation processes to better detect and prevent similar issues in the future. This includes developing more sensitive evaluations, running quality checks continuously on production systems, and improving debugging tools while maintaining user privacy.
The postmortem also highlights the challenges of maintaining consistent AI performance across multiple hardware platforms, such as AWS Trainium, NVIDIA GPUs, and Google TPUs. Each platform requires specific optimizations, and any infrastructure change must be validated across all configurations. Despite these challenges, Anthropic aims to provide users with consistent quality responses, regardless of the platform serving their requests.
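Validating every change across every configuration amounts to running the same evaluation over a platform matrix and blocking the rollout on any failure. A minimal sketch (the platform names come from the article; the harness and scores are placeholders):

```python
# Sketch of gating one infrastructure change on every hardware configuration.
PLATFORMS = ["aws-trainium", "nvidia-gpu", "google-tpu"]

def run_eval(platform: str) -> float:
    """Placeholder quality score; a real harness would query models deployed
    on the named platform and score their responses."""
    return 1.0  # this sketch assumes every platform passes

def validate_change(min_score: float = 0.99) -> dict[str, bool]:
    """Run the same evaluation on each platform and report pass/fail per config."""
    return {p: run_eval(p) >= min_score for p in PLATFORMS}

results = validate_change()
failed = [p for p, ok in results.items() if not ok]
```

The point of the matrix is that a change which is safe on one backend (say, GPUs) can still regress another (say, TPUs), so no single-platform check is sufficient.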
Subscribe to our weekly newsletter Kala to receive similar updates for free!