@kala ・ Oct 08, 2025
Anthropic resolves infrastructure bugs affecting Claude AI performance, revises processes to prevent future disruptions across AWS, NVIDIA, and Google platforms.
Anthropic identified three infrastructure bugs affecting the performance of its Claude AI models, related to routing errors, API misconfigurations, and compiler issues across different hardware platforms.
Anthropic has resolved these issues and is revising its processes to prevent future disruptions, including adding detection tests for unexpected character outputs to its deployment process.
The postmortem highlights the complexity of maintaining consistent AI performance across multiple hardware platforms, such as AWS Trainium, NVIDIA GPUs, and Google TPUs.
The infrastructure bugs were not related to heavy load or demand but were instead due to technical misconfigurations and errors.
The overlapping nature of these bugs made diagnosis particularly challenging, affecting a small percentage of requests initially but increasing due to a load balancing change.
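Anthropic has not published the detection tests mentioned above; as a rough sketch of the idea (ranges, names, and thresholds here are assumptions), a deployment gate might scan sampled model outputs for characters outside the scripts expected in an English response:

```python
# Hypothetical deployment-time check (illustrative only): flag outputs that
# contain characters from scripts not expected in an English-language reply.
UNEXPECTED_RANGES = [
    (0x0E00, 0x0E7F),  # Thai
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x0400, 0x04FF),  # Cyrillic
]

def unexpected_chars(text: str) -> list[str]:
    """Return characters that fall in ranges we don't expect in English output."""
    return [ch for ch in text
            if any(lo <= ord(ch) <= hi for lo, hi in UNEXPECTED_RANGES)]

def release_gate(samples: list[str], max_bad_fraction: float = 0.0) -> bool:
    """Pass only if the share of samples with unexpected characters is low enough."""
    bad = sum(1 for s in samples if unexpected_chars(s))
    return bad / max(len(samples), 1) <= max_bad_fraction
```

A gate like this would have caught corruption that injects, say, Thai or CJK characters into an otherwise-English completion, without needing a full quality evaluation on every deploy.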
A context window routing error, introduced on August 5, initially affected a small fraction of Sonnet 4 requests.
Misrouted traffic on Amazon Bedrock peaked on August 12.
Incorrect routing affected a small fraction of requests on Google Cloud's Vertex AI between August 27 and September 16.
The error's impact on Sonnet 4 peaked on August 31, after a load balancing change widened its reach.
The organizations and components involved:
Anthropic conducted the postmortem analysis of the three infrastructure bugs affecting its Claude AI models.
AWS provides the Trainium hardware platform used to deploy Claude models, which requires platform-specific optimizations.
NVIDIA supplies the GPUs that serve as another hardware platform for Claude models, each needing tailored optimizations.
Google offers the TPUs on which Claude models also run, adding to the complexity of maintaining consistent performance.
The XLA:TPU compiler was involved in a miscompilation issue affecting Claude Haiku 3.5, highlighting the role compilers play in infrastructure bugs.
How the incidents unfolded:
A context window routing error was introduced, affecting approximately 0.8% of requests made to Sonnet 4.
A misconfiguration was deployed to the Claude API TPU servers, causing output corruption during token generation; the corruption affected requests to Opus 4.1, Opus 4, and Sonnet 4.
Code deployed to improve token selection inadvertently triggered a latent bug in the XLA:TPU compiler.
A load balancing change increased the share of affected traffic, worsening the routing error's impact; at the worst impacted hour, 16% of Sonnet 4 requests were affected.
The misconfiguration causing output corruption was rolled back, as was the approximate top-k XLA:TPU miscompilation affecting Haiku 3.5; the same rollback was later applied to Opus 3.
The routing logic was fixed, with fixes completed on both Google Cloud's Vertex AI and AWS Bedrock.
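The compiler bug involved approximate top-k, an optimization used when shortlisting candidate tokens before sampling. Anthropic's actual kernels are not public, but the general mechanism it feeds into, top-p (nucleus) sampling, can be sketched as follows (all names and values are illustrative):

```python
import numpy as np

def top_p_filter(logits: np.ndarray, top_p: float = 0.99) -> np.ndarray:
    """Zero out tokens outside the smallest set whose cumulative probability
    reaches top_p, then renormalize. A sketch of nucleus sampling's filter step."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # highest probability first
    cum = np.cumsum(probs[order])
    keep = cum - probs[order] < top_p      # keep tokens until threshold crossed
    mask = np.zeros_like(probs, dtype=bool)
    mask[order[keep]] = True
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()
```

Because the nucleus cutoff depends on an exact ranking of token probabilities, an approximate top-k that returns a slightly wrong shortlist shifts which tokens survive this filter, which is how a compiler-level miscompilation can silently change generated text.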
Anthropic recently conducted a postmortem analysis of three infrastructure bugs that intermittently degraded the performance of its Claude AI models. These issues, unrelated to demand or server load, were traced to routing errors, API misconfigurations, and compiler issues across different hardware platforms. The first bug was a context window routing error that sent requests to the wrong servers, affecting 16% of Sonnet 4 requests at its peak. The second was output corruption caused by a misconfiguration on the Claude API TPU servers, leading to incorrect token generation. The third was an approximate top-k XLA:TPU miscompilation, a latent compiler bug that affected token selection during text generation.
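The postmortem does not spell out the routing error's mechanics, but the failure mode, requests landing on servers configured for a different context window, can be sketched roughly like this (pool names and limits are invented for illustration; Anthropic's routing is not public):

```python
# Hypothetical server pools keyed by the maximum context window they serve.
POOLS = {
    200_000: "standard-context-pool",
    1_000_000: "long-context-pool",
}

def route(request_tokens: int) -> str:
    """Send a request to the smallest pool whose context window fits it."""
    for limit in sorted(POOLS):
        if request_tokens <= limit:
            return POOLS[limit]
    raise ValueError("request exceeds every pool's context window")
```

A correct router keeps short requests on the standard pool; a routing bug of the kind described would instead send some fraction of them to a pool running a different server configuration, degrading those responses without any visible error.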
Anthropic has resolved these issues by fixing the routing logic, rolling back problematic changes, and working with the XLA:TPU team to address the compiler bug. The company is also enhancing its evaluation processes to better detect and prevent similar issues in the future. This includes developing more sensitive evaluations, running quality checks continuously on production systems, and improving debugging tools while maintaining user privacy.
The postmortem also highlights the challenges of maintaining consistent AI performance across multiple hardware platforms, such as AWS Trainium, NVIDIA GPUs, and Google TPUs. Each platform requires specific optimizations, and any infrastructure change must be validated across all configurations. Despite these challenges, Anthropic aims to provide users with consistent quality responses, regardless of the platform serving their requests.
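Validating every change across every configuration amounts to running the same evaluation over a platform matrix and blocking the rollout on any failure. A minimal sketch (the platform names come from the article; the harness and scores are placeholders):

```python
# Sketch of gating one infrastructure change on every hardware configuration.
PLATFORMS = ["aws-trainium", "nvidia-gpu", "google-tpu"]

def run_eval(platform: str) -> float:
    """Placeholder quality score; a real harness would query models deployed
    on the named platform and score their responses."""
    return 1.0  # this sketch assumes every platform passes

def validate_change(min_score: float = 0.99) -> dict[str, bool]:
    """Run the same evaluation on each platform and report pass/fail per config."""
    return {p: run_eval(p) >= min_score for p in PLATFORMS}

results = validate_change()
failed = [p for p, ok in results.items() if not ok]
```

The point of the matrix is that a change which is safe on one backend (say, GPUs) can still regress another (say, TPUs), so no single-platform check is sufficient.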
Subscribe to our weekly newsletter Kala to receive similar updates for free!