@devopslinks ・ Nov 23, 2025

A database permissions change led to a Cloudflare outage by creating an oversized feature file, causing network failures initially mistaken for a DDoS attack.
The Cloudflare outage was caused by a permissions change in a database system.
The network failures were initially misinterpreted as a DDoS attack.
An oversized feature file exceeded the Bot Management module's feature limit, causing network failures.
The root cause was identified and resolved by reverting to an earlier version of the feature file.
Cloudflare restored services by deploying the correct configuration file globally.
Cloudflare suffered a major outage on November 18, 2025, when a permissions update in its ClickHouse database changed how metadata queries behaved. The queries began returning duplicate rows, which flowed into the Bot Management feature file and caused it to suddenly double in size. The oversized file exceeded the Bot Management module's 200-feature limit, triggering system panics inside Cloudflare's core proxy. At first, the symptoms resembled a massive DDoS attack, but the real cause was this malformed configuration file.
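The chain from broader metadata visibility to a proxy crash can be illustrated with a small sketch. Everything below (table and column names, the illustrative feature counts, the exact shape of the limit check) is an assumption for illustration only; the 200-feature cap is the one figure taken from the report.

```python
# Minimal sketch (not Cloudflare's actual code): how duplicate metadata rows
# can inflate a generated feature file past a hard consumer-side limit.

FEATURE_LIMIT = 200   # hard cap in the consuming module, per the incident report
N_FEATURES = 120      # illustrative count of real features

# Simulated system.columns metadata. After the permissions change, the same
# table is visible through an additional underlying database ("r0" is an
# illustrative name), so an unscoped query sees every column twice.
system_columns = [
    {"database": db, "table": "http_features", "column": f"feature_{i:03d}"}
    for db in ("default", "r0")
    for i in range(N_FEATURES)
]

def build_feature_file(rows, database=None):
    """Generator side: collect feature columns for the target table.
    Without a database filter, duplicate rows are silently included."""
    return [
        r["column"]
        for r in rows
        if r["table"] == "http_features"
        and (database is None or r["database"] == database)
    ]

def load_feature_file(features):
    """Consumer side: enforce the preallocated limit. An unhandled error here
    is roughly how an oversized file turns into a crash rather than a
    graceful rejection."""
    if len(features) > FEATURE_LIMIT:
        raise RuntimeError(
            f"{len(features)} features exceeds the limit of {FEATURE_LIMIT}"
        )
    return features

print(len(load_feature_file(build_feature_file(system_columns, database="default"))))  # 120
load_feature_file(build_feature_file(system_columns))  # 240 entries -> RuntimeError
```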
Once the issue was correctly identified, Cloudflare stopped the generation and rollout of new feature files, manually inserted a known-good version, and restarted affected proxy components. By 14:30 UTC the network was stabilizing, and by 17:06 UTC all services had recovered.
This was no isolated glitch. The outage affected Cloudflare’s core CDN and security layers, Workers KV, Access, Turnstile, and even blocked many users from logging into the Cloudflare Dashboard. The root problem was a change in ClickHouse’s query behavior, which surfaced more metadata than expected, pushed the feature file past its size limit, and caused widespread HTTP 5xx errors.
Cloudflare has already begun work to harden these systems and avoid similar failures in the future. As one of the most significant outages since 2019, this incident highlights how even small changes in internal systems can ripple across a global platform.
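One common hardening for this failure mode is to treat internally generated configuration with the same suspicion as user input: validate each newly generated feature file and keep serving from the previous known-good copy when validation fails, rather than aborting. The sketch below is a minimal illustration of that pattern; the names, file format, and validation rules are assumptions, not Cloudflare's implementation.

```python
# Minimal sketch (assumed names and format, not Cloudflare's implementation):
# validate a newly generated feature file like untrusted input and keep the
# last-known-good copy when validation fails, instead of crashing.

import json
import logging

FEATURE_LIMIT = 200
log = logging.getLogger("feature-loader")

def validate(feature_file: dict) -> None:
    """Reject files that are structurally wrong or exceed hard limits."""
    features = feature_file.get("features")
    if not isinstance(features, list) or not features:
        raise ValueError("feature list missing or empty")
    if len(features) > FEATURE_LIMIT:
        raise ValueError(f"{len(features)} features exceeds the limit of {FEATURE_LIMIT}")
    if len(set(features)) != len(features):
        raise ValueError("duplicate feature entries detected")

class FeatureConfig:
    """Holds the active feature set; a bad update never replaces a good one."""

    def __init__(self, initial: dict):
        validate(initial)
        self.active = initial

    def try_update(self, candidate_json: str) -> bool:
        try:
            candidate = json.loads(candidate_json)
            validate(candidate)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            # Keep serving with the last-known-good config instead of panicking.
            log.error("rejected new feature file: %s", exc)
            return False
        self.active = candidate
        return True
```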
Key figures referenced in the report include the hard limit on the feature file that was exceeded (200 features), the number of features in use before the incident, how frequently the feature file was regenerated, and the number of minutes from the start of impact until core traffic was largely flowing again.
The timeline unfolded as follows. Normal operations were ongoing when a database access-control change was deployed.
The deployment reached customer environments, and the first errors were observed on customer HTTP traffic.
The team saw rising traffic and errors in Workers KV, which at first looked like degraded KV performance affecting other Cloudflare services. They tried traffic adjustments and account limits to stabilize it. Automated alerts fired at 11:31, manual investigation began at 11:32, and the incident call opened at 11:35.
During the investigation, internal bypasses for Workers KV and Cloudflare Access were enabled so that those systems fell back to a prior version of the core proxy. The issue was present in earlier proxy versions as well, but its impact there was smaller.
The Bot Management configuration file was identified as the trigger for the incident, and work focused on rolling it back to a last-known-good version.
The Bot Management module was confirmed as the source of the 500 errors, traced to a bad configuration file, and automatic deployment of new Bot Management configuration files was stopped (a sketch of this rollback pattern follows the timeline).
Successful recovery was observed using the old version of the configuration file, and focus shifted to accelerating the fix globally.
A correct Bot Management configuration file was deployed globally, and most services started operating correctly.
All downstream services were restarted, and operations were fully restored.
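The recovery steps above, stopping automatic propagation, pinning a last-known-good file, then restarting consumers against it, follow a common rollout-control pattern. The sketch below illustrates that pattern with hypothetical names; it is not Cloudflare's tooling.

```python
# Minimal sketch (hypothetical names, not Cloudflare's tooling): a rollout
# controller with a kill switch for automatic config propagation and a way
# to pin a last-known-good version for global deployment.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RolloutController:
    versions: dict = field(default_factory=dict)   # version id -> file contents
    active_version: Optional[str] = None
    auto_deploy_enabled: bool = True               # the "kill switch"

    def publish(self, version_id: str, contents: str) -> None:
        """Called by the generation pipeline each time a new file is built."""
        self.versions[version_id] = contents
        if self.auto_deploy_enabled:
            self.active_version = version_id       # normal automatic rollout

    def halt_auto_deploy(self) -> None:
        """Incident response step 1: stop propagating newly generated files."""
        self.auto_deploy_enabled = False

    def pin_known_good(self, version_id: str) -> str:
        """Incident response step 2: force a specific known-good version."""
        self.active_version = version_id
        return self.versions[version_id]

# Illustrative flow mirroring the timeline:
ctl = RolloutController()
ctl.publish("v1-good", "feature file v1")
ctl.publish("v2-oversized", "feature file v2")  # the bad file goes out automatically
ctl.halt_auto_deploy()                          # stop deployment of new files
good_file = ctl.pin_known_good("v1-good")       # roll back to a known-good version
# Downstream proxies would then be restarted against good_file.
```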
Subscribe to our weekly newsletter DevOpsLinks to receive similar updates for free!
Join other developers and claim your FAUN.dev() account now!
FAUN.dev() is a developer-first platform built with a simple goal: help engineers stay sharp without wasting their time.
