The Hidden Assumption That Took Down the Internet: The Cloudflare November 2025 Outage and the Danger of Implicit Invariants

On November 18, 2025, Cloudflare's network began experiencing significant failures to deliver core network traffic after a change to a database system used by its Bot Management feature pipeline.

Hidden Invariants Break Without Warning

The incident was triggered when a change to database permissions caused the query that generates a Bot Management "feature file" to return duplicate feature rows, which increased the size of that file. The proxy software that processes this file enforced a limit on the number of features it could load; when the larger-than-expected file was propagated across the network, that limit was exceeded and the software failed, returning HTTP 5xx errors.
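
To make that failure mode concrete, here is a minimal sketch in Rust, not Cloudflare's actual code: the names are hypothetical and the 200-feature limit is an assumed value for illustration. The point is that a limit enforced only as an internal error, with a caller that treats it as unreachable, converts an oversized configuration file into a crash in the traffic-serving process.

```rust
const MAX_FEATURES: usize = 200; // assumed value, for illustration only

#[derive(Debug)]
struct FeatureLimitExceeded {
    count: usize,
}

fn parse_feature_file(contents: &str) -> Result<Vec<String>, FeatureLimitExceeded> {
    // One feature name per line in this hypothetical file format.
    let features: Vec<String> = contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.trim().to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        // The limit exists, but only as an internal error path...
        return Err(FeatureLimitExceeded { count: features.len() });
    }
    Ok(features)
}

fn load_features(contents: &str) -> Vec<String> {
    // ...and the caller treats "too many features" as unreachable, so an
    // oversized file becomes a panic instead of a handled error.
    parse_feature_file(contents).expect("feature file within limit")
}

fn main() {
    // 300 entries: over the assumed limit, so loading panics and the
    // process serving traffic fails instead of degrading gracefully.
    let oversized: String = (0..300).map(|i| format!("feature_{i}\n")).collect();
    let _ = load_features(&oversized);
}
```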

How the Failure Developed

Cloudflare later explained that a permissions change in a ClickHouse cluster caused a query used by the feature-file generator to start returning additional metadata. This more than doubled the number of features described in the generated file. When files with more than the allowed number of features were distributed, the Bot Management module hit its configured limit and panicked, which in turn caused errors in the core proxy handling customer traffic.
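
The report summarized here does not spell out the exact query or schema, so the following is a hedged sketch in Rust with invented database and feature names. It shows only the arithmetic of the problem: when the same column metadata becomes visible through a second database, a feature list derived one-row-per-result doubles, while a list keyed by feature name stays stable.

```rust
use std::collections::BTreeSet;

/// One row of column metadata as (database, feature_name). Hypothetical shape.
type ColumnRow = (String, String);

fn feature_names(rows: &[ColumnRow]) -> Vec<String> {
    // Naive projection: one feature per returned metadata row.
    rows.iter().map(|(_db, col)| col.clone()).collect()
}

fn feature_names_deduped(rows: &[ColumnRow]) -> Vec<String> {
    // Keying by feature name makes the result independent of how many
    // databases happen to expose the same underlying table.
    let unique: BTreeSet<String> = rows.iter().map(|(_db, col)| col.clone()).collect();
    unique.into_iter().collect()
}

fn main() {
    let columns = ["bot_score", "ja3_hash", "asn"]; // invented feature names

    // Before the permissions change: metadata visible from one database only.
    let before: Vec<ColumnRow> = columns
        .iter()
        .map(|c| ("default".to_string(), c.to_string()))
        .collect();

    // After: the same columns are also visible through a second database,
    // so every feature row appears twice and the derived count doubles.
    let mut after: Vec<ColumnRow> = Vec::new();
    for db in ["default", "shard"] {
        for c in columns {
            after.push((db.to_string(), c.to_string()));
        }
    }

    assert_eq!(feature_names(&before).len(), 3);
    assert_eq!(feature_names(&after).len(), 6);          // count doubled
    assert_eq!(feature_names_deduped(&after).len(), 3);  // count stable
}
```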

The Cascade

Based on Cloudflare's public post-incident report, the sequence of events was:

  1. A change to database permissions altered the behavior of a ClickHouse query used to generate the Bot Management feature file.
  2. The query began returning duplicate feature metadata, increasing the number of features in the generated file.
  3. The resulting feature file exceeded the Bot Management module's configured limit on the number of features it could load.
  4. The larger-than-expected feature file was propagated across the network to machines running the proxy software.
  5. When the proxy loaded the file, the Bot Management module hit the feature limit and triggered a panic, producing HTTP 5xx errors.
  6. While the permissions change was only partially rolled out, the system alternated between good and bad files; once all nodes generated the bad configuration, the failures became constant (a simplified model of this flapping appears after the list).
  7. Cloudflare mitigated the incident by stopping generation of the bad feature file, inserting a known-good file into the distribution queue, and restarting affected services.
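
The flapping in step 6 can be modeled with a few lines of Rust. Everything here is hypothetical (the feature counts, the 200-feature limit, and which generator serves each refresh); the sketch only shows why a partially rolled out permissions change makes the symptom intermittent before it becomes constant.

```rust
const MAX_FEATURES: usize = 200;    // assumed limit, for illustration only
const NORMAL_FEATURES: usize = 150; // hypothetical pre-incident feature count

// Whichever generator node produces a given refresh determines the file size:
// nodes with the new permissions emit duplicate rows, doubling the count.
fn feature_count(generator_updated: bool) -> usize {
    if generator_updated { NORMAL_FEATURES * 2 } else { NORMAL_FEATURES }
}

fn main() {
    // Hypothetical sequence of refresh cycles during the partial rollout:
    // the file alternates between good and bad depending on which node
    // produced it, then stays bad once every node has the new permissions.
    let generator_updated_per_refresh = [false, true, false, true, true, true];

    for (refresh, updated) in generator_updated_per_refresh.iter().enumerate() {
        let count = feature_count(*updated);
        let status = if count <= MAX_FEATURES { "healthy" } else { "failing" };
        println!("refresh {refresh}: {count} features -> {status}");
    }
}
```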

What Might Help Prevent Similar Incidents

Cloudflare's write-up notes that assumptions about database query behavior and feature limits contributed to the outage. Making such limits explicit at the interface between configuration-generation systems and runtime modules, and validating generated artifacts against those limits before network-wide rollout, can reduce the risk that a change in one component will cause unexpected failures in another.
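
One possible shape of such a guard, sketched in Rust under assumptions (the limit value, names, and error types are invented, not Cloudflare's interfaces): the limit is defined once, shared by the generator and the consuming module, and every generated file is checked against it, including a duplicate check, before it is queued for distribution.

```rust
/// Single source of truth for the contract between the generator and the
/// runtime module that consumes the feature file. (Assumed value.)
pub const MAX_FEATURES: usize = 200;

#[derive(Debug)]
pub enum ValidationError {
    TooManyFeatures { found: usize, limit: usize },
    DuplicateFeature(String),
}

/// Gate run by the generator (or the distribution pipeline) before rollout.
pub fn validate_feature_file(features: &[String]) -> Result<(), ValidationError> {
    let mut seen = std::collections::HashSet::new();
    for f in features {
        if !seen.insert(f) {
            return Err(ValidationError::DuplicateFeature(f.clone()));
        }
    }
    if features.len() > MAX_FEATURES {
        return Err(ValidationError::TooManyFeatures {
            found: features.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(())
}

fn main() {
    // A file with every feature duplicated, as described in the incident.
    let bad: Vec<String> = (0..150)
        .flat_map(|i| {
            let name = format!("feature_{i}");
            [name.clone(), name]
        })
        .collect();

    match validate_feature_file(&bad) {
        Ok(()) => println!("safe to distribute"),
        // The bad file is rejected here, before any proxy ever loads it.
        Err(e) => println!("rejected before rollout: {e:?}"),
    }
}
```

Rejecting the artifact at generation time turns what would otherwise become a network-wide incident into a local, observable validation error.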

Broader Context

This is how implicit invariants tend to fail. A limit is added early in a system's life. It works fine in testing. It works fine in staging. It is never formally validated against actual runtime behavior, until the day an upstream change violates it.

Safety-critical fields such as aviation, space, and nuclear operations commonly treat assumptions and limits as explicit contracts and use procedures and automated checks to verify them during design, testing, and operation. Applying similar discipline to large-scale cloud infrastructure is one way to reduce the impact of configuration and integration errors.
