On November 18, 2025, Cloudflare's network began experiencing significant failures to deliver core network traffic after a change to a database system used by its Bot Management feature pipeline.
Hidden Invariants Break Without Warning
The incident was triggered when a change to database permissions caused the query that generates a Bot Management "feature file" to return duplicate feature rows, which increased the size of that file. The proxy software that processes this file enforced a limit on the number of features it could load; when the larger-than-expected file was propagated across the network, that limit was exceeded and the software failed, returning HTTP 5xx errors.
Cloudflare later explained that a permissions change in a ClickHouse cluster caused a query used by the feature-file generator to start returning duplicate metadata rows, which more than doubled the number of features described in the generated file. When files with more than the allowed number of features were distributed, the Bot Management module hit its configured limit and panicked, which in turn caused errors in the core proxy handling customer traffic.
Based on Cloudflare's public post-incident report, the sequence of events was:

1. A permissions change on a ClickHouse cluster altered what metadata the feature-file generator's query could see.
2. The query began returning duplicate rows, more than doubling the number of features in the generated file.
3. The oversized feature file was propagated across the network.
4. The file exceeded the feature limit enforced by the Bot Management module in the proxy, and the module panicked.
5. The panic surfaced as HTTP 5xx errors on customer traffic handled by the core proxy.
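To make the failure mode concrete, here is a minimal sketch of how a hard feature cap can turn an oversized configuration file into a process-level panic. The cap of 200, the one-feature-per-line file format, and all names are illustrative assumptions, not Cloudflare's actual code.

```rust
// Hypothetical sketch of the failure mode: a hard cap on loaded features
// turns an oversized configuration file into a process-level panic.
// The cap of 200, the file format, and all names are illustrative.

const MAX_FEATURES: usize = 200; // assumed hard cap

#[derive(Debug)]
struct FeatureConfig {
    names: Vec<String>,
}

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { got: usize, limit: usize },
}

// Parse one feature name per line, enforcing the cap.
fn load_features(raw: &str) -> Result<FeatureConfig, LoadError> {
    let names: Vec<String> = raw
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.trim().to_string())
        .collect();

    if names.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            got: names.len(),
            limit: MAX_FEATURES,
        });
    }
    Ok(FeatureConfig { names })
}

fn main() {
    // A file with duplicated rows ends up with twice as many entries as the cap allows.
    let duplicated: String = (0..2 * MAX_FEATURES)
        .map(|i| format!("feature_{}\n", i % MAX_FEATURES))
        .collect();

    // The hidden invariant: treating the error case as impossible.
    // unwrap() converts the oversized file into a panic in the request path.
    let config = load_features(&duplicated).unwrap();
    println!("loaded {} features", config.names.len());
}
```

Run as written, this panics at the `unwrap()` call rather than degrading gracefully, which is the shape of the failure described above: the limit itself is reasonable, but treating a violation of it as impossible is the hidden invariant.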
Cloudflare's write-up notes that assumptions about database query behavior and feature limits contributed to the outage. Making such limits explicit at the interface between configuration-generation systems and runtime modules, and validating generated artifacts against those limits before network-wide rollout, can reduce the risk that a change in one component will cause unexpected failures in another.
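One way to make that limit explicit at the interface is a validation step in the generation pipeline that checks the artifact against the consumer's published limit before anything is distributed. The sketch below is an assumed design, not Cloudflare's pipeline; the limit constant, the one-feature-per-line format, and the function names are illustrative.

```rust
// Hypothetical pre-rollout check: validate a generated feature file against
// the runtime module's documented limit before it is distributed.
// The limit, the file format, and all names are illustrative assumptions.

use std::collections::HashSet;

const RUNTIME_MAX_FEATURES: usize = 200; // the consumer's published limit (assumed)

#[derive(Debug, PartialEq)]
enum ValidationError {
    DuplicateFeature(String),
    TooManyFeatures { got: usize, limit: usize },
}

/// Reject an artifact the runtime module could not safely load.
fn validate_feature_file(raw: &str) -> Result<(), ValidationError> {
    let mut seen = HashSet::new();

    for line in raw.lines().filter(|l| !l.trim().is_empty()) {
        let name = line.trim().to_string();
        // Duplicate rows were the proximate cause of the size blow-up,
        // so flag them explicitly rather than only counting.
        if !seen.insert(name.clone()) {
            return Err(ValidationError::DuplicateFeature(name));
        }
    }

    if seen.len() > RUNTIME_MAX_FEATURES {
        return Err(ValidationError::TooManyFeatures {
            got: seen.len(),
            limit: RUNTIME_MAX_FEATURES,
        });
    }
    Ok(())
}

fn main() {
    let good = "feature_a\nfeature_b\n";
    let duplicated = "feature_a\nfeature_a\n";

    assert_eq!(validate_feature_file(good), Ok(()));
    assert_eq!(
        validate_feature_file(duplicated),
        Err(ValidationError::DuplicateFeature("feature_a".to_string()))
    );
    println!("feature file validation behaves as expected");
}
```

A check like this belongs in the producer's pipeline precisely because the producer and the consumer are separate systems: the limit becomes a shared, documented constant at the interface rather than an assumption buried in the consumer's code.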
Hidden limits like this tend to follow a familiar pattern: a hardcoded threshold is added long before anyone comes close to it. It works fine in testing. It works fine in staging. It is never formally validated against actual runtime behavior.
Safety-critical fields such as aviation, space, and nuclear operations commonly treat assumptions and limits as explicit contracts and use procedures and automated checks to verify them during design, testing, and operation. Applying similar discipline to large-scale cloud infrastructure is one way to reduce the impact of configuration and integration errors.
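In code, one modest form of that discipline is treating the limit as a checked contract at the point where new configuration is applied, and failing safe when the contract is violated. The sketch below keeps serving the last-known-good configuration instead of panicking; the fallback policy, names, and limit are assumptions for illustration, not a description of Cloudflare's remediation.

```rust
// Hypothetical runtime contract: check the limit when a new config arrives
// and keep serving the last-known-good config if the new one violates it.
// The limit, the names, and the fallback policy are illustrative assumptions.

const MAX_FEATURES: usize = 200;

#[derive(Clone, Debug)]
struct FeatureConfig {
    names: Vec<String>,
}

struct BotModule {
    active: FeatureConfig, // last-known-good configuration
}

impl BotModule {
    /// Apply a new config only if it satisfies the contract; otherwise keep
    /// the current one and report the violation instead of panicking.
    fn try_reload(&mut self, candidate: FeatureConfig) -> Result<(), String> {
        if candidate.names.len() > MAX_FEATURES {
            return Err(format!(
                "rejected config: {} features exceeds limit of {}; keeping previous config",
                candidate.names.len(),
                MAX_FEATURES
            ));
        }
        self.active = candidate;
        Ok(())
    }
}

fn main() {
    let mut module = BotModule {
        active: FeatureConfig { names: vec!["feature_a".into()] },
    };

    // An oversized candidate (e.g. produced by duplicated query rows)
    // is rejected; traffic keeps being handled with the previous config.
    let oversized = FeatureConfig {
        names: (0..2 * MAX_FEATURES).map(|i| format!("f{i}")).collect(),
    };
    match module.try_reload(oversized) {
        Ok(()) => println!("new config applied"),
        Err(e) => eprintln!("{e}"),
    }
    println!("still serving {} features", module.active.names.len());
}
```

Whether the right response is to fall back, alert, or refuse the update depends on the system; the point is that the decision is made explicitly rather than left to a panic handler.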