On February 28, 2017, the Amazon S3 service in the US-EAST-1 region experienced a major service disruption. Amazon's post-incident summary reported that the root cause was an operational error during an internal maintenance procedure: a command removed far more servers than intended from two critical S3 subsystems.
Static Configs Cannot Track Live Reality
According to Amazon, an engineer executed a command intended to take a small set of servers offline in the S3 billing subsystem. Because one of the parameters was entered incorrectly, the command removed a much larger set of servers, including servers supporting the S3 index and placement subsystems.
When the index subsystem was taken offline, S3 could not serve requests that depended on its object metadata. As a result, S3 API operations in the region, including listing buckets and objects and requests to retrieve, store, and delete objects, began to fail.
The failure of the index subsystem revealed a dependency that was not widely visible to customers: S3’s ability to process most API operations depends on the continuous availability of internal metadata and placement systems. When those systems became unavailable, higher-level operations relying on them were disrupted.
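To make that hidden dependency concrete, here is a minimal, hypothetical sketch; the names IndexService and ObjectStore are illustrative and are not S3's actual internals. Every user-facing operation consults a metadata index first, so when the index tier goes away, reads and writes fail even though the servers holding the actual bytes are perfectly healthy.

```python
# Minimal sketch of a hidden metadata dependency (hypothetical names,
# not S3's actual internals): every user-facing operation consults an
# index service first, so losing the index takes down reads and writes
# even though the storage nodes holding the bytes are healthy.

class IndexUnavailable(Exception):
    """Raised when the metadata/index tier cannot be reached."""


class IndexService:
    def __init__(self):
        self.online = True
        self.locations = {}  # object key -> storage node holding the bytes

    def lookup(self, key):
        if not self.online:
            raise IndexUnavailable("index subsystem is offline")
        return self.locations.get(key)

    def record(self, key, node):
        if not self.online:
            raise IndexUnavailable("index subsystem is offline")
        self.locations[key] = node


class ObjectStore:
    def __init__(self, index):
        self.index = index
        self.nodes = {"node-1": {}}  # healthy storage nodes with object bytes

    def put(self, key, data):
        # The index must record where the object lives before the write lands.
        self.index.record(key, "node-1")
        self.nodes["node-1"][key] = data

    def get(self, key):
        # Even a simple read depends on the index to find the bytes.
        node = self.index.lookup(key)
        return self.nodes[node][key]


index = IndexService()
store = ObjectStore(index)
store.put("report.csv", b"...")

index.online = False          # the index tier goes away...
try:
    store.get("report.csv")   # ...and reads fail despite healthy data nodes
except IndexUnavailable as exc:
    print(f"GET failed: {exc}")
```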
Based on Amazon’s public post-incident analysis and status updates, the outage led to:

- Elevated error rates for S3 requests in US-EAST-1 for roughly four hours, until the index and placement subsystems were fully restarted.
- Impact to other AWS services in the region that rely on S3, including the S3 console, new EC2 instance launches, EBS volumes that needed data from S3 snapshots, and AWS Lambda.
- Degraded or unavailable customer websites and applications built on S3 in that region.
- A period during which the AWS Service Health Dashboard could not be updated to reflect the incident, because the dashboard itself depended on S3.
The incident demonstrated that critical internal components, such as S3's index and placement subsystems, are single points of failure for the larger object-storage service. Visibility into these dependencies, together with safeguards during operational procedures, is essential to prevent large-scale impact.
The event highlighted how distributed systems often rely on internal infrastructure behaviors that are not externally documented. When those internal guarantees fail, the impact can spread across services and customers that assume continuous availability.
The same lesson applies to the configurations we write. When you write a Terraform file, you're creating a fossil: a snapshot of what you wanted the world to look like at that moment. But the cloud is alive.
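One way to keep a static declaration honest is to continuously diff it against what the provider actually reports. The sketch below is illustrative only: fetch_live_state() is a hypothetical stand-in for a real cloud API call, and the keys in DECLARED are made up. It shows the shape of a drift check, not any particular tool.

```python
# Hedged sketch of configuration drift detection: the declared state is a
# frozen snapshot of intent, while the live environment is queried at run
# time. fetch_live_state() is a stand-in for whatever provider API a real
# tool would call; it is not an actual library function.

DECLARED = {
    "bucket": "assets-prod",
    "versioning": "Enabled",
    "min_index_nodes": 12,
}

def fetch_live_state():
    # Placeholder: in a real tool this would call the provider's API.
    return {
        "bucket": "assets-prod",
        "versioning": "Suspended",   # someone changed this by hand
        "min_index_nodes": 9,        # capacity quietly shrank
    }

def diff(declared, live):
    """Return the keys whose live value no longer matches the declaration."""
    return {
        key: (declared[key], live.get(key))
        for key in declared
        if live.get(key) != declared[key]
    }

if __name__ == "__main__":
    drift = diff(DECLARED, fetch_live_state())
    for key, (wanted, actual) in drift.items():
        print(f"drift on {key!r}: declared {wanted!r}, live {actual!r}")
```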
Amazon’s post-incident actions emphasized improving safeguards in tooling, adding additional checks to operational commands, and ensuring that critical subsystems cannot be taken offline without appropriate validation. These improvements aim to reduce the risk of similar disruptions.
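Amazon did not publish the code behind these checks, but the kind of guardrail it describes can be sketched as a validation step that runs before any capacity is removed. The names and thresholds below (validate_removal, MAX_REMOVAL_FRACTION, MIN_REMAINING) are hypothetical; they only illustrate the idea of capping how much can be removed in one step and enforcing a capacity floor.

```python
# Hypothetical sketch of an operational guardrail: cap how much capacity a
# single command can remove and refuse to drop a subsystem below a safe
# floor. Names and thresholds are illustrative, not AWS's actual tooling.

MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of a fleet at once
MIN_REMAINING = 10            # never leave fewer than this many servers

def validate_removal(subsystem: str, fleet_size: int, to_remove: int) -> None:
    """Raise ValueError instead of executing an unsafe removal request."""
    if to_remove <= 0:
        raise ValueError("nothing to remove")
    if to_remove > fleet_size * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"refusing to remove {to_remove} of {fleet_size} servers from "
            f"{subsystem}: exceeds the {MAX_REMOVAL_FRACTION:.0%} per-step limit"
        )
    if fleet_size - to_remove < MIN_REMAINING:
        raise ValueError(
            f"refusing: {subsystem} would fall below the "
            f"{MIN_REMAINING}-server floor"
        )

# A fat-fingered argument is rejected instead of silently executed.
try:
    validate_removal("index-subsystem", fleet_size=200, to_remove=150)
except ValueError as err:
    print(err)

validate_removal("billing-subsystem", fleet_size=200, to_remove=5)  # passes
```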