Why AWS Still Has Multi-Day Global Outages in 2025: The Eternal Gap Between Static Files and Live Cloud Reality

On February 28, 2017, the Amazon S3 service in the US-EAST-1 region experienced a major service disruption. Amazon’s post-incident summary reported that the root cause was an operational error during an internal maintenance procedure, which removed a larger set of servers than intended from two critical S3 subsystems.

Static Configs Cannot Track Live Reality

When you write a Terraform file, you're creating a fossil: a snapshot of what you wanted the world to look like at that moment. But the cloud is alive.
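
To make that gap concrete, here is a minimal drift-check sketch in Python: it compares a declared list of buckets (the static file) against what S3 actually reports right now. The expected_buckets.json file and its format are hypothetical assumptions for this sketch; only the boto3 list_buckets call is standard.

    import json

    import boto3

    def load_declared_buckets(path: str) -> set[str]:
        """Read the bucket names we *think* should exist (the 'fossil')."""
        with open(path) as f:
            return set(json.load(f)["buckets"])

    def live_bucket_names() -> set[str]:
        """Ask S3 what actually exists right now (the live reality)."""
        s3 = boto3.client("s3")
        return {b["Name"] for b in s3.list_buckets()["Buckets"]}

    def report_drift(declared_path: str) -> None:
        declared = load_declared_buckets(declared_path)
        live = live_bucket_names()
        print("missing (declared but not live):", sorted(declared - live))
        print("unmanaged (live but not declared):", sorted(live - declared))

    if __name__ == "__main__":
        # "expected_buckets.json" is a hypothetical config file for this sketch.
        report_drift("expected_buckets.json")
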
What Went Wrong

According to Amazon, an engineer executed a command intended to take a small set of servers offline in the S3 billing subsystem. Because one of the parameters was entered incorrectly, the command removed a significantly larger number of servers, including servers supporting the S3 index subsystem and placement subsystem.

When the index subsystem was taken offline, S3 could not manage or serve requests that required access to its metadata. As a result, many S3 operations in the region—including listing buckets and objects, and in some cases object retrieval and storage requests—began to fail.

Why Dependencies Failed

The failure of the index subsystem revealed a dependency that was not widely visible to customers: S3’s ability to process most API operations depends on the continuous availability of internal metadata and placement systems. When those systems became unavailable, higher-level operations relying on them were disrupted.

The Cascade

Based on Amazon’s public post-incident analysis, the outage led to:

  1. Failures in S3 operations requiring metadata access, such as LIST and some GET and PUT requests.
  2. Disruptions for applications and websites that relied on S3 for object retrieval, storage, or static-file hosting.
  3. Failures in AWS services that internally depend on S3, including the AWS status dashboard, which could not update because it stored content in S3.
  4. Downstream service degradation for systems relying on S3 as a source of configuration files, assets, logs, or state (a fallback pattern for this case is sketched after this list).
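
The fourth point hints at the corresponding mitigation. Below is a minimal sketch, assuming a hypothetical bucket name, key, and local cache path, of loading configuration from S3 while degrading to the last cached copy whenever the S3 request fails; only the boto3 and botocore calls are standard.

    import json
    from pathlib import Path

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    CACHE_PATH = Path("/var/cache/my-app/config.json")  # hypothetical local cache

    def load_config(bucket: str = "my-app-config", key: str = "config.json") -> dict:
        """Prefer the live copy in S3, but degrade to the cached copy if S3 fails."""
        s3 = boto3.client("s3")
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
            CACHE_PATH.write_bytes(body)  # refresh the cache on success
            return json.loads(body)
        except (ClientError, BotoCoreError):
            # S3 is unreachable or erroring: fall back to the last known-good copy.
            return json.loads(CACHE_PATH.read_text())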

A Preventive Perspective

The incident demonstrated that critical internal components, such as S3’s index and placement subsystems, are single points of failure for the larger object-storage service. Visibility into these dependencies, together with safeguards during operational procedures, is essential to prevent large-scale impact.
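
One way to gain that visibility is to model the dependency graph explicitly and ask how much of the system a single internal component can take down. The sketch below uses made-up component names and walks reverse dependencies to estimate each component's blast radius.

    from collections import deque

    # Hypothetical dependency map: each service lists the components it needs.
    DEPENDS_ON = {
        "s3-api": ["index-subsystem", "placement-subsystem"],
        "status-dashboard": ["s3-api"],
        "customer-website": ["s3-api"],
        "log-pipeline": ["s3-api"],
    }

    def blast_radius(component: str) -> set[str]:
        """Everything that directly or transitively depends on `component`."""
        impacted, queue = set(), deque([component])
        while queue:
            current = queue.popleft()
            for service, deps in DEPENDS_ON.items():
                if current in deps and service not in impacted:
                    impacted.add(service)
                    queue.append(service)
        return impacted

    for comp in ("index-subsystem", "placement-subsystem"):
        print(comp, "->", sorted(blast_radius(comp)))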

The Broader Pattern

The event highlighted how distributed systems often rely on internal infrastructure behaviors that are not externally documented. When those internal guarantees fail, the impact can spread across services and customers that assume continuous availability.

Strengthening Safety

Amazon’s post-incident actions emphasized improving safeguards in tooling, adding additional checks to operational commands, and ensuring that critical subsystems cannot be taken offline without appropriate validation. These improvements aim to reduce the risk of similar disruptions.
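
As an illustration of that kind of check, and not a depiction of Amazon's actual tooling, the sketch below rejects a capacity-removal request that would drop a fleet below a minimum size or remove too many servers in one command; every name and threshold in it is invented.

    class RemovalRejected(Exception):
        """Raised when a capacity-removal request fails a safety check."""

    def validate_removal(fleet_size: int, to_remove: int,
                         min_capacity: int = 50, max_batch: int = 5) -> None:
        """Guardrails for an operational 'take servers offline' command."""
        if to_remove <= 0:
            raise RemovalRejected("Nothing to remove; check the parameter.")
        if to_remove > max_batch:
            raise RemovalRejected(
                f"Refusing to remove {to_remove} servers at once (limit {max_batch})."
            )
        if fleet_size - to_remove < min_capacity:
            raise RemovalRejected(
                f"Removal would leave {fleet_size - to_remove} servers, "
                f"below the minimum of {min_capacity}."
            )

    # Example: a mistyped parameter asking for far too many servers is rejected
    # before any capacity is actually taken offline.
    try:
        validate_removal(fleet_size=200, to_remove=120)
    except RemovalRejected as err:
        print("blocked:", err)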
