Why AWS Still Has Multi-Day Global Outages in 2025: The Eternal Gap Between Static Files and Live Cloud Reality

On February 28, 2017, Amazon S3 in the US-EAST-1 region suffered a region-wide service disruption. The root cause was an operational mistake during an internal maintenance procedure that unexpectedly removed servers from critical S3 subsystems.

Static Configs Cannot Track Live Reality

What Went Wrong

An engineer ran a playbook intended to remove a small subset of billing-system servers. Because of a mis-entered parameter, a much larger set of servers — including those managing S3’s index and placement subsystems — was decommissioned.

Result: the index subsystem that resolves object metadata and location became unavailable. S3 could no longer serve GET, LIST, PUT, or DELETE requests, breaking static-file hosting and other object-storage workloads across the region.

Why Implicit Assumptions Failed

The outage exposed an invariant that was deeply embedded but never formalized: “the metadata/index subsystem must always be available for object-storage operations.”

This dependency was never declared, and customers never validated it. Static-file hosting, website asset delivery, backups, object stores: all silently assumed S3’s metadata subsystem would never fail.

The Cascade

When the indexing subsystem went down:

  1. All object-storage requests began failing (GET, PUT, LIST).
  2. Static-file hosting broke across thousands of websites and apps depending on S3.
  3. Status dashboards, ironically dependent on S3 themselves, stayed stale or unreachable for hours.
  4. Downstream systems that relied on S3 availability faced timeouts, failed requests, degraded behavior, or total outage.

What Could Have Prevented It (RCP-Style)

If a safety invariant such as “the S3 metadata subsystem must be healthy before any production-critical operation proceeds” had been declared and enforced, then:

  • An external monitor (or controller) would detect the subsystem removal caused by the maintenance error.
  • The invariant violation would block further deployments, write traffic, and configuration changes targeting S3-backed storage.
  • Clients could fall back to redundant storage or cached assets, avoiding a region-wide outage (a minimal monitor sketch follows this list).
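
To make the idea concrete, here is a minimal sketch of such an external monitor in Python with boto3. The canary bucket and key are hypothetical, and the probe is deliberately simple; it only shows the shape of the guardrail: check the metadata path, then gate production-critical actions on the result.

```python
# Minimal sketch of an external invariant monitor (illustrative only).
# Assumption: a dedicated canary object exists that the monitor may read.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

CANARY_BUCKET = "example-invariant-canary"   # hypothetical bucket name
CANARY_KEY = "healthcheck/canary.txt"        # hypothetical object key


def s3_metadata_path_healthy(region: str = "us-east-1") -> bool:
    """Return True only if S3 can currently resolve object metadata."""
    s3 = boto3.client("s3", region_name=region)
    try:
        # HEAD exercises the index/metadata path without transferring object data.
        s3.head_object(Bucket=CANARY_BUCKET, Key=CANARY_KEY)
        return True
    except (ClientError, BotoCoreError):
        return False


def guard_production_action(action_name: str) -> None:
    """Refuse a production-critical action while the invariant is violated."""
    if not s3_metadata_path_healthy():
        raise RuntimeError(
            f"Invariant violated: S3 metadata path unhealthy; blocking '{action_name}'"
        )


# Usage: call the guard before deployments, config pushes, or cache invalidations.
# guard_production_action("deploy static assets")
```

In practice the monitor would run outside the region and storage it watches, so the check does not share the failure domain it guards, and a violation would also trigger the fallback to redundant storage or cached assets described above.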

The Broader Pattern

This outage illustrates a general class of failures: “implicit infrastructure invariants assumed but never codified.” In object-storage, CDN, cloud-storage, and distributed filesystem usage, many services silently rely on availability and consistency guarantees that are not formalized in contracts or infrastructure code.

The Path Forward

To avoid repeating this class of failure, systems need:

  1. Explicit modeling of all critical infrastructure invariants (storage metadata, indexing, placement, availability, latency).
  2. Automated validation and health checks outside vendor tooling.
  3. Guardrails that block any production action while an invariant is violated, especially for storage, network, and configuration subsystems (a declarative sketch follows this list).
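
As a sketch of items 1–3 together, invariants could be declared as plain data and evaluated by a small loop that runs outside vendor tooling. The invariant names, probe functions, and the blocks_production flag below are assumptions made for illustration, not a finished policy language.

```python
# Sketch: invariants declared as data, validated outside vendor tooling (illustrative only).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Invariant:
    name: str
    check: Callable[[], bool]   # returns True while the invariant holds
    blocks_production: bool     # guardrail: halt production actions on violation


def storage_metadata_reachable() -> bool:
    # Placeholder probe; in practice this would reuse the canary check sketched earlier.
    return True


def asset_latency_within_budget() -> bool:
    # Placeholder probe; e.g. p99 GET latency under an agreed budget.
    return True


INVARIANTS: List[Invariant] = [
    Invariant("storage-metadata-available", storage_metadata_reachable, blocks_production=True),
    Invariant("asset-latency-budget", asset_latency_within_budget, blocks_production=False),
]


def violated_guardrails() -> List[str]:
    """Names of violated invariants that must block production actions."""
    return [i.name for i in INVARIANTS if i.blocks_production and not i.check()]


if __name__ == "__main__":
    broken = violated_guardrails()
    if broken:
        raise SystemExit(f"Blocking production actions; violated invariants: {broken}")
    print("All guardrail invariants hold.")
```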

When you write a Terraform file, you're creating a fossil — a snapshot of what you wanted the world to look like at that moment. But the cloud is alive.
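
One way to narrow that gap is to keep checking the fossil against the live cloud. The sketch below leans on Terraform's documented `-detailed-exitcode` behavior for `terraform plan` (exit code 0 means no changes, 2 means pending changes, 1 means an error); the working directory and what you do when drift is found are assumptions.

```python
# Sketch: detect drift between the declared Terraform configuration and live infrastructure.
import subprocess


def detect_drift(workdir: str = ".") -> bool:
    """Return True if live infrastructure no longer matches the declared configuration."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:
        raise RuntimeError(f"terraform plan failed: {result.stderr}")
    return result.returncode == 2   # 2 means the plan is non-empty, i.e. drift


# Usage: run on a schedule; alert or block deploys when the snapshot and reality diverge.
# if detect_drift("/srv/infra"):
#     print("declared state no longer matches the live cloud")
```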

The technology and design discipline already exist. What’s missing is their adoption across cloud-native infrastructure.

Want to see how RCP solves this?
Email us at bparanj@zepho.com.
