Reading the AWS US-EAST-1 December 2021 Outage

Last reviewed on 4 May 2026.

On 7 December 2021, AWS's us-east-1 region experienced a multi-hour outage that affected a wide range of services and the AWS Console itself. AWS later published a detailed summary of what happened. The incident is worth re-reading periodically because the failure mode — automated capacity scaling triggering a feedback loop in the network control plane — illustrates a class of API design problem that recurs in any system at scale. This is a reading of the public summary through that lens.

Sources: AWS's official Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region, published December 2021 (publicly archived at aws.amazon.com/message/12721/). All factual claims about the incident are sourced from this public material; the analysis is editorial.

The shape of what happened

The public summary describes the incident as triggered by an automated activity that scaled up capacity in one of the internal services AWS uses for managing its network. That capacity-scaling event produced a large surge in traffic on the internal network's monitoring and management plane. The monitoring system, faced with the surge, behaved in ways that exacerbated rather than relieved the congestion — what AWS describes as "an unexpected behavior" that prevented the network devices from communicating with each other normally.

The downstream effects propagated. Many AWS services in us-east-1 rely on internal AWS services for their own operations. When the internal network degraded, the services that depended on it degraded with it. The AWS Console and many service APIs experienced elevated error rates and latency. Customers running applications in us-east-1 saw their own services degrade in proportion to how much they depended on the affected substrate.

Recovery was complicated by the same dependency relationships. The tools AWS would normally use to diagnose and recover from a network problem were themselves running on the affected network. The systems customers would normally turn to for information — the AWS Service Health Dashboard, the Personal Health Dashboard — were also affected.

What this illustrates about API design

The control plane and the data plane have different requirements

One of the recurring themes in AWS's own analysis is the distinction between the data plane (the part that handles actual user traffic — running EC2 instances continuing to serve requests) and the control plane (the part that lets customers manage their resources — launching new instances, changing configuration). During the incident, much of the data plane kept working: existing EC2 instances continued to run, existing S3 objects could be read and written. But the control plane suffered: customers couldn't reliably launch new instances, modify load balancers, or trigger automated recovery actions.

The lesson for API design is that these two planes need to be designed with different priorities. The data plane is what most customers care about for steady-state operations; it should be hardened against the widest range of failure modes. The control plane is what customers reach for during incidents; it must not depend on the systems whose failure modes the customer is trying to mitigate.

For your own APIs: the management endpoints (creating accounts, rotating credentials, configuring rate limits) should not share infrastructure with the operational endpoints in a way that lets a failure of the operational endpoints take the management endpoints down too. When customers most need the management endpoints is exactly when the operational endpoints are likely to be having problems.
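
As a concrete illustration, here is a minimal sketch of that separation in Python, using only the standard library. Everything here is illustrative, not a prescribed implementation: in a real deployment the two surfaces would sit on separate hosts, DNS names, and deployment pipelines, for which the separate ports and processes below are only a stand-in.

```python
# A minimal sketch of keeping the management surface independent of the
# operational surface. All handler names, ports, and payloads are
# illustrative assumptions, not a real API.
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class OperationalHandler(BaseHTTPRequestHandler):
    """The data-plane surface: what customers call in steady state."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"status": "serving"}')

class ManagementHandler(BaseHTTPRequestHandler):
    """The control-plane surface: key rotation, limits, account admin.
    It shares no process, storage, or dependency with the handler above,
    so an operational outage leaves it reachable."""
    def do_POST(self):
        self.send_response(202)
        self.end_headers()
        self.wfile.write(b'{"accepted": true}')

def serve(handler, port):
    HTTPServer(("127.0.0.1", port), handler).serve_forever()

if __name__ == "__main__":
    # In production: separate hosts and pipelines, not just separate ports.
    threading.Thread(target=serve, args=(OperationalHandler, 8080),
                     daemon=True).start()
    serve(ManagementHandler, 9090)
```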

Status pages should not depend on the system whose status they report

The detail that the AWS Service Health Dashboard itself was affected is the single most-discussed element of this incident. It's a recurring failure mode: the page that exists to tell customers what's broken cannot be served because the thing that's broken is what the page runs on.

The principle: the status page must run on infrastructure that is independent of the systems whose status it reports. Different region, different cloud, different domain, different DNS, different CDN, different monitoring source. If the status page goes down at the same time as the API, you don't have a status page — you have another endpoint that fails alongside the rest.

For your own APIs, the implication is that even small projects should host their status page somewhere outside their main hosting environment. The cost of doing this is small (a static site on a different provider). The cost of getting it wrong is your customers loudly speculating in public about whether you exist at all.
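
One way to make this concrete: a small out-of-band probe that runs on entirely separate infrastructure, checks the API from the outside, and publishes a static status artifact to a different provider. A minimal sketch, assuming a hypothetical health endpoint (api.example.com is a placeholder):

```python
# An out-of-band status probe. It should run on infrastructure wholly
# separate from the API it reports on; the URL and output path below
# are illustrative.
import json, time, urllib.request

API_URL = "https://api.example.com/healthz"  # the system being reported on

def probe(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"status": "up" if resp.status == 200 else "degraded",
                    "code": resp.status}
    except Exception as exc:  # DNS failure, timeout, refused connection
        return {"status": "down", "error": type(exc).__name__}

if __name__ == "__main__":
    result = probe(API_URL)
    result["checked_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    # Publish as a static file to a *different* provider (object storage,
    # a static host, etc.); a static artifact has nothing left to fail.
    with open("status.json", "w") as f:
        json.dump(result, f)
```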

Automated systems can amplify their own failures

AWS's summary notes that the unexpected behavior of the monitoring system contributed to the duration of the incident. This is a common pattern: an automated system designed to mitigate problems acts on incomplete or incorrect information during an incident, and its actions make the incident worse rather than better.

The lesson is not "don't automate." It's that automated remediation needs an off switch, and the off switch must work even when the rest of the system is broken. The classic example is auto-scaling: an auto-scaler that adds instances in response to elevated latency is appropriate when latency is elevated because of legitimate load; it's pathological when latency is elevated because of an upstream dependency failure that adding more instances can't address. The auto-scaler can't tell the difference; the operator on call can.

For API design, the same principle applies to anything the API does automatically: rate-limit auto-tuning, auto-scaling of customer quotas, automatic key rotation. Each of these should have a manual override that is reachable even when the systems they manage are degraded.
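
A minimal sketch of what such an override can look like, assuming a hypothetical auto-scaler. The local kill-switch file is the key design choice: reading a file on local disk works even when config services, databases, and the network control plane do not.

```python
# An overridable remediation loop. Names and thresholds are illustrative.
import os, time

KILL_SWITCH = "/etc/autoscaler/disabled"  # touch this file to halt automation

def automation_enabled() -> bool:
    # A local file check does not depend on the systems being managed,
    # so the off switch works precisely when everything else is broken.
    return not os.path.exists(KILL_SWITCH)

def scale_up(count: int) -> None:
    print(f"adding {count} instances")  # stand-in for the real action

def remediation_loop(read_latency_ms):
    while True:
        if not automation_enabled():
            print("automation disabled by operator; standing by")
        elif read_latency_ms() > 500:
            # The scaler cannot tell load-induced latency from an upstream
            # failure; the operator can, and the file above is how they say so.
            scale_up(2)
        time.sleep(30)
```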

Cross-region dependencies show up only during incidents

Many customers discovered during the December 2021 incident that systems they had architected as "multi-region" still had latent dependencies on us-east-1. Some AWS services are themselves anchored in us-east-1 — the IAM control plane historically was, and Route 53's API runs there — so even applications running entirely in other regions felt the impact when their administrative or DNS operations needed to talk to us-east-1.

The lesson for any API consumer is to actually test the failure scenario. "Multi-region" is an architectural property; it has to be verified by running with one region unavailable, not by drawing it on a whiteboard. The latent dependencies show up only when one region is actually down — and at that point, they are very expensive to discover.
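
A minimal sketch of that verification, with a hypothetical app and per-region clients standing in for a real service. The point is that the outage is simulated in a test rather than assumed away on the whiteboard.

```python
# Verifying "multi-region" by construction rather than by diagram.
# RegionClient and App are hypothetical stand-ins for your own service
# and its per-region dependencies.
import unittest
from unittest import mock

class RegionClient:
    def __init__(self, region): self.region = region
    def fetch(self, key): return f"{key}@{self.region}"

class App:
    """Reads from the primary region, falls back to the secondary."""
    def __init__(self):
        self.primary = RegionClient("us-east-1")
        self.secondary = RegionClient("us-west-2")
    def read(self, key):
        try:
            return self.primary.fetch(key)
        except ConnectionError:
            return self.secondary.fetch(key)

class FailoverTest(unittest.TestCase):
    def test_survives_primary_region_outage(self):
        app = App()
        # Simulate us-east-1 being unreachable; reads must still succeed.
        with mock.patch.object(app.primary, "fetch",
                               side_effect=ConnectionError("us-east-1 down")):
            self.assertEqual(app.read("k"), "k@us-west-2")

if __name__ == "__main__":
    unittest.main()
```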

For API providers: be explicit about which regions are dependencies, and which operations require them. AWS now documents which services have global control planes; this documentation matters. The customer's architecture is built on assumptions about your dependency graph, and if those assumptions are wrong, the failure mode is correlated outages they thought they had insulated themselves from.

"Eventual consistency" is harder when the basis-of-truth is degraded

Several of the cascading effects during the incident came from services that normally operate eventually consistently against an underlying data store, but that depend on the store being reachable for the eventual reconciliation to happen. When the network problems cut services off from their state, the "eventual" became "indefinite."

For API design, this is a reminder that "eventually consistent" is a property of normal operation, not a guarantee about all possible failure modes. APIs that promise eventual consistency under normal conditions should be explicit about what happens during partition: do reads return stale data, do writes get rejected, do clients see degraded behaviour, or does the API fail entirely? The failure mode is part of the contract whether you document it or not; documenting it is better.
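
As one way of making that contract explicit, here is a minimal sketch (with a hypothetical store interface) in which reads during a partition are served from the last-known value and labeled stale, while writes fail fast rather than promising a reconciliation that may never come.

```python
# Making partition behavior part of the contract. The store interface
# (get/put raising ConnectionError on partition) is an assumption.
import time

class PartitionAwareView:
    def __init__(self, store):
        self.store = store      # the authoritative data store
        self.cache = {}         # last values successfully read
        self.last_sync = {}     # when each key was last confirmed

    def read(self, key):
        try:
            value = self.store.get(key)
            self.cache[key] = value
            self.last_sync[key] = time.time()
            return {"value": value, "stale": False}
        except ConnectionError:
            if key in self.cache:
                age = time.time() - self.last_sync[key]
                # Documented behavior: stale reads are served and labeled.
                return {"value": self.cache[key], "stale": True, "age_s": age}
            raise  # no basis for an answer at all

    def write(self, key, value):
        # Documented behavior: writes fail fast during partition instead
        # of being silently queued for an indefinite "eventual".
        self.store.put(key, value)
```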

What customers learned (or should have)

The incident produced a wave of "lessons learned" pieces from major AWS customers. Three themes recurred:

  • Cross-cloud or cross-region failover is harder than the architecture diagram suggests. Most teams running "multi-cloud" or "multi-region" architectures discovered specific dependencies on the failed region that they hadn't accounted for.
  • Status pages need to be hosted independently. Many companies whose own APIs depend on AWS discovered their status pages were also down, leaving customers without authoritative information.
  • Automated retry behavior amplified the problem. Systems that retried aggressively against the degraded AWS APIs both extended their own outage duration (because the retries kept failing) and contributed to the load on the recovering services.

What it means for your own API design

If you're building an API, three changes are worth making in the wake of reading this incident:

  • Separate the management surface from the operational surface. Different infrastructure, different deployment pipeline. When the operational surface has problems, customers reach for the management surface to mitigate; if both are down together, you've removed their ability to help themselves.
  • Host your status page completely separately from your API. Different DNS, different hosting provider, different monitoring source. The cost is tiny; the value during the worst incident of the year is enormous.
  • Document your failure modes explicitly. What happens during partition? What happens when an upstream you depend on is degraded? Customers will encounter these eventually; either they learn from your documentation in advance or from your incident in the moment.

What it means for your API integrations

If you're consuming APIs, the recurring lessons:

  • Build the off switch first. For any API you call, you need a circuit breaker that lets you stop calling that API when it's clearly down. The pattern is covered in the integration guide; the December 2021 incident is a concrete example of why it matters.
  • Test your dependency graph by removing dependencies. "We're multi-cloud" needs verification. Pick a region or a provider and turn it off in a controlled test; see what breaks. The dependencies that show up are the ones to redesign.
  • Watch your retry behavior under sustained failure. Aggressive retries make recovery slower for everyone. The patterns from rate limiting and the integration guide both apply: exponential backoff, circuit breakers, and a global retry budget; a minimal sketch of all three follows this list.
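
A minimal sketch combining the three patterns named above: exponential backoff with full jitter, a circuit breaker, and a global retry budget. The thresholds and cooldowns are illustrative, not recommendations.

```python
# Backoff + circuit breaker + retry budget, sketched together.
import random, time

class RetryBudget:
    """Caps total retries across all calls, so sustained failure cannot
    turn into a retry storm against a recovering upstream."""
    def __init__(self, tokens=100):
        self.tokens = tokens
    def allow(self):
        if self.tokens <= 0:
            return False
        self.tokens -= 1
        return True

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown_s=30):
        self.failures, self.threshold = 0, threshold
        self.cooldown_s, self.opened_at = cooldown_s, None
    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a trial call once the cooldown has elapsed.
        return time.time() - self.opened_at > self.cooldown_s
    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()

def call_with_backoff(fn, breaker, budget, attempts=4, base_s=0.5):
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open: not calling a known-down API")
        try:
            result = fn()
            breaker.record(ok=True)
            return result
        except ConnectionError:
            breaker.record(ok=False)
            if not budget.allow():
                raise RuntimeError("retry budget exhausted")
            # Full jitter spreads retries out instead of synchronizing them.
            time.sleep(random.uniform(0, base_s * (2 ** attempt)))
    raise RuntimeError("exhausted attempts")
```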

The broader pattern

Most large incidents share a structure: a triggering event (often automated, often routine) interacts with a feedback loop in the system in a way that wasn't anticipated; the systems designed to detect and recover from problems are themselves caught up in the failure; the recovery takes longer than it should because the recovery tools depend on the systems being recovered.

The structural defenses are similar across cases:

  • Separate the diagnostic and recovery infrastructure from the production infrastructure.
  • Make automated remediation overridable.
  • Be explicit about your dependency graph, including the implicit ones.
  • Build in circuit breakers everywhere, both within your system and at every upstream boundary.
  • Test the failure modes you design for; the ones you don't test, you don't actually have.

None of these are novel ideas. Their value lies in being applied consistently before the incident that requires them, not after.

Where to go next

For the integration patterns that protect you against upstream API failures, see the integration guide. For the rate-limiting algorithms that govern how aggressively your client should retry, see API Rate Limiting Strategies. For the broader picture on API security and operational resilience, see API security.