The shape of what happened
Cloudflare's published account describes the trigger as a power failure at a Flexential data center in Hillsboro, Oregon, which Cloudflare relied on for its core control-plane services. The data center had multiple redundant power feeds; the failure cascaded through several of them in sequence, eventually taking the entire facility offline.
The data plane — the network of edge servers that actually serve customer traffic — was not directly affected. Customer websites and APIs running through Cloudflare continued to receive and respond to requests for the duration of the incident. What broke was the ability for customers to manage their configurations: the dashboard, the API for changing settings, the analytics that show what's happening on the network.
The recovery took approximately two days. Some services came back within hours; others required substantial reconstruction because the affected data center held the primary copy of certain databases and the failover process had not been fully exercised. The CEO's postmortem was unusually direct about which decisions in the design and operation of the system had contributed to the duration: notably, that despite years of plans to make the control plane redundant across multiple data centers, the work had not been prioritized to completion.
What this illustrates about API design
The data plane vs control plane distinction, again
The same lesson surfaces here as in the AWS postmortem reading: the customer-facing data plane (what handles user traffic) and the control plane (what lets customers manage their configuration) have fundamentally different requirements and should be designed with different priorities. Cloudflare's data plane is multi-region by design — that's the whole product. The control plane was, until this incident, anchored to a single facility.
This isn't unique to Cloudflare; it's a recurring pattern in cloud and SaaS infrastructure. The data plane gets the multi-region treatment because customers experience its failures directly. The control plane gets less attention because customers don't rely on it during steady-state operations — until the moment they do: when the data plane has a problem and they need to change configuration to mitigate it.
The implication for any API provider: management endpoints should be at least as redundant as operational endpoints. Customers reach for the management endpoints precisely when the operational endpoints are degraded, and a shared-infrastructure failure is exactly the kind of event that takes both down at once.
Multiple redundant inputs aren't redundant if they share a failure mode
The Hillsboro data center had multiple power feeds. They were redundant against the failure of any one feed. They were not redundant against the cascading failure that took out multiple feeds in sequence — because the failures shared a common upstream cause (in this case, a problem originating with the local utility's supply, as the postmortem describes).
The general principle: redundancy protects against independent failures. Two of the same kind of system, even in different physical locations, share failure modes (same hardware vendor, same software version, same operational team, same monitoring system). True independence is hard to achieve and easy to assume incorrectly.
For API design, the implication is that the architecture diagram showing two regions is necessary but not sufficient. The follow-up question is: what failure modes do those two regions share? If both regions run the same control-plane software, an upgrade bug in that software takes both down. If both regions are managed by the same operational tooling, a misconfiguration affects both. The redundancy is real only against the failure modes that genuinely vary between the redundant copies.
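One way to make that follow-up question concrete is to inventory what each redundant copy actually depends on and look at the overlap. The sketch below is illustrative only; the region names and dependency labels are hypothetical stand-ins for whatever your own architecture uses.

```python
# Hypothetical audit: list what each "redundant" region actually depends on,
# then intersect. Anything in the intersection is a shared failure mode that
# the redundancy does not protect against.
region_dependencies = {
    "region-a": {
        "control-plane v2.14", "config-push tooling", "vendor-X switches",
        "sso-provider", "facility: colo-1",
    },
    "region-b": {
        "control-plane v2.14", "config-push tooling", "vendor-X switches",
        "sso-provider", "facility: colo-2",
    },
}

shared = set.intersection(*region_dependencies.values())
independent = {name: deps - shared for name, deps in region_dependencies.items()}

print("Shared failure modes (redundancy does NOT cover these):")
for dep in sorted(shared):
    print(f"  - {dep}")

print("Genuinely independent per region:")
for name, deps in independent.items():
    print(f"  {name}: {sorted(deps)}")
```

In this made-up example the only genuinely independent item is the facility itself; a bad release of the shared control-plane software or a misfiring config-push tool would take both regions down together.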
Failover is a muscle that requires exercise
The Cloudflare postmortem is direct about this point: the company had long-standing plans to make the control plane resilient to the loss of any single facility, and had not completed the work. When the facility was lost, the failover that should have been routine was not routine — it required substantial manual intervention because the procedures hadn't been tested at the scale they were now required to handle.
This is a near-universal pattern. Failover code that isn't regularly exercised stops working: dependencies drift, secrets expire, runbooks become stale, the people who knew how to run it leave the company. The only failover that works during an incident is the failover that has been used recently in a controlled test.
For any API provider with a redundancy story: schedule controlled regional failover tests at a regular cadence. The cost is the operational disruption of the test; the alternative is finding out during the incident that the failover doesn't work. Major cloud providers do this routinely; smaller API providers often don't, and pay for it the first time they need to fail over.
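As a rough illustration of what "exercised at a regular cadence" can look like in practice, here is a skeleton of a drill harness that times and records each step. Every step here is a placeholder; a real drill would promote real standbys and repoint real traffic, and the value is in the recorded timings and failures.

```python
# Hypothetical skeleton for a scheduled failover drill. Each drill is executed,
# timed, and recorded, so drift (stale runbooks, expired credentials, missing
# permissions) surfaces in a controlled test rather than during an incident.
import time
from dataclasses import dataclass

@dataclass
class DrillResult:
    step: str
    ok: bool
    seconds: float
    notes: str = ""

def run_drill(steps) -> list[DrillResult]:
    results = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()
            results.append(DrillResult(name, True, time.monotonic() - start))
        except Exception as exc:  # record the failure, keep going
            results.append(DrillResult(name, False, time.monotonic() - start, str(exc)))
    return results

# Placeholder actions; in a real drill these would promote the standby
# database, shift control-plane traffic, and verify the management API
# answers from the secondary site.
steps = [
    ("promote standby database", lambda: None),
    ("shift control-plane traffic to secondary site", lambda: None),
    ("verify dashboard and management API from secondary", lambda: None),
    ("fail back", lambda: None),
]

for r in run_drill(steps):
    status = "ok" if r.ok else "FAILED"
    print(f"{status:>6}  {r.seconds:6.1f}s  {r.step}  {r.notes}")
```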
Honest postmortems strengthen the API contract
The Cloudflare postmortem is worth reading as a piece of writing, not just for its content. It names the design decisions that led to the duration of the incident. It names the work that should have been completed and wasn't. It commits to specific improvements with timelines.
The pattern matters because the postmortem is itself part of the API contract — not the legal contract, but the trust contract. Customers building on top of an API are extending trust to the provider; honest accounts of failures and their causes are a way of paying that trust back. The opposite — vague postmortems that blame "an unfortunate combination of factors" without specifics — erodes trust faster than the original incident.
For your own APIs: if you have a public service, develop the discipline of writing detailed postmortems for material incidents and publishing them. The customers who matter most — the engineering teams building on top of you — read these closely and adjust their reliance accordingly.
The recovery profile is part of the design
Two days is a long time. Even for a control-plane outage where the data plane stayed up, two days of "you can't change your configuration" is a meaningful business impact for many customers. The duration was driven not by the failover itself, which was completed within a day, but by the work of restoring some services from backups.
For API providers, this is a reminder that the recovery profile — how long it takes to come back up after each kind of failure, and which capabilities are restored in what order — is a designed property, not an emergent one. If your worst-case recovery is days because you've never thought about the recovery sequence, you'll discover that during the incident.
The exercise is small: for each major failure mode you've identified, answer what's down, what comes back first, what takes longest, and what data could be lost. Then prioritize the work to shorten the longest item.
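A minimal way to capture the exercise, assuming hypothetical failure modes and made-up recovery estimates, is a small inventory sorted by the longest recovery item:

```python
# Hypothetical inventory for the exercise: one row per failure mode, with the
# answers the paragraph asks for. Sorting by the longest recovery item shows
# where to spend the hardening effort first.
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    whats_down: str
    first_back: str
    longest_item: str
    longest_hours: float
    data_at_risk: str

modes = [
    FailureMode("primary DB loss", "writes, dashboard", "read-only API",
                "restore analytics store from backup", 36.0, "last ~5 min of writes"),
    FailureMode("control-plane facility loss", "dashboard, config API",
                "status page, read-only API",
                "rebuild config pipeline in secondary", 48.0, "in-flight config changes"),
    FailureMode("DNS provider outage", "new lookups after TTL expiry",
                "secondary DNS", "propagation of NS changes", 24.0, "none"),
]

for m in sorted(modes, key=lambda m: m.longest_hours, reverse=True):
    print(f"{m.longest_hours:5.0f}h  {m.name}")
    print(f"       down: {m.whats_down}; first back: {m.first_back}")
    print(f"       longest item: {m.longest_item}; data at risk: {m.data_at_risk}")
```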
What customers learned
Three observations from the public response of Cloudflare customers:
- Customers who depended on the Cloudflare API for runtime decisions had a worse incident than customers who only used it during configuration changes. If your application calls the Cloudflare API during normal request handling — to fetch configuration, to interact with Workers KV, to retrieve cached data — your application's availability depended on Cloudflare's control plane during this incident. Decoupling runtime behavior from control-plane availability is design work that pays off here.
- Customers without a tested fallback for DNS or CDN had no good options. Cloudflare's data plane stayed up, so most customers were not directly affected. But for customers who hit issues that required configuration changes, the inability to make them was disruptive. Having a documented "what we do if our CDN provider's control plane is down for a day" runbook is part of the operational maturity that separates well-run systems from fragile ones.
- The status page worked. Cloudflare's status page is hosted independently of its main infrastructure; it remained available and was updated regularly throughout the incident. This is the lesson from the AWS December 2021 incident applied correctly.
What it means for your own API design
Three concrete actions:
- Identify your single points of failure honestly. A multi-region architecture diagram is a starting point, not a guarantee. Walk through what happens if your primary database goes down, if your DNS provider has an outage, if the data center hosting your control plane loses power. The dependencies that surface are the ones to harden.
- Test the failover paths you claim to have. Quarterly is a reasonable cadence for major systems. The first test will reveal that the failover doesn't work the way you think; the value is in finding that out under controlled conditions.
- Write postmortems publicly when material incidents happen. Specific, candid, with the design decisions named. The credibility gain is real; the documentation forces the organization to understand the failure rather than pattern-match it to the last one.
What it means for your API integrations
Two concrete actions for anyone consuming APIs:
- Decouple runtime from control plane. If your application calls a third-party API during user request handling, you have a hard dependency on that API's data plane. If your application calls a third-party API to fetch configuration that's then cached locally, you have a soft dependency on the control plane. Designing for the soft dependency where possible reduces the blast radius of incidents; a minimal sketch follows this list.
- Have a "control-plane down" runbook for each critical provider. Not "what do we do if Cloudflare is down" — what do we do if Cloudflare's customer dashboard and API are unavailable for two days, but our cached configuration continues to work? The answer is usually "wait it out and don't make panic decisions", but knowing that in advance is much better than figuring it out at hour 24.
The broader pattern
The pattern across the major cloud and SaaS postmortems of recent years is consistent: data plane failures are catastrophic but rare; control plane failures are more common and chronically underweighted in design. The work to make control planes as redundant as data planes is unglamorous, often gets deferred, and only matters during incidents — but those incidents, when they happen, define the customer's perception of the provider for years.
For any API provider operating at scale, the work is worth doing before the incident that requires it. For any API consumer, the awareness that this dependency exists shapes the architecture decisions that determine your own resilience.
Where to go next
For the integration patterns that handle upstream API outages, see the integration guide. For the rate-limiting and retry behavior that contributes to recovery, see API Rate Limiting Strategies. For the broader picture of operational resilience in API design, see the AWS US-EAST-1 reading for a complementary case.