Last week, I was inveighing against catastrophic data loss in South Korea, where a fire at a government data center incinerated a decade’s worth of work for some 700,000 civil servants while knocking out more than 600 services. I wrote, “… our reliance on centralized systems for such a widespread number of critical services demands deeper introspection.”
Well, here we are again.
This morning, Amazon Web Services suffered a catastrophic outage in its US-EAST-1 data center complex in Northern Virginia, starting at around 3am Eastern and continuing into the morning rush. As of this writing at 10am, 71 services at the data center remain offline, while 37 have been brought back online. Amazon’s US-EAST-1 is one of the most important data centers in the world, and its failure has led to outages of dozens of well-known applications, according to Downdetector and social media reports.
It’s obvious that we are building our critical applications on foundations of sand; just as with South Korea, an outage in US-EAST-1 is hardly unprecedented. As The Verge notes, “AWS outages in the US-East-1 region have created widespread disruptions in 2023, 2021, and 2020, forcing multiple websites and platforms offline for several hours before regular service was restored.” Yet developers and IT architects keep returning to that same data center while failing to mitigate its predictable outages.
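The mitigation itself is not exotic. As a minimal, hypothetical sketch (assuming data is already replicated to a second region, for instance via a DynamoDB global table; the table and region names here are illustrative, not anyone’s actual architecture), a client can simply fail over away from us-east-1 when it stops responding:

```python
# Hypothetical failover sketch: fall back to a secondary region when us-east-1 is unreachable.
# Assumes the table is replicated across regions (e.g., a DynamoDB global table).
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

def get_item_with_failover(table_name: str, key: dict) -> dict:
    last_error = None
    for region in REGIONS:
        try:
            table = boto3.resource("dynamodb", region_name=region).Table(table_name)
            return table.get_item(Key=key).get("Item", {})
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # this region is unhealthy; try the next one
    raise RuntimeError("all configured regions failed") from last_error
```

If the engineering is this mundane, the persistence of single-region deployments looks less like a technical limit and more like a choice.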
If you are thinking “WTF,” I don’t blame you. This has far less to do with technical acumen or engineering prowess, and far more to do with organizational design around risks and rewards. It reminds me of Dan Davies’ book The Unaccountability Machine and his notion of “accountability sinks.” The premise is that organizations adapt themselves to ensure no one working at them can be held accountable for what the organization does. Stafford Beer’s koan that “the purpose of a system is what it does” needs a Davies corollary, namely that “no one here chose that purpose, sorry.”
In other words, rather than engineering these systems for resilience and risk mitigation so that they stay robust even under complicated and unusual outages, organizations have chosen cover-your-ass-at-all-costs, flocking to the same US-EAST-1 data center in mutually-assured defensibility. Don’t blame us, it’s Amazon!
In fact, that’s basically a perfect sales pitch for Amazon’s services. No one can be fired for selecting Amazon, since the alternative would be an equally incompetent employee at another organization making the exact same decision. This herd mentality induces catastrophic and correlated societal risks. As the world has become more deeply integrated, these outages and failures metastasize further.
As the fire in South Korea and this morning’s AWS outage show, essential technical infrastructure has a habit of going offline. CrowdStrike pushed out a bad patch and computers worldwide were bricked. Jeep pushed out an over-the-air update last weekend that summarily froze the Wrangler 4xe for many owners. Texas lost power for days during the 2021 winter storm, the Suez Canal was blocked by a cargo ship, and the Panama Canal sometimes lacks sufficient water to allow vessels to pass.
These events can feel random in the headlines, but they aren’t emergent behaviors of complex systems that are impossible to foresee. This is not the butterfly effect causing a hurricane with a flap of the wings. Every one of these crises was predictable, and therefore every one of them can be mitigated.
Any real-world system degrades stochastically, which means we must diligently conduct chaos maintenance to ensure stability. Yet such maintenance is increasingly not budgeted for or, worse, per Davies, is actively ignored in the hope that no one can be held accountable for any failures.
I’m a strong nuclear energy advocate, but if you are curious why so many people don’t believe humanity could ever safely run a complex technical system that could melt down, this is why. Even the most robust organization will struggle to maintain the highest incentives for safety and risk mitigation. I reference Charles Perrow and his book Normal Accidents perhaps too much on Riskgaming, but his central thesis bears repeating: tightly-coupled and highly-complex systems will inevitably fail due to the interlocking and emergent interactions of the underlying assemblage of components. We can reduce such failures, but we can’t eliminate them.
A single computing service having an outage shouldn’t lead to push notifications from every major news service and a major concern that stock markets will crash today (that’s President Trump’s job!). We can’t allow accountability to be shorn from applications this critical to society. Hospital systems fail when servers go down, and there are patients relying on the outcome of those computations. With power comes responsibility, they say, or used to. I don’t know, my social media feed is blank.