Massive Fastly outage causes large parts of the internet to go dark

After a malfunction at cloud computing service provider Fastly, scores of websites went down on Tuesday morning. Internet users could not access major news outlets, e-commerce platforms, and even government websites. Everyone from Amazon to the New York Times to the White House was affected, all thanks to a customer trying to change their settings.

At about 6:30 a.m. ET, Fastly said it had applied a “fix” to the problem, and many of the websites that were down appeared to be up and running again as of 9:00 a.m. ET. Still, the outage shows just how interdependent, centralized, and vulnerable the infrastructure that supports the internet actually is, particularly the cloud computing providers that the average user doesn’t interact with directly. This is at least the third time in less than a year that a problem at a major cloud computing provider has knocked countless websites and apps offline.

Fastly is a content delivery network (CDN), which maintains a network of servers that quickly transfer content from websites to users. The company, which counts Shopify, Stripe, and many media outlets as customers, promises “lightning fast delivery” and “advanced security.” The nature of such a network also means that problems can spread quickly and affect many of those customers at once. In the case of Tuesday’s incident, Fastly said it “identified a service configuration that was causing disruptions” around the world. It took approximately two hours from the time the issue was identified until a fix was implemented.

At this time, there is no reason to suspect that the outage was the result of a cyber attack. On Tuesday evening, Fastly said the problem was the result of a software bug, apparently caused by a single customer. Yet the outage comes amid a slew of recent cyber incidents that have impacted everything from the global meat supply to a major oil pipeline in the United States.

It is nevertheless clear that the outage caused momentary chaos. The Downdetector site, which tracks complaints about website outages, shows that a slew of sites saw a surge in complaints this morning, not only media outlets like the New York Times and CNN, but also Reddit, Spotify, and Walt Disney World. Outages at payment systems like Stripe and e-commerce platforms like Shopify also suggest money could have been lost on transactions that didn’t go through, though it’s unclear as of now whether that’s the case.

All Vox Media websites, including this one, were offline for half an hour. The Verge, owned by Vox Media, switched to publishing its content in a Google Doc until web users flooded the document and started editing it (editors had inadvertently left the document open to editing). Kentik, a network monitoring company, reported that the outage was responsible for a 75 percent drop in traffic from Fastly’s servers.

The magnitude of Tuesday’s outage, and the frequency of major outages like this one, is truly concerning. Last July, connectivity issues between two of the data centers operated by Cloudflare took many sites, including Politico, League of Legends, and Discord, offline for a short time. Then, a data processing problem at Amazon Web Services last November caused problems for sites like the Chicago Tribune, the security camera company Ring, and Glassdoor. Fastly’s outage shows that the trend is continuing, especially as most of the web relies more and more on cloud providers.

While the issue appears to be resolved for now, it will take some time to measure the damage caused by even a few hours of downtime at a major cloud computing provider. And that leaves the world anxiously waiting for the next time this happens.

Why these outages feel like they are getting worse

One of the reasons the Fastly outage seems so large is that cloud computing service companies like Fastly are consolidating, leaving websites dependent on a smaller number of providers. Even if the total number of outages isn’t rising much, the fact that so many everyday sites depend on fewer cloud providers makes each individual outage quite significant for an average internet user who just wanted to buy some stuff on Amazon and read the New York Times early Tuesday morning.

There are benefits to consolidation, explains Doug Madory, head of internet analytics at the network monitoring company Kentik. For example, a smaller number of cloud providers means that it is much easier to have those providers implement a particular security change. “The downside is [the] liability [of having] a few mega-corporations, be it CDNs [content delivery networks] or other types of internet companies, which are responsible for much of our internet activities,” Madory told Recode.

In other words, when one of these mega-corporations updates its systems and accidentally causes a failure, the blast radius can be quite large. That’s what happened in 2011 when one of Amazon’s cloud computing systems, Elastic Block Store (EBS), crashed, taking Reddit, Quora, and Foursquare offline. After the incident, Amazon explained that engineers had accidentally introduced technical issues that cascaded through its systems and caused the outage.

“You end up getting more and more failures,” explains Christopher Meiklejohn, a PhD student at Carnegie Mellon’s Institute for Software Research. “They are difficult to debug. They are stressful and difficult to resolve. And if you think about making that change, they can be very difficult to detect early on because the systems are so complex and there are so many moving parts involved.”

In the case of Fastly’s outage on Tuesday, the problem appeared to be due to a bug introduced in May when the company rolled out new software. But the problem wasn’t discovered until Tuesday, when a routine change to a customer’s settings triggered the bug and inadvertently took down much of the internet, according to a summary released by Nick Rockwell, the company’s SVP of engineering and infrastructure.

Central to the challenge of systems like Fastly’s, Meiklejohn said, is the fact that these cloud computing systems can include tens of thousands of servers deployed around the world. It is very difficult for developers working on new changes to anticipate all the behaviors of the larger system, which makes it more likely that an error will occur when updates are eventually deployed. Companies don’t always have the tools to detect these problems before they arise, although research and engineering effort toward better solutions is growing.

The Fastly outage also occurred amid growing cybersecurity concerns. Now many are eager to learn more details from Fastly, which markets itself as a reliable and fast service, about how its systems went down. The outage is a reminder that the internet is built on an increasingly complex, global infrastructure, where a failure can affect the sites and services of countless companies. That means that small mistakes can have major consequences.

Update, June 9, 2021, 3:40 PM ET: This piece has been updated with new information about the cause of the failure.