Unless you live without social media, you almost certainly witnessed the moments when the Internet seemed to break down over the past eight months. Major websites experienced outages that shut down essential services and left people unable to download images, access their email and calendars, or even use direct messaging for hours at a stretch. Behemoths like Facebook, Google, and Apple all fell prey to breakdowns that left much of the world reeling as users realized that the Internet isn’t as safe or as permanent a place to store their data as they once thought. June and July were particularly bad months: several seemingly unrelated bugs caused outages across Google’s, Facebook’s, Apple’s, and even Twitter’s websites within a span of a few days, raising important questions about the stability and sustainability of the systems on which the Internet is based.
Perhaps the most massive outage experienced by Facebook occurred in March, with millions of users reporting that they could not access most of Facebook’s family of apps. For the billions of people all over the world who rely on the networking giant’s social media apps for most of their communication needs, the outage came as a shock. Some couldn’t access the apps at all, and others were locked out of certain features like stories and shared media. Users took to Twitter to air their frustrations, and Facebook was forced to resort to sharing updates on Twitter to pacify befuddled users.
There was speculation that Facebook had been the target of a distributed denial-of-service (DDoS) attack, and some claimed that a bug in the server had caused the crash. Eventually, Facebook revealed that a server configuration change had been responsible for the outage. However, it provided no detailed explanation of what the exact problem was or how it had been resolved. A subsequent outage in April was quickly resolved with, again, no explanation of what had happened. Given the company’s recent legal troubles, these outages have only served to further erode the public’s trust in the tech giant.
Google was a little more forthcoming than Facebook in explaining what had gone wrong during its own outage. It attributed the breakdown of services to a routine configuration change that had accidentally been applied to servers it was not intended to affect. This caused the servers to drop over half their network capacity, leaving the network congested. Small, latency-sensitive traffic flows were then prioritized over large, less latency-sensitive flows, and the latter were dropped, resulting in the outage. The network congestion made it hard to roll back the configuration change immediately, which is why it took three hours for Google staff to restore services to its users. This was the second time this year that YouTube experienced an outage, having gone down for an hour and a half in January.
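To make the prioritization step concrete, here is a minimal sketch of how a congested link might admit latency-sensitive flows first and drop whatever no longer fits. It is an illustration only, not Google's actual mechanism: the `Flow` class, the `admit_flows` function, the flow names, and the bandwidth figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    gbps: float              # bandwidth the flow wants (made-up figure)
    latency_sensitive: bool


def admit_flows(flows, capacity_gbps):
    """Admit flows in priority order until the link is full; drop the rest."""
    # Latency-sensitive flows sort first; within a class, smaller flows go first.
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.gbps))
    admitted, dropped, used = [], [], 0.0
    for flow in ordered:
        if used + flow.gbps <= capacity_gbps:
            admitted.append(flow.name)
            used += flow.gbps
        else:
            dropped.append(flow.name)
    return admitted, dropped


# Hypothetical traffic mix for illustration.
flows = [
    Flow("dns-lookups", 1.0, True),
    Flow("video-calls", 3.0, True),
    Flow("bulk-backup", 5.0, False),
    Flow("batch-export", 4.0, False),
]

# Full capacity: every flow fits.
print(admit_flows(flows, capacity_gbps=13.0))
# Capacity cut by more than half: only the latency-sensitive flows are admitted.
print(admit_flows(flows, capacity_gbps=6.0))
```

With the reduced capacity, the two large flows are dropped entirely, which is the pattern the explanation above describes: the small, latency-sensitive traffic keeps working while bulk traffic is sacrificed.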
A worldwide outage of major websites like Google, Amazon, and Reddit in late June highlighted some deep-seated issues in the infrastructure of the Internet. As Internet users all over the world were left reeling from the outage of many of the most widely used websites, Cloudflare managed to identify Verizon as the source of the issue.
In this case, the culprit wasn’t a faulty server configuration change but a system that has been around for over 20 years: the Border Gateway Protocol (BGP). BGP is responsible for routing traffic through Internet service providers before directing it to services. Route leaks can cause massive volumes of traffic to be directed through networks that aren’t equipped to handle them, leading to a disruption of services. Cloudflare accused Verizon of not putting in place limits that would have shut down the route leak that took down much of the Internet. Cloudflare claimed that it was sheer “sloppiness” on Verizon’s part not to have implemented IRR filtering (IRR is the Internet Routing Registry), which has been around for over 20 years and could have stopped the session from which the outage originated. Companies are increasingly adopting the RPKI (Resource Public Key Infrastructure) framework, which helps prevent route leaks and route hijacking. Cloudflare called out Verizon for refusing to enable BGP Origin Validation, an action that would have allowed RPKI to be implemented.
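For readers curious what origin validation actually checks, the sketch below follows the general idea described in RFC 6811: an announcement is compared against signed route origin authorizations (ROAs) and classified as valid, invalid, or not found. This is a simplified illustration, not router code. The `Roa` tuple, the `validate_origin` function, and the ROA table are hypothetical, and the prefixes and AS numbers come from the ranges reserved for documentation.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    prefix: str       # address block the holder has authorized
    max_length: int   # longest prefix length allowed in an announcement
    asn: int          # AS allowed to originate the prefix


# Hypothetical ROAs built from documentation prefixes and ASNs.
ROAS = [
    Roa("198.51.100.0/24", 24, 64500),
    Roa("192.0.2.0/24", 24, 64501),
]


def validate_origin(prefix: str, origin_asn: int, roas=ROAS) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so skip mismatches.
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this announcement
        if roa.asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by a ROA but matching none of them means invalid.
    return "invalid" if covered else "not-found"


print(validate_origin("198.51.100.0/24", 64500))  # authorized origin and length
print(validate_origin("198.51.100.0/25", 64500))  # more specific than allowed
print(validate_origin("198.51.100.0/24", 64511))  # wrong origin AS
print(validate_origin("203.0.113.0/24", 64500))   # no ROA covers this prefix
```

A network that drops "invalid" announcements would refuse the second and third routes. Note that this check only catches wrong-origin and overly specific announcements; it complements, rather than replaces, IRR filtering and prefix limits.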
It’s ironic that barely a week after Cloudflare threw shade at Verizon, it experienced an outage of its own that led to multiple major websites going down. Sites like DownDetector that often report outages were themselves taken down by the issue, leaving many in the dark. Websites that rely on Cloudflare, including Patreon, SoundCloud, Udemy, Pinterest, Dropbox, Discord, Medium, Shopify, Zendesk, BuzzFeed, Nest, and Sling, were all affected by the disruption in services.
In a detailed blog post, Cloudflare explained that a “bad software deploy” had caused a spike in CPU utilization on its machines all around the world, disrupting as much as 82 percent of traffic at its worst point. The outage was attributed to a “single misconfigured rule” that had been deployed within the Cloudflare Web Application Firewall. Rolling back all the rules that had been deployed reversed the CPU utilization spike and restored Cloudflare’s services. While the Cloudflare goof-up received plenty of attention, Google also experienced an outage when a fiber cable on the East Coast of the United States was physically damaged. Google managed to resolve the issue by rerouting traffic until the cable was repaired.
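The Cloudflare incident shows why a rule change pushed everywhere at once is risky. One common safeguard is a staged rollout that watches CPU usage and rolls back automatically. The sketch below is an assumption about how such a guard could look, not a description of Cloudflare's deployment process: `staged_rollout`, the stage names, the CPU readings, and the 80 percent limit are all hypothetical.

```python
def staged_rollout(stages, apply_rule, rollback_rule, cpu_percent, cpu_limit=80.0):
    """Apply a rule stage by stage; undo every stage if CPU exceeds the limit."""
    deployed = []
    for stage in stages:
        apply_rule(stage)
        deployed.append(stage)
        if cpu_percent(stage) > cpu_limit:
            # Roll back in reverse order, newest stage first.
            for done in reversed(deployed):
                rollback_rule(done)
            return False, stage
    return True, None


# Toy fleet: hypothetical stage names and made-up CPU readings.
active = set()
readings = {"canary": 35.0, "region-a": 97.0, "global": 40.0}

ok, failed_at = staged_rollout(
    stages=["canary", "region-a", "global"],
    apply_rule=active.add,
    rollback_rule=active.discard,
    cpu_percent=lambda stage: readings[stage],
)
print(ok, failed_at, active)
```

In this toy run the rule never reaches the "global" stage: the CPU reading for "region-a" trips the limit, and both deployed stages are rolled back.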
Facebook explained that it had triggered an issue while conducting a routine maintenance check, which affected users’ ability to share pictures and videos. Twitter offered no explanation for the problem with its direct messaging feature, though services were eventually restored. There seemed to be no respite for big companies: the very next day, many of Apple’s iCloud features were unavailable for three hours, an outage believed to have been caused by a BGP issue similar to the one involving Verizon.
The go-to website for updates about website outages, Twitter, didn’t want to miss out on all the fun, and users found themselves unable to access the website for an hour on July 11. Mobile and web users could not load tweets for the duration of the outage.
Offering a rather vague explanation, Twitter attributed the outage to an internal system change that it eventually fixed. Users were soon back to tweeting about the outage, and all was well with the world.
Despite fears about distributed denial-of-service attacks, none of the major outages in the first half of the year turned out to be caused by security breaches. Surprisingly, they stemmed from systemic flaws and poor infrastructure instead. While technology created decades ago appears to have scaled well enough to keep forming the backbone of the ever-changing Internet, these incidents serve as reminders that the Internet is built on fragile infrastructure. Hopefully, major companies will see these outages as wake-up calls to begin securing their networks and protecting users from more serious threats in the future.