As an IT admin, you constantly deal with systems that don’t perform as they should. Most of the time, the disruptions are minor, and you fix them quickly. But when you’re Facebook, a company worth nearly $1 trillion that billions of people and millions of businesses count on, going down for even a few minutes is terrible. Going down for six hours is catastrophic.
So, why did Facebook go down yesterday? We now know it wasn’t a cyberattack or insider sabotage. It was a configuration error. Facebook’s Santosh Janardhan said “configuration changes” on routers that coordinate network traffic caused the outage.
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
— Santosh Janardhan, Facebook’s vice president of infrastructure
But how could “configuration changes” have such a devastating effect companywide? It was not just Facebook that went down. The outage took Facebook-owned companies WhatsApp and Instagram down with it. Even more shocking, Facebook products such as its Oculus VR headsets and its brand-new Ray-Ban Stories were unusable during the outage. And even Facebook employees were locked out of buildings and offices because the outage also took down the IT systems needed to power access badges.
Most companies, even small ones, design their IT infrastructure so that a failure like this can’t happen, or so that if it does, it is fixed quickly and doesn’t cripple the entire company.
For Facebook, the outage came at a time when the company was already under extreme scrutiny. Just this week, a Facebook whistleblower exposed what she said was a litany of unseemly practices by the company. And last month, a devastating exposé in the Wall Street Journal cited internal documents showing that Facebook knows its Instagram service is “toxic for teen girls.”
In the postmortem/apology on the outage, Janardhan said, “We’re working to understand more about what happened today so we can continue to make our infrastructure more resilient.”
Good idea. But why wasn’t that done before this embarrassing and devastating incident?