“The Internet is down, can you help me fix it?”
Aunt Mabel’s email message was staring at me from my inbox. Chuckling, I emailed her back, saying, “If the Internet is down, then how come you can email me?”
“Gmail is still working,” she replied, “and so is Google. It’s the Internet I can’t get onto.”
She meant Facebook, of course. For millions and millions of people, the Internet is almost synonymous with Facebook, Instagram, Messenger, and WhatsApp, the four big properties of the social networking giant. So when Facebook suddenly went dark around the world on Oct. 4, users everywhere panicked.
What happened? Could it happen again? Can this be prevented from happening in the future? And — for those of us who manage IT networks both large and small — could the same kind of thing happen to us?
The nutshell version
I’ve read and tried to digest several long articles about the Facebook outage, including Facebook’s own explanation, this post from Cloudflare, and this article by Brian Krebs. From talking with colleagues who manage far larger and more complex environments than mine, I’ve concluded that while Facebook’s DNS setup wasn’t the root cause of the disaster, it probably contributed in a major way to the scope of what occurred. A poorly conceived routing update disconnected their datacenters from each other, and the result was the expiry of all their DNS data. This prevented Facebook’s engineers from accessing the tools they needed to recover from the situation, and even prevented some Facebook employees from entering their buildings when their security badges stopped working. Eventually, however, Facebook found a way to get its recovery tools into the hands of the right people, and things gradually began returning to normal. Soon, my Aunt Mabel was happy again, too.
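To see why losing reachability translates into losing DNS, recall that recursive resolvers across the Internet cache a record only for its time-to-live (TTL); once authoritative servers become unreachable, there is nothing left to refresh those caches from. Here is a minimal, purely illustrative sketch of that expiry behavior in Python (the name, address, and TTL below are made up, not Facebook's actual records):

```python
import time


class DnsCache:
    """Toy model of a recursive resolver's cache: a record simply
    vanishes once its TTL elapses and it cannot be refreshed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock, handy for testing
        self._records = {}       # name -> (address, absolute expiry time)

    def put(self, name, address, ttl):
        """Cache an answer for `ttl` seconds from now."""
        self._records[name] = (address, self._clock() + ttl)

    def get(self, name):
        """Return the cached address, or None if missing or expired."""
        entry = self._records.get(name)
        if entry is None:
            return None
        address, expiry = entry
        if self._clock() >= expiry:
            # TTL expired, and with the authoritative servers unreachable
            # there is nothing to refresh from: the name just disappears.
            del self._records[name]
            return None
        return address


# Hypothetical usage with a fake clock to make the expiry visible:
now = [0.0]
cache = DnsCache(clock=lambda: now[0])
cache.put("www.example.com", "203.0.113.10", ttl=300)
print(cache.get("www.example.com"))   # cached answer while TTL is valid
now[0] = 301.0
print(cache.get("www.example.com"))   # None: the record has expired
```

Multiply that tiny cache by every resolver on the Internet, and you get a rough picture of how quickly a name can effectively vanish worldwide once its authoritative servers drop off the network.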
However, I’m sure these public disclosures and dissections of the Facebook outage don’t tell the whole story, as Facebook likely keeps some of its proprietary technology private for security reasons (though we know that security through obscurity usually doesn’t work and can sometimes even backfire). But this bit in the article by Brian Krebs especially caught my attention:
The source explained that the errant update blocked Facebook employees — the majority of whom are working remotely — from reverting the changes. Meanwhile, those with physical access to Facebook’s buildings couldn’t access Facebook’s internal tools because those were all tied to the company’s stranded domains.
Yikes, I thought. Don’t they have out-of-band (OOB) access set up for key elements of their infrastructure in case exactly this kind of situation arises?
The importance of OOB
I wrote quite recently here on TechGenix about why out-of-band (OOB) management solutions are vital for your network. I emphasized that OOB management systems give you an alternate, dedicated, and secure method of accessing your IT network infrastructure so you can remotely administer your servers, applications, and other IT assets when normal access is not possible. This can mean something as simple as having a 4G or 5G router in place with a virtual private network (VPN) connection to a terminal server. That way, you can always use your cellular provider's network to configure your networking equipment through its console port, even when the configuration itself has been trashed for some reason. In other words, you use a secondary network (usually someone else’s network) as a backdoor into your own network in an emergency. Of course, this also means putting various security controls and policies in place to prevent breaches and keep your network infrastructure safe and secure.
I’m also sure, of course, that Facebook’s dedicated team of highly experienced network engineers knows this and has such OOB systems in place for DR/BC situations. Configuring OOB access on a per-device basis for such a large network must be impracticable, though; they must leverage some sort of distributed directory system for doing this at scale, which probably means it relies on DNS. That makes the recent massive outage, and reports of the difficulties Facebook had fixing it, all the more disturbing: it suggests to me that they may need to rethink certain aspects of a DNS setup that probably serves billions of user lookups daily, in addition to enabling network access for everyone who works at the company around the world. But such things are beyond my limited experience managing DNS. And the article by their engineering team doesn’t discuss (for obvious reasons) how their OOB network management is set up, so I can’t speculate any further on that subject either.
What we can learn from the Facebook outage
So, what can we learn from this Facebook outage disaster besides keeping secure OOB backdoors in place for accessing key network infrastructure components in DR/BC scenarios? Let’s get back to basics.
Beware of complacency. Just because your network management tools work today doesn’t mean they’ll work tomorrow. Think about what could cause them to stop working, then re-engineer your network to prevent a disaster. Update your DR/BC playbook accordingly and test it frequently.
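One concrete test worth adding to that playbook, given that Facebook's internal tools were reportedly tied to its stranded domains, is checking that your critical management hostnames remain reachable even when your own DNS fails. Here is a rough sketch (the hostnames and addresses are hypothetical) of a resolver that falls back to a hardcoded emergency map, exactly the kind of path a DR drill should exercise:

```python
import socket

# Hypothetical emergency map. In practice this would be a file or printed
# sheet distributed out of band and refreshed on a regular schedule.
EMERGENCY_IPS = {
    "tools.corp.example": "192.0.2.10",
}


def resolve_with_fallback(host, fallback=EMERGENCY_IPS):
    """Try normal DNS first; if resolution fails, fall back to the
    hardcoded emergency map. Re-raise if the host isn't in the map."""
    try:
        return socket.getaddrinfo(host, None)[0][4][0]
    except socket.gaierror:
        addr = fallback.get(host)
        if addr is None:
            raise
        return addr
```

The fallback map only helps if it is kept current and stored somewhere that doesn't itself depend on your DNS; keeping it accurate is precisely the sort of thing a frequently tested playbook catches.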
Ensure an alternate collaboration solution is in place in case your main one goes down. Does your company rely on Microsoft Teams? Be sure to have Slack or Webex or even Zoom set up as a backup and make sure everyone knows how to use it.
Don’t forget that your IT staff should also be included in your business continuity plan. Disaster recovery isn’t just about having tools and knowing how to use them. It also concerns the people who need to use those tools in an emergency. Do they know who to call if someone is on vacation or sick or in the hospital? Do they know who lives closest to the building in case on-premises access is required to address the situation? Do they have contact info for key co-workers on their cell phones? Is there a printed sheet listing emergency contacts pinned to their office wall at work or home in case their cell phone dies and they need to borrow their spouse’s phone?
Make sure one or two people high up in management are assigned responsibility for the disaster recovery effort, including signing off on any unplanned expenses needed for recovery. Their job should be twofold: keep the recovery process steadily moving forward, and firmly keep others in upper management from interfering with it. The last thing your stressed-out DR team needs is a CxO demanding status updates every 10 minutes while they are working hard to restore services. In this regard, if you set up conference rooms, whether physical or virtual, for handling the recovery, be sure to keep your executive conference room separate from the incident command room.
Finally, large organizations, especially those in the social media domain, should keep in mind that outages affecting their services can cause problems for other companies positioned between them and their end users on the network path. For example, Internet service providers with large numbers of residential customers suddenly found their helpdesks overwhelmed with angry users who, unable to get onto their beloved Facebook, Instagram, or WhatsApp, had been repeatedly pressing the reset button on their home routers trying to restore connectivity with the social network. “My router is not working!” echoed in the ears of stressed-out Tier 1 support technicians, who had to patiently explain, again and again, that “Facebook Does Not Equal Internet.”