According to Facebook, the global outage that took Facebook and its other platforms offline for hours was caused by an accident during routine maintenance.
Facebook’s vice president of infrastructure, Santosh Janardhan, wrote in a blog post that the outage was “triggered not by hostile behavior, but by an error of our own making.”
The problem arose while engineers were working on Facebook’s global backbone network, which comprises the computers, routers, and software in its data centers around the world, as well as the fiber-optic cables that link them.
“During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, which unintentionally took down all of our backbone network connections, effectively disconnecting Facebook data centers around the world,” Janardhan said Tuesday.
Facebook’s systems are designed to catch such errors, but a bug in the audit tool prevented it from stopping the command in this case, according to Janardhan.
The change also caused a secondary problem: it became impossible to reach Facebook’s servers, even though they were still running.
Engineers were sent to the data centers to fix the problem on site, but getting through the added layers of security took time, Janardhan wrote. The data centers are “tough to get into, and once inside, the hardware and routers are designed to be impossible to modify even when you have physical access to them,” according to the company.
Once connectivity was back, services were brought online gradually to avoid traffic surges that could cause further failures.
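The idea of a gradual restoration can be illustrated with a short sketch. This is not Facebook’s actual procedure, which the company has not published in this form; it is a minimal illustration that assumes traffic is readmitted in stages and that each stage is checked before the next one begins. The function names, the starting share of 10 percent, and the doubling factor are all hypothetical.

```python
def ramp_schedule(start_pct, factor, cap=100.0):
    """Yield the share of traffic (in percent) to admit at each stage.

    Hypothetical helper: starts at start_pct and multiplies by factor
    until the cap is reached, then yields the cap as the final stage.
    """
    if start_pct <= 0 or factor <= 1:
        raise ValueError("start_pct must be positive and factor greater than 1")
    pct = float(start_pct)
    while pct < cap:
        yield pct
        pct *= factor
    yield float(cap)


def restore(stages, is_healthy):
    """Admit traffic stage by stage; stop at the last level that stayed healthy.

    is_healthy is a caller-supplied check (an assumption of this sketch)
    that reports whether the systems cope with the given traffic share.
    """
    admitted = 0.0
    for pct in stages:
        if not is_healthy(pct):
            return admitted  # hold at the previous level instead of pushing on
        admitted = pct
    return admitted


if __name__ == "__main__":
    # Example run: pretend the systems struggle above 40 percent of normal load.
    level = restore(ramp_schedule(10, 2), lambda pct: pct <= 40)
    print(f"Holding at {level:.0f}% of normal traffic")
```

In a sketch like this, the health check is what matters: each increase in traffic is allowed only after the previous level has held, so a sudden surge never hits systems that have just come back online.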