On November 11th at 11:00 PM PST, planned power maintenance performed by our colocation provider disrupted campus connectivity at our Frankfurt Point of Presence (PoP). Although the scheduled maintenance was not performed directly on the cages operated by Box, it impaired the site's ability to access Box's Application Data Centers in the U.S. As a result, certain customers in parts of Europe experienced difficulties reaching our site. The issue was remediated by manually failing over traffic and removing the Frankfurt PoP from our external Domain Name System (DNS).
When customers access Box, their requests first pass through one of Box's PoPs (Points of Presence). Box operates a combination of international and domestic U.S.-based PoPs to enhance performance and the Box experience for customers around the world. This incident only impacted customers routed through our EMEA PoP, located in Frankfurt, Germany.
While the design of each PoP is almost identical, there are slight variations due to the local data center provider's features and capabilities. In selecting and designing each location, careful measures are taken to ensure the maximum level of redundancy in network equipment (Provider and Box), power equipment (UPS, generators, RPPs, etc.), and WAN connectivity: multiple WAN circuits are connected through multiple meet-me rooms, with diverse circuit paths both on the local campus and in the physical paths between each Box facility (underground, aerial, undersea, etc.). During this event, one of the data center provider's meet-me rooms was impacted on both the A and B power legs, which affected all telecommunications carriers in the room once their local UPS units were depleted. There was no direct impact to Box's cage or equipment, so the result was a partial site failure scenario. Box has opened tickets with its telecom providers to determine why an impact to a single meet-me room affected all circuits and is awaiting their root cause analyses (RCAs).
During the impacted time frame, the external DNS health checks that should have triggered automation to stop the impacted site from taking traffic continued to pass. This left the site accepting requests even though it could not process them. As a remediation, the health checks have been updated to accommodate both partial and full connectivity/site outage scenarios, and testing of multiple failure scenarios has been initiated.
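To illustrate the kind of health-check change described above, here is a minimal sketch of a check that passes only when a PoP can both accept requests and reach the backend data centers it forwards them to. It is an assumption-laden illustration, not Box's actual implementation: the `ProbeResult` structure, the `should_serve_traffic` function, and the 50% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one health-check round for a PoP (hypothetical structure)."""
    edge_reachable: bool     # the external checker can reach the PoP itself
    backends_reachable: int  # backend data centers the PoP can currently reach
    backends_total: int      # backend data centers the PoP is expected to reach


def should_serve_traffic(result: ProbeResult,
                         min_backend_fraction: float = 0.5) -> bool:
    """Return True only if the PoP can accept requests and also process them.

    The threshold is an assumed policy knob, not a value from the incident.
    """
    if not result.edge_reachable:
        return False  # full site outage: the PoP cannot be reached at all
    if result.backends_total == 0:
        return False  # nothing to forward to, so treat the site as unhealthy
    fraction = result.backends_reachable / result.backends_total
    # Partial outage: the PoP answers, but too few backends are reachable.
    return fraction >= min_backend_fraction
```

In the scenario from this incident, the PoP itself stayed up while its upstream circuits were down, so a check of this shape would evaluate `should_serve_traffic(ProbeResult(True, 0, 2))` as `False` and the site would be withdrawn from DNS, whereas a check that only probes the PoP edge would keep passing.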
To restore the customer experience during the incident, external DNS was manually updated to disable the impacted site. After verifying site stability with the data center provider and validating the Box infrastructure, external DNS was updated again to re-enable traffic to the Frankfurt PoP.
The following remediation actions have been completed or are planned:
Instrument additional capabilities to reduce the time needed to perform a manual failover if a site failure is not detected by automation in the future.
Verify network circuit path redundancy.
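As a rough illustration of what verifying circuit path redundancy can involve, the sketch below looks for physical path elements (meet-me rooms, conduits, and so on) that every circuit traverses, since any such element is a single point of failure. The function name, the data layout, and the circuit and facility names in the usage note are hypothetical and do not describe Box's actual circuits or tooling.

```python
def common_path_elements(circuits: dict[str, set[str]]) -> set[str]:
    """Return the path elements shared by every circuit.

    `circuits` maps a circuit name to the set of physical elements it
    traverses. A non-empty result means the circuits are not fully diverse.
    """
    if not circuits:
        return set()  # no circuits, so nothing can be shared
    # Intersect the element sets of all circuits.
    return set.intersection(*circuits.values())
```

For example, given two hypothetical circuits `{"carrier-a": {"mmr-1", "conduit-north"}, "carrier-b": {"mmr-1", "conduit-south"}}`, the function returns `{"mmr-1"}`, flagging the shared meet-me room as a common dependency.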
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.
The Box Team