For more information about our Incident Response and Communications, please read this support article.

We also maintain a list of Known Product Issues separate from this site here.

[Major] Issues Accessing Box in the EMEA Region
Incident Report for Box
Postmortem
Notice and disclaimer: Customer understands and agrees that this information is subject to change. To the best of our knowledge, this is the current state and will be updated as information is confirmed.
Last Updated: November 19, 2019

Summary Impact

On November 11th at 11:00 PM PST, planned power maintenance performed by our colocation provider disrupted campus connectivity to our Frankfurt Point of Presence (PoP). Although the scheduled maintenance was not performed directly on the cages operated by Box, it impacted the site's ability to reach Box's Application Data Centers in the U.S. Certain customers in parts of Europe experienced difficulty reaching our site. The issue was remediated by manually failing over the traffic and removing the Frankfurt PoP from our external Domain Name System (DNS).

Analysis

When customers access Box, their requests first pass through one of Box's Points of Presence (PoPs). Box operates a combination of international and domestic U.S.-based PoPs to improve performance and the Box experience for customers around the world. This incident only impacted customers routed through our EMEA PoP, located in Frankfurt, Germany.
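
A minimal sketch of how DNS-based steering toward a PoP could work, assuming each client region has an ordered list of preferred PoPs and that disabling a PoP in external DNS simply removes it from the candidate set. Only Frankfurt is named in this report; the other PoP identifiers, the region keys, and the `resolve_pop` function are hypothetical and do not describe Box's actual DNS configuration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical preference order per client region. Only Frankfurt ("fra")
# appears in the report; the other PoP identifiers are illustrative.
PREFERENCES: Dict[str, List[str]] = {
    "emea": ["fra", "us-east"],
    "us": ["us-east", "us-west"],
}


def resolve_pop(region: str, enabled: Set[str]) -> Optional[str]:
    """Return the first enabled PoP in the region's preference list.

    Returns None when the region is unknown or none of its PoPs is enabled.
    """
    for pop in PREFERENCES.get(region, []):
        if pop in enabled:
            return pop
    return None


if __name__ == "__main__":
    all_pops = {"fra", "us-east", "us-west"}
    print(resolve_pop("emea", all_pops))            # fra
    print(resolve_pop("emea", all_pops - {"fra"}))  # us-east
```

Removing `fra` from the enabled set plays the role of the manual DNS change described in this report: EMEA clients fall through to the next PoP in their (hypothetical) preference list.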

While the design of each PoP is almost identical, there are slight variations due to the features and capabilities of the local data center provider. In selecting and designing each location, careful measures are taken to ensure the maximum level of redundancy across network equipment (provider and Box), power equipment (UPS, generators, RPPs, etc.), and multiple WAN circuits connected through multiple meet-me rooms, with diverse circuit paths both on the local campus and along the physical routes between Box facilities (underground, aerial, undersea, etc.). During this event, one of the data center provider's meet-me rooms was impacted on both its A and B power legs, which affected all telecommunications carriers in that room once their local UPS units were depleted. There was no direct impact to Box's cage or equipment; the result was a partial site failure scenario. Box has opened tickets with its telecom providers to determine why an impact to a single meet-me room affected all circuits, and is awaiting their root cause analyses (RCAs).
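
The question Box raised with its telecom providers, why losing one meet-me room took down every circuit, amounts to asking whether some physical element is shared by all circuits. A minimal sketch of that check, assuming each circuit can be described as the set of physical elements (meet-me rooms, conduits, and so on) its path traverses; the function name, circuit names, and path data are hypothetical, not Box's actual topology.

```python
from typing import Dict, Set


def shared_failure_points(circuits: Dict[str, Set[str]]) -> Set[str]:
    """Return the path elements traversed by every circuit.

    Any element in the result is a single point of failure for the site:
    losing it takes down all circuits at once, however many carriers exist.
    """
    if not circuits:
        return set()
    paths = iter(circuits.values())
    common = set(next(paths))  # copy, so the caller's sets are not mutated
    for path in paths:
        common &= path
    return common


if __name__ == "__main__":
    # Hypothetical: two carriers that look diverse but share meet-me room "mmr-1".
    circuits = {
        "carrier-a": {"mmr-1", "conduit-north"},
        "carrier-b": {"mmr-1", "conduit-south"},
    }
    print(shared_failure_points(circuits))  # {'mmr-1'}
```

An empty result means no single element is common to every path. The sketch only finds single shared elements; combinations of simultaneous failures would need a fuller analysis.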

During the impacted time frame, the external DNS health checks that should have triggered automation to remove the impacted site from active traffic continued to pass. This left the site accepting active requests even though it could not process them. As a remediation, the health checks have been updated to detect both partial and full connectivity or site outage scenarios, and testing of multiple failure scenarios has been initiated.
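
The gap described here is a health check that confirms the PoP answers without confirming it can reach the U.S. Application Data Centers behind it. A minimal sketch of a check that treats both cases as unhealthy, assuming the checker can observe the PoP's local status and the result of a reachability probe to each backend data center. `SiteStatus`, `evaluate_site`, `should_serve_traffic`, and the data center names are hypothetical, not Box's monitoring code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SiteStatus:
    local_ok: bool                # PoP front end answers (a "shallow" check)
    upstream_ok: Dict[str, bool]  # probe result per backend data center


def evaluate_site(status: SiteStatus) -> str:
    """Classify a PoP as 'healthy', 'partial', or 'down'.

    'partial' means the PoP answers locally but cannot reach any backend
    data center (or no probe results exist). A PoP that reaches at least
    one backend is treated as able to serve.
    """
    if not status.local_ok:
        return "down"
    if not any(status.upstream_ok.values()):
        return "partial"
    return "healthy"


def should_serve_traffic(status: SiteStatus) -> bool:
    # Partial failures are withdrawn from DNS exactly like full outages.
    return evaluate_site(status) == "healthy"


if __name__ == "__main__":
    # Conditions resembling this incident: PoP up, U.S. backends unreachable.
    incident = SiteStatus(local_ok=True,
                          upstream_ok={"us-dc-1": False, "us-dc-2": False})
    print(evaluate_site(incident), should_serve_traffic(incident))  # partial False
```

A check limited to `local_ok` would have reported this site as healthy, which matches the behavior described above.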

During the incident, external DNS was manually updated to disable the impacted site and restore the customer experience. After site stability was verified with the Data Center provider and the Box infrastructure was validated, external DNS was updated again to re-enable traffic to the Frankfurt PoP.
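
Re-enabling a site only after its stability has been confirmed can also be expressed as a rule for automation. A minimal sketch, assuming a gate that withdraws a PoP on its first failed health check and restores it only after a run of consecutive passing checks. The `FailoverGate` class and its `required_passes` threshold are hypothetical choices, not values from this report, and the sketch does not model the confirmation from the Data Center provider that was part of the verification here.

```python
class FailoverGate:
    """Decide whether a PoP should be published in external DNS.

    The site is withdrawn on the first failed check and restored only
    after `required_passes` consecutive successful checks.
    """

    def __init__(self, required_passes: int = 5) -> None:
        self.required_passes = required_passes
        self.enabled = True
        self._consecutive_passes = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the new enabled state."""
        if not check_passed:
            self.enabled = False
            self._consecutive_passes = 0
        elif not self.enabled:
            self._consecutive_passes += 1
            if self._consecutive_passes >= self.required_passes:
                self.enabled = True
        return self.enabled


if __name__ == "__main__":
    gate = FailoverGate(required_passes=3)
    results = [True, False, True, True, False, True, True, True]
    print([gate.record(r) for r in results])
    # [True, False, False, False, False, False, False, True]
```

Requiring several passes in a row avoids flapping a site back into DNS on a single lucky probe; the cost is a slightly longer wait before traffic returns.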

Defect Remediation

The following remediation actions have been completed or are planned:

  • Change the health check monitoring and auto-remediation processes to respond actively to incidents similar to this one.
  • Partner with our colocation providers to enhance communication and notifications about site-related issues.
  • Request RCAs from our telco providers to understand why a single meet-me room took down redundant paths.
  • Instrument additional capabilities to reduce the time needed to perform a manual failover if a site failure is not detected by automation in the future.
  • Verify network circuit path redundancy.

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope this provides some clarity, and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team

Posted Nov 13, 2019 - 14:12 PST

Resolved
No additional impact has been observed and this incident is now considered Resolved. Please contact Box Support at https://support.box.com if you experience any additional issues.
Posted Nov 12, 2019 - 01:46 PST
Monitoring
Our teams have validated that affected components have recovered. We are continuing to monitor for any additional impact.
Posted Nov 12, 2019 - 01:12 PST
Identified
We have identified the source of an issue that resulted in the intermittent availability of the listed Box Services. Our Engineering Teams have deployed a remediation and are actively validating that all components are fully recovered.
Posted Nov 12, 2019 - 00:46 PST
Update
We are continuing to investigate an issue affecting some Uploads, Downloads, API, and Logins/SSO. We have amended this status post to include additional components.
Posted Nov 12, 2019 - 00:33 PST
Investigating
Some users may currently be unable to log in to Box. We are investigating this behavior and will provide additional updates as they become available.
Posted Nov 12, 2019 - 00:19 PST
This incident affected: Desktop Applications (Login/SSO), Mobile Applications (Login/SSO, Uploads/Downloads), Box Web Application (Login/SSO, Uploads/Downloads), and Box Platform / API (Uploads/Downloads).