On November 22nd, between 2:27 PM PST and 2:32 PM PST, some users may have been directed to an error page stating "Sorry, we can't access that page." They may also have seen 500 responses from our API. This happened because 50% of servers were pulled from the load balancer, and connections were not closed successfully before traffic shifted to the secondary datacenter. We resolved the issue by reverting the change and shifting traffic back to the primary datacenter. We are also updating our tools with additional guardrails to prevent similar issues in the future.
On investigation, we found that a pair of tools for managing servers and traffic distribution lacked guardrails on specific commands used in maintenance operations. This allowed commands to be run out of order during routine maintenance on November 22nd, which caused problems when switching between datacenters and led to the errors accessing the site.
The following remediation actions have been completed or are planned:
Create a pair of wrapper scripts that add guardrails to broad operations.
Create a warning dialogue for shifting edge traffic for services with unusual requirements.
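As an illustration of the kind of guardrail a wrapper script can provide, the following is a minimal sketch that refuses to run a maintenance step out of order. It is not our actual tooling: the step names, their ordering, and the MaintenanceGuard class are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass, field

# Hypothetical step names and ordering, assumed purely for illustration.
REQUIRED_ORDER = ("drain_connections", "shift_traffic", "remove_servers")


class OutOfOrderError(RuntimeError):
    """Raised when a step is attempted before its prerequisites have run."""


@dataclass
class MaintenanceGuard:
    """Tracks completed steps and only allows the next step in REQUIRED_ORDER."""

    completed: list = field(default_factory=list)

    def run(self, step, action):
        if step not in REQUIRED_ORDER:
            raise ValueError(f"unknown step: {step!r}")
        # The only step allowed next is the one right after the completed ones.
        done = len(self.completed)
        expected = REQUIRED_ORDER[done] if done < len(REQUIRED_ORDER) else None
        if step != expected:
            raise OutOfOrderError(
                f"refusing to run {step!r}; next allowed step is {expected!r}"
            )
        result = action()  # the real command would be invoked here
        self.completed.append(step)  # recorded only if the action succeeded
        return result


if __name__ == "__main__":
    guard = MaintenanceGuard()
    try:
        guard.run("shift_traffic", lambda: None)
    except OutOfOrderError as exc:
        print(exc)
    guard.run("drain_connections", lambda: None)
    guard.run("shift_traffic", lambda: None)
```

In this sketch a step is marked complete only after its action returns without raising, so a failed step has to be retried before any later step is allowed.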
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience. We hope this has provided some clarity, and we would be happy to answer any questions you may still have regarding this matter.
The Box Team