On December 15, 2023, Box experienced a severe service degradation. During this time, users experienced slowness and failures when interacting with all parts of the Box web app and public API, including logins, uploads, and downloads. Additionally, some customers may have seen small subsets of the content they created during the incident become sporadically hidden.
The issue was triggered by a partial power loss that affected persistent disks in one of the cloud infrastructure availability zones in which we operate, which in turn affected multiple active databases. Although our database infrastructure is architected with zonal redundancy, the unusual nature of the failure exposed a gap in our zonal health monitoring and presented unforeseen challenges for our automatic recovery tooling, which extended the resolution time. We ultimately resolved the issue by manually shifting traffic out of the impacted availability zone.
Box’s database infrastructure is designed to withstand problems impacting a single availability zone. Our database replicas are spread across multiple availability zones for redundancy, and we have automated tooling in place that detects faults and shifts traffic to other replicas when necessary. Both the redundant replicas and the automated tooling are exercised regularly as part of our normal operations.
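As a purely illustrative sketch of the kind of zone-aware failover described above, the following Python shows how traffic might be shifted from an unhealthy replica to a healthy one, preferring a replica in a different availability zone. This is not Box's actual tooling: the `Replica` type, the `pick_serving_replica` function, and the zone and database names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical description of one database replica."""
    name: str
    zone: str
    healthy: bool


def pick_serving_replica(replicas, current):
    """Return the replica that should serve traffic.

    Keeps the current replica while it is healthy. Otherwise fails over to
    another healthy replica, preferring one outside the current zone.
    """
    if current.healthy:
        return current
    candidates = [r for r in replicas if r.healthy and r.name != current.name]
    if not candidates:
        raise RuntimeError("no healthy replica available")
    # False sorts before True, so replicas in a different zone come first.
    candidates.sort(key=lambda r: r.zone == current.zone)
    return candidates[0]


replicas = [
    Replica("db-a", "zone-1", healthy=False),
    Replica("db-b", "zone-1", healthy=True),
    Replica("db-c", "zone-2", healthy=True),
]
print(pick_serving_replica(replicas, replicas[0]).name)
```

Note that this sketch relies on a single boolean health flag per replica. A partial failure, in which a replica responds but cannot sustain high traffic, is exactly the kind of case such a simple check can miss.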
In this case, the partial power loss impacted several of our databases in a way that compromised their ability to serve high traffic volumes. The impact varied widely from one database to another, which made this type of partial zonal failure difficult to diagnose and mitigate for both our automation and our incident responders, extending the time to resolution. Once we confirmed that the impact was limited to a single availability zone, we mitigated the issue by shifting all active databases out of the impacted zone and into other availability zones.
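To illustrate why variable impact across databases makes a zonal problem hard to spot, here is a minimal sketch of aggregating per-database health by zone, so that a zone is flagged even when only some of its databases are degraded. It is an assumption-laden example, not a description of Box's monitoring: the `degraded_zones` function, the thresholds, and the sample data are all hypothetical.

```python
from collections import defaultdict


def degraded_zones(samples, error_threshold=0.05, min_fraction=0.3):
    """Return zones where a notable fraction of databases look unhealthy.

    samples: iterable of (database, zone, error_rate) tuples.
    A database counts as degraded when its error rate exceeds
    error_threshold; a zone is flagged when at least min_fraction of its
    databases are degraded. Both thresholds are illustrative values.
    """
    total = defaultdict(int)
    degraded = defaultdict(int)
    for _database, zone, error_rate in samples:
        total[zone] += 1
        if error_rate > error_threshold:
            degraded[zone] += 1
    return sorted(z for z in total if degraded[z] / total[z] >= min_fraction)


samples = [
    ("db1", "zone-1", 0.20),  # clearly degraded
    ("db2", "zone-1", 0.01),  # looks healthy despite sharing the zone
    ("db3", "zone-1", 0.01),
    ("db4", "zone-2", 0.00),
    ("db5", "zone-2", 0.01),
]
print(degraded_zones(samples))
```

Grouping by zone rather than evaluating each database in isolation is the design choice that matters here: a single degraded database looks like a local fault, while the same signal viewed per zone points to a shared cause.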
During the course of our mitigation efforts, replication topologies may have caused certain newly created files or file versions to intermittently not appear in a limited number of Box accounts. These intermittent visibility issues were resolved as replication topologies were normalized and caches were later cleared.
We are continuing to conduct a comprehensive engineering postmortem. Therefore, this report is subject to change based on our further analysis and findings.
The following corrective actions have been, or continue to be, implemented:
The above-noted corrective actions will strengthen our efforts to safeguard against partial zone degradations, reduce mitigation timeframes, and support enhanced testing and prevention efforts.
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope this report has provided some clarity, and we would be happy to answer any questions you may still have regarding this matter.
Sincerely,
The Box Team