[Critical] Issue with Multiple Box Services

Incident Report for Box

Postmortem

We recently addressed issues affecting Box services. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

On February 6, 2025, between 5:00 PM PST and 7:00 PM PST, Box conducted a planned zonal resilience test as part of ongoing efforts to enhance system reliability. During the test, all traffic from one active service zone was diverted to the other healthy zones. Some backend services became temporarily overloaded, which made the impact more severe than anticipated: some users may have experienced slowness, login failures, or other difficulties while using Box. The impact lasted until 7:04 PM PST, four minutes beyond the planned maintenance window. We mitigated the issue by redistributing traffic across all active zones.

Analysis

During the planned zonal resilience test, all traffic from one active service zone was diverted to the other zones. However, one zone unexpectedly received a disproportionate share of the diverted traffic. While the backend services in that zone generally had enough capacity to handle the load, one cache instance became overloaded because a single key was accessed significantly more often than the others. This overload had a ripple effect: it degraded database health, which in turn caused requests to pile up at the Edge layer and ultimately led to request rejections. During the test, we identified the root cause and mitigated the issue by ending the test and redistributing traffic more evenly across all active zones.
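
As an illustration of the traffic shift described above, the following Python sketch models per-zone load when one zone is drained. It is a simplified model under stated assumptions, not Box's routing logic: the zone names, the request rates, and the `redistribute` helper with its `weights` parameter are all hypothetical.

```python
def redistribute(load, drained, weights=None):
    """Return per-zone load after all traffic from `drained` is diverted.

    load:    mapping of zone name -> requests per second before the drain
             (hypothetical numbers).
    weights: optional mapping of zone name -> relative share of the diverted
             traffic; defaults to an even split across the remaining zones.
    """
    remaining = [zone for zone in load if zone != drained]
    if not remaining:
        raise ValueError("no healthy zones left to receive traffic")
    if weights is None:
        weights = {zone: 1.0 for zone in remaining}
    total_weight = sum(weights[zone] for zone in remaining)
    return {
        zone: load[zone] + load[drained] * weights[zone] / total_weight
        for zone in remaining
    }


before = {"zone-a": 300, "zone-b": 300, "zone-c": 300}

# Even split: each remaining zone absorbs half of zone-a's traffic.
even = redistribute(before, "zone-a")
# {'zone-b': 450.0, 'zone-c': 450.0}

# Skewed split (assumed 3:1): zone-b ends up far busier than zone-c.
skewed = redistribute(before, "zone-a", weights={"zone-b": 3, "zone-c": 1})
# {'zone-b': 525.0, 'zone-c': 375.0}
```

In the skewed case the total load is unchanged, but one zone carries 40% more than the other, which is the kind of localized pressure that can expose a single weak component.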

The issue revealed the following areas for improvement:

  • Insufficient monitoring for single-instance failures during resilience testing.
  • Uneven traffic distribution among zones, leading to localized overload.
  • Lack of mechanisms to detect hot keys (keys accessed far more often than others) and prevent them from overloading cache instances.
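
To illustrate the first point, the sketch below shows how a fleet-wide average can look healthy while a single instance is saturated. It is a minimal example with made-up values: the instance names, utilization figures, the 0.9 limit, and the `overloaded_instances` helper are assumptions, not Box's monitoring configuration.

```python
from statistics import mean


def overloaded_instances(utilization, limit=0.9):
    """Return the names of instances whose utilization exceeds `limit`.

    utilization: mapping of instance name -> utilization in [0, 1]
                 (hypothetical values).
    """
    return sorted(name for name, used in utilization.items() if used > limit)


fleet = {"cache-1": 0.35, "cache-2": 0.40, "cache-3": 0.98, "cache-4": 0.37}

# The aggregate view stays well under the limit...
fleet_average = mean(fleet.values())

# ...while the per-instance view flags the one saturated instance.
hot_spots = overloaded_instances(fleet)
# ['cache-3']
```

An alert based only on `fleet_average` would not fire here, whereas a per-instance check would.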

Corrective Actions

Box has initiated the following corrective actions:

  • Enhancing zonal resilience mechanics to ensure more balanced traffic distribution during a zonal failure.
  • Improving monitoring to detect single-instance failures more effectively during resilience testing.
  • Optimizing cache access patterns to prevent hot keys from overloading cache instances.
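
One simple way to detect a key that is accessed far more often than others is to count accesses over a window and flag any key whose share exceeds a threshold. The Python sketch below illustrates that idea only; the `find_hot_keys` helper, the key names, and the threshold value are hypothetical and do not describe Box's cache or its tooling.

```python
from collections import Counter


def find_hot_keys(accesses, threshold=0.2):
    """Return (key, share) pairs for keys that receive at least `threshold`
    of all accesses in the sample, most frequent first.

    accesses:  iterable of cache keys, one entry per access (hypothetical log).
    threshold: fraction of total accesses above which a key counts as hot
               (an assumed tuning value).
    """
    counts = Counter(accesses)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (key, count / total)
        for key, count in counts.most_common()
        if count / total >= threshold
    ]


# Eight of ten accesses hit the same key.
sample = ["user:1"] * 8 + ["user:2", "user:3"]
hot = find_hot_keys(sample, threshold=0.5)
# [('user:1', 0.8)]
```

Once a hot key is identified, common mitigations include replicating it across several cache instances or caching it locally in the caller, so that no single instance serves all of its reads.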

We are continuously working to improve Box and to deliver the best product and user experience we can. We hope this report has provided some clarity, and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team

Posted Feb 24, 2025 - 11:15 PST

Resolved

After further monitoring, this incident is now considered resolved. All services have been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.
Posted Feb 06, 2025 - 20:33 PST

Monitoring

A fix has been implemented and we are monitoring the results. All services were impacted by the Scheduled Resiliency Test Maintenance. Users may also have experienced the sidebar not loading in the Box web application; this was an anticipated impact of the test, but it occurred outside of the test maintenance window.
Posted Feb 06, 2025 - 20:13 PST

Update

We are continuing to investigate this issue.
Posted Feb 06, 2025 - 19:31 PST

Update

We are continuing to investigate this issue.
Posted Feb 06, 2025 - 19:18 PST

Investigating

We are investigating an ongoing issue affecting multiple Box services. We will provide more information as soon as it is available.
Posted Feb 06, 2025 - 19:10 PST
This incident affected: Box Platform / API (Content API) and Box Web Application (Login/SSO).