[Major] Issues with Uploads, Downloads, Box Notes, API and Logins

Incident Report for Box

Postmortem

We recently addressed issues affecting the Box Service. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 10:24am PST and 11:20am PST on February 6th, 2023, some users may have experienced difficulties while working in Box. During this time, a subset of Box’s external load balancers became unstable due to extremely high CPU usage, which caused intermittent issues for customers trying to access the Box Service. The underlying cause was network socket exhaustion on those load balancers, triggered by a migration of certain traffic to the public cloud. We resolved the issue by rolling back the migration and increasing available load balancer capacity. In addition, we are revising the resources and capacity we provision in the public cloud for these migrations and improving our monitoring to prevent similar issues from occurring in the future.

Analysis

As Box moves to the public cloud, we have been migrating workloads from our on-prem datacenters. The service involved in this incident handles an extremely high number of connections. While we were moving it out of the last on-prem datacenter it occupied, the load balancers in the public cloud became temporarily overwhelmed and exhausted their available network sockets. Connection retries then drove up CPU usage, eventually causing instances to fail health checks and be recreated. Each instance dropped its traffic while it was being recreated, shifting its high connection count onto other instances and compounding the issue. Rolling back the migration relieved the connection counts on the load balancers, and adding more capacity provided the headroom needed to recover completely. We are adding capacity to our cloud load balancing stack and updating our monitoring so that it points directly to the root cause if a similar issue occurs in the future.
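
The compounding effect described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of Box's actual load balancing stack: connections are assumed to spread evenly across healthy instances, an instance is assumed to fail its health check as soon as its share exceeds a fixed socket limit, and recreated instances are ignored. The function name `cascade_steps` and all numbers are hypothetical.

```python
def cascade_steps(total_connections: int, instances: int, socket_limit: int):
    """Return (healthy_instances, connections_per_instance) for each step.

    Toy model: load is spread evenly over healthy instances. If the
    per-instance share exceeds socket_limit, one instance is taken out
    and its share moves to the rest, raising their load further.
    """
    steps = []
    healthy = instances
    while healthy > 0:
        per_instance = total_connections / healthy
        steps.append((healthy, per_instance))
        if per_instance <= socket_limit:
            break  # the remaining instances can carry the load
        healthy -= 1  # one instance fails; its connections are redistributed
    return steps


# Hypothetical numbers, chosen only to show the three situations.
stable = cascade_steps(90_000, instances=4, socket_limit=25_000)
overloaded = cascade_steps(110_000, instances=4, socket_limit=25_000)
more_capacity = cascade_steps(110_000, instances=6, socket_limit=25_000)
```

In this model, once the even share exceeds the limit, every failure raises the load on the remaining instances, so the pool never settles on its own. Lowering the connection count (the rollback) or raising the instance count (added capacity) are the two ways the loop exits cleanly, which matches the two remediation steps in the analysis.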

Corrective Actions

The following corrective actions have been completed or are planned:

  • Add capacity to the public cloud load balancing stack
  • Update existing monitoring to account for network socket usage and exhaustion
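
As one illustration of what monitoring for socket usage might look like, the sketch below parses the TCP counters that Linux exposes in `/proc/net/sockstat` and compares them with the size of the ephemeral port range. It is an assumption-laden example, not Box's monitoring: the function names, the 70% and 90% thresholds, and the choice to count `inuse` plus `tw` sockets against the port range are all hypothetical, and real port exhaustion depends on the source and destination address pairs involved.

```python
def parse_sockstat(text: str) -> dict[str, int]:
    """Extract the TCP counters from /proc/net/sockstat-style text."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]
            # Fields alternate between a counter name and its value.
            return {name: int(value) for name, value in zip(fields[::2], fields[1::2])}
    raise ValueError("no TCP line found")


def socket_pressure(tcp: dict[str, int], port_low: int, port_high: int) -> float:
    """Rough fraction of the ephemeral port range in use (heuristic only)."""
    capacity = port_high - port_low + 1
    used = tcp.get("inuse", 0) + tcp.get("tw", 0)
    return used / capacity


def alert_level(pressure: float, warn: float = 0.7, crit: float = 0.9) -> str:
    """Map a pressure value to a hypothetical alert level."""
    if pressure >= crit:
        return "crit"
    if pressure >= warn:
        return "warn"
    return "ok"


# Sample text in the /proc/net/sockstat layout; the values are invented.
SAMPLE = """sockets: used 21000
TCP: inuse 20000 orphan 12 tw 5000 alloc 20100 mem 900
UDP: inuse 4 mem 2
"""

tcp = parse_sockstat(SAMPLE)
# 32768-60999 is the common Linux default for ip_local_port_range.
pressure = socket_pressure(tcp, port_low=32768, port_high=60999)
level = alert_level(pressure)
```

On a live host the text would come from reading `/proc/net/sockstat`, and the port range from `/proc/sys/net/ipv4/ip_local_port_range`. Alerting on this ratio directly, rather than only on CPU usage, is what lets monitoring point at socket exhaustion as the cause.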

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted Feb 09, 2023 - 10:23 PST

Resolved

This incident has been resolved.
Posted Feb 06, 2023 - 11:54 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 06, 2023 - 11:20 PST

Investigating

We are investigating an ongoing issue affecting uploads, downloads, Box Notes, and logins. We will provide more information as soon as it is available.
Posted Feb 06, 2023 - 10:44 PST
This incident affected: Box Platform / API (Content API, Uploads/Downloads, Box Sign), Box Web Application (Login/SSO, Uploads/Downloads, Box Sign), Desktop Applications (Login/SSO), Mobile Applications (Login/SSO, Uploads/Downloads), and Box Notes (Web Application).