[Minor] Issues with Uploads, Downloads, Public API, Box Sign and Box Notes
Incident Report for Box
Postmortem

We recently addressed issues affecting the Box webapp. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 5:00 pm PT on April 16, 2023 and 1:00 am PT on April 17, 2023, and again between 4:30 pm PT on April 17, 2023 and 2:00 am PT on April 18, 2023, some users experienced slowness and intermittent failures when interacting with certain parts of the Box webapp and public API, including when applying Box Shield Classifications and sending documents for signature via Box Sign. The issue was caused by the gradual rollout of a new feature, which overwhelmed a cache held inside one of our key middleware services. We resolved the issue by deploying a code change that prevents the feature rollout framework from overwhelming the cache, and by restarting the impacted service to clear the cache. In addition, we are improving monitoring and alerting around the impacted cache so that we can detect and address similar problems before they impact our users.

Analysis

Our investigation shows that the incident was caused by an inefficient configuration of the experimentation framework used to begin the gradual rollout of a new feature. The framework unexpectedly stored its decision results (whether the feature should be shown to a given user) in the web application server's cache. These stored decisions overran the cache and caused other entries to be evicted. As a result, the web application had to recompute frequently used, expensive-to-compute information that would ordinarily have been served from the cache. This dramatically reduced the effective processing capacity of our web application, which led to resource exhaustion and subsequent degradation of several of our latency-sensitive services. Remediation was slowed significantly by the lack of explicit alerting on the web application server's cache.
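
The eviction effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it models the cache as a least-recently-used (LRU) cache, which the report does not confirm, and the class `LRUCache`, the function `simulate`, the key names, and all sizes are hypothetical and chosen purely for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache (illustrative; not Box's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        while len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry


def simulate(store_flag_decisions, users=500, hot_keys=50, capacity=64):
    """Count how often an 'expensive' value has to be recomputed.

    Each simulated request reads one of `hot_keys` expensive entries.
    When `store_flag_decisions` is True, the request also writes one
    per-user rollout decision into the same cache (the hypothesized
    failure mode).
    """
    cache = LRUCache(capacity)
    recomputations = 0
    for user in range(users):
        hot = f"expensive:{user % hot_keys}"
        if cache.get(hot) is None:
            recomputations += 1  # cache miss: redo the expensive work
            cache.put(hot, "computed")
        if store_flag_decisions:
            cache.put(f"flag:new_feature:{user}", True)
    return recomputations


print(simulate(store_flag_decisions=False))  # 50: each hot key computed once
print(simulate(store_flag_decisions=True))   # 500: every request recomputes
```

In this toy setup the working set of 50 expensive entries fits comfortably in a 64-entry cache, yet adding one per-user decision entry per request is enough to push every expensive entry out before it is read again.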

Corrective Actions

The following corrective actions have been completed or are planned:

  • Fix the experimentation framework such that it doesn’t overwhelm the web application server’s cache
  • Add explicit monitoring and alerting around the usage of the web application server’s cache
  • Implement a fast way to clear the web application server caches on the entire server fleet
  • Add explicit monitoring and alerting around expensive operations performed by the web application server that should ordinarily be performed rarely due to being cached effectively
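
As a rough sketch of what explicit cache alerting could look like, the snippet below tracks the hit ratio over a sliding window of recent lookups and flags when it drops below a threshold. The class `HitRateMonitor`, its parameters, and the threshold values are assumptions made for illustration; the report does not describe the monitoring actually put in place.

```python
from collections import deque


class HitRateMonitor:
    """Tracks the cache hit ratio over the most recent `window` lookups."""

    def __init__(self, window=1000, min_hit_ratio=0.9, min_samples=100):
        self.samples = deque(maxlen=window)  # old samples fall off automatically
        self.min_hit_ratio = min_hit_ratio
        self.min_samples = min_samples

    def record(self, hit):
        """Record one lookup: True for a cache hit, False for a miss."""
        self.samples.append(bool(hit))

    def hit_ratio(self):
        """Return the hit ratio in [0, 1], or None if nothing was recorded."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

    def should_alert(self):
        """Alert only once enough samples exist to make the ratio meaningful."""
        if len(self.samples) < self.min_samples:
            return False
        return self.hit_ratio() < self.min_hit_ratio


monitor = HitRateMonitor(window=200, min_hit_ratio=0.9, min_samples=100)
for _ in range(200):
    monitor.record(True)   # healthy period: all hits
print(monitor.should_alert())  # False
for _ in range(100):
    monitor.record(False)  # entries start getting evicted: misses pile up
print(monitor.should_alert())  # True (window now holds 100 hits, 100 misses)
```

A sliding window is used here so that a long healthy history cannot mask a sudden drop in hit ratio; a production setup would more likely export hit and miss counters to an existing metrics system and alert there.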

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted May 01, 2023 - 08:36 PDT

Resolved
After further monitoring, this incident is now considered resolved. The Box Content API and Classification services have been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.
Posted Apr 17, 2023 - 12:59 PDT
Monitoring
Our team has taken steps to remediate the issues with the Content API and Classification services. We are continuing to monitor for any additional impact.
Posted Apr 17, 2023 - 10:57 PDT
Investigating
We are investigating an ongoing issue affecting the Content API and manual Classification changes. Users may experience increased latency with the Content API, as well as intermittent errors when manually editing Classifications. We will provide more information as soon as it is available.
Posted Apr 17, 2023 - 10:10 PDT
Update
We are continuing to monitor for any additional impact to other services. Users may still see Content API latencies. We will provide an update once Content API latencies are remediated.
Posted Apr 17, 2023 - 08:35 PDT
Monitoring
Our team has taken steps to remediate this issue and is seeing improvement in the Box Sign service. We haven't observed any failed tests on Box Sign since 1:14 AM PT. We are continuing to monitor for any additional impact.
Posted Apr 17, 2023 - 02:46 PDT
Update
Our team is still investigating this issue. However, we have observed improvements. Uploads / Downloads and Box Notes have recovered. We're still monitoring Box Sign and the Content API.
Posted Apr 17, 2023 - 02:03 PDT
Update
We are continuing to investigate this issue.
Posted Apr 16, 2023 - 22:00 PDT
Update
We are continuing to investigate this issue.
Posted Apr 16, 2023 - 20:23 PDT
Update
We are continuing to investigate this issue.
Posted Apr 16, 2023 - 19:13 PDT
Investigating
We are investigating an ongoing issue affecting uploads, downloads, public API, Box Sign and Box Notes. We will provide more information as soon as it is available. Users may be experiencing increased latency.
Posted Apr 16, 2023 - 18:18 PDT
This incident affected: Box Platform / API (Content API).