For more information about our Incident Response and Communications please read this support article.

We also maintain a list of Known Product Issues separate from this site here.

Maintenance on WebDAV
Scheduled Maintenance Report for Box
Postmortem

We recently addressed issues affecting JWT logins and Box Notes. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

On March 21, 2023, between 5:37 AM and 6:39 AM PT and again between 9:37 PM and 10:39 PM PT, and on March 22, 2023, between 9:12 AM and 9:39 AM PT, some users may have experienced difficulties while working in Box. During these windows, some JWT-based logins failed, and Box Notes disconnected more often while users tried to save or delete their Notes. The issue was caused by a configuration change on one of our caching servers, specifically in one of our maintenance scripts. We resolved the issue by quarantining the affected host and manually replacing it with a healthy one. In addition, we have hardened our automated checks, added better controls to these maintenance scripts, and improved observability on the impacted caching services in order to prevent similar issues from occurring in the future.

Analysis

Logins at Box are rate limited in order to control the amount of traffic or requests that can be sent to our subsystems. During the time of the issue, the caching server that enables this rate limiting was shut down because of a misconfiguration in one of our maintenance scripts, which is tasked with keeping our caching infrastructure up to date with the latest software patches. The script was trying to connect to its backend using an incorrect TLS certificate, which caused it to hang while holding on to system memory. Because this script runs on a schedule, new instances kept starting while the earlier ones were still hung, and together they consumed more and more system resources. Eventually, this left insufficient resources for the Memcached service, causing it to be terminated. Additionally, a bug in our auto-healing system prevented this caching server from being replaced with a healthy one, leaving the server in service even though it was unhealthy.
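
To illustrate the pile-up described above, here is a minimal sketch of one common guard for scheduled scripts: an exclusive, non-blocking file lock, so that a new run exits immediately if an earlier run is still in progress. This is an illustration under stated assumptions, not the actual Box maintenance script; the function name and the lock path are hypothetical, and the approach relies on `fcntl`, which is available on Unix-like systems only.

```python
import fcntl
import os
import tempfile

# Hypothetical lock location, chosen only for this illustration.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "cache_maintenance_example.lock")


def try_acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if another
    run already holds it. The lock is released when the file is closed
    or the process exits."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail at once instead of waiting,
        # so a second scheduled run cannot queue up behind a hung one.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    lock = try_acquire_single_instance_lock(LOCK_PATH)
    if lock is None:
        raise SystemExit("previous run still active; exiting")
    try:
        pass  # the maintenance work would go here
    finally:
        lock.close()
```

With a guard like this, a hung run still needs its own timeout, but it can no longer be joined by an unbounded number of later runs.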

When part of the cache became unavailable, the rate limiting service terminated the requests whose rate limiting data was supposed to be cached on that server, causing approximately 30% of JWT logins to fail during that time. This same caching server was also used in the generation of OAuth2 tokens used internally within the Box Notes service. Whenever this cache became unavailable, Box Notes would disconnect while trying to persist or delete a Note, since it could not validate the tokens. We noticed an increase of approximately 10% in the number of disconnects during the time of cache unavailability.
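
The partial nature of the failure can be sketched as follows. The example assumes that rate limiting keys are spread across several cache servers by hashing and that a request is rejected when the server holding its key is unreachable (a "fail closed" policy). The hashing scheme, the server count, and all names here are assumptions made for illustration; they are not a description of Box's actual implementation.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a cache server index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def login_allowed(key, shard_count, down_shards):
    """Fail closed: reject the request when the cache server that
    should hold this key's rate limiting data is unavailable."""
    return shard_for(key, shard_count) not in down_shards


def failure_fraction(keys, shard_count, down_shards):
    """Fraction of requests rejected because their server is down."""
    rejected = sum(
        1 for key in keys if not login_allowed(key, shard_count, down_shards)
    )
    return rejected / len(keys)


if __name__ == "__main__":
    # Hypothetical fleet of 3 servers with server 0 unavailable.
    keys = [f"client-{i}" for i in range(3000)]
    print(failure_fraction(keys, shard_count=3, down_shards={0}))
```

Under these assumptions, only the keys that hash to the unavailable server are affected, which is why losing a single cache host produces a partial rather than a total login failure.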

Corrective Actions

The following corrective actions have been completed or are planned:

  • Validate that all necessary TLS certificates are present on the entire caching fleet.
  • Fix the maintenance script so that, on a connection error, it terminates instead of hanging and holding on to memory.
  • Fix the bug in the auto-healing system that prevented the unhealthy host from being replaced.
  • Add more specific alerts in case of failures like these, so on-call engineers are better informed.
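
As a rough sketch of the second corrective action, the snippet below shows one way a script can bound a TLS connection attempt with a timeout and exit with a non-zero status instead of hanging. It uses only the Python standard library. The function names, the default timeout, and the exit codes are assumptions for illustration and do not describe the actual fix.

```python
import socket
import ssl


def backend_reachable(host, port, timeout_seconds=5.0):
    """Attempt a TLS connection and report success or failure.

    The timeout bounds both the TCP connect and the TLS handshake.
    Connection, certificate, and timeout errors are all subclasses
    of OSError, so each of them results in a clean False."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        return False


def main(host, port):
    """Return a process exit code: 1 if the backend cannot be reached."""
    if not backend_reachable(host, port):
        # Terminate promptly so a failed run releases its memory.
        return 1
    # The patching work would go here.
    return 0
```

A scheduler would call `main` and pass its return value to `sys.exit`, so a failed run shows up as a non-zero exit status that monitoring can alert on.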

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted May 17, 2023 - 09:45 PDT

Completed
Maintenance on our WebDAV services has been completed. Users should now be able to use the WebDAV service as expected.
Posted Mar 22, 2023 - 15:06 PDT
Scheduled
On Wednesday, March 22nd, Box will perform maintenance on our WebDAV services. The maintenance is expected to conclude on Thursday, March 23rd. During this window, users may not be able to use the WebDAV service. We will closely monitor the status of Box services throughout the maintenance and immediately thereafter. Updates on the maintenance, as well as any changes in status, will be shared through our status site.
Posted Mar 22, 2023 - 12:25 PDT