[Major] Issues with Public API and Box Notes
Incident Report for Box
Postmortem

We recently addressed issues affecting JWT logins and Box Notes. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 5:37 AM and 6:39 AM PT and between 9:37 PM and 10:39 PM PT on March 21, 2023, and between 9:12 AM and 9:39 AM PT on March 22, 2023, some users may have experienced difficulties while working in Box. During these windows, some JWT-based logins failed and Box Notes disconnected more often than usual while users tried to save or delete their Notes. The issue occurred because of a configuration change on one of our caching servers, specifically in one of our maintenance scripts. We resolved the issue by quarantining the affected host and manually replacing it with a healthy one. In addition, we have hardened our automated checks, added better controls to these maintenance scripts, and improved observability on the impacted caching services in order to prevent similar issues from occurring in the future.

Analysis

Logins at Box are rate limited in order to control the amount of traffic or requests that can be sent to our subsystems. During the time of the issue, the caching server that enables this rate limiting was shut down because of a misconfiguration in one of our maintenance scripts, which is tasked with keeping our caching infrastructure up to date with the latest software patches. The script was trying to connect to its backend using an incorrect TLS certificate, causing the script to hang and consume system memory. Since this script runs on a schedule, multiple instances of it started over time, each one hanging in the same way and consuming more and more system resources. Eventually, this left insufficient resources for the Memcached service, causing it to be terminated. Additionally, a bug in our auto-healing system prevented this caching server from being replaced with a healthy one, leaving the server in service even though it was unhealthy.

When part of the cache became unavailable, the rate limiting service terminated the requests that were supposed to be cached on that server, causing approximately 30% of JWT logins to fail during that time. This same caching server was also used in the generation of OAuth2 tokens used internally within the Box Notes service. Whenever this cache became unavailable, Box Notes would disconnect while trying to persist or delete a Note, since it could not validate the tokens. We noticed an increase of approximately 10% in the number of disconnects during the time of cache unavailability.
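To illustrate why losing a single caching server affects only a portion of logins, the following is a minimal sketch and not a description of Box's actual rate limiting service. It assumes that rate limit counters are spread across cache nodes by hashing a client identifier and that the limiter rejects a request when the node holding its counter is unreachable. The names `ShardedCounterCache`, `allow_login`, and the node names are hypothetical, and the counters never reset, which a real limiter would handle with time windows.

```python
import hashlib


class CacheUnavailable(Exception):
    """Raised when the node that owns a key is out of service."""


class ShardedCounterCache:
    """In-memory stand-in for a fleet of cache nodes (hypothetical)."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.down = set()  # names of nodes that are currently unavailable

    def node_for(self, key):
        # Assumption: keys are assigned to nodes by a stable hash.
        digest = hashlib.sha256(key.encode()).digest()
        names = sorted(self.nodes)
        return names[int.from_bytes(digest[:8], "big") % len(names)]

    def incr(self, key):
        node = self.node_for(key)
        if node in self.down:
            raise CacheUnavailable(node)
        counters = self.nodes[node]
        counters[key] = counters.get(key, 0) + 1
        return counters[key]


def allow_login(cache, client_id, limit):
    """Allow the login only if its counter can be read and is within the limit."""
    try:
        return cache.incr(client_id) <= limit
    except CacheUnavailable:
        # The counter cannot be checked, so the request is rejected.
        return False
```

Under these assumptions, only clients whose counters live on the unavailable node are rejected, while clients mapped to healthy nodes continue to log in normally.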

Corrective Actions

The following corrective actions have been completed or are planned:

  • Validate that all necessary TLS certificates are present on the entire caching fleet.
  • Fix the maintenance script so that, on a connection error, it terminates instead of hanging and holding on to memory.
  • Fix the bug in the auto-healing system that prevented the unhealthy host from being replaced.
  • Add more specific alerts in case of failures like these, so on-call engineers are better informed.
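
As an illustration of the maintenance script fix listed above, the following is a minimal sketch and not Box's actual script. It shows two common safeguards for a scheduled job: a non-blocking file lock so that a new run exits immediately if a previous run is still active, and a hard timeout on the TLS connection so that a certificate or network problem ends the run instead of leaving it hanging. The lock path, backend host, timeout value, and function names are all hypothetical.

```python
import fcntl
import socket
import ssl
import sys

LOCK_PATH = "/tmp/cache-maintenance.lock"        # hypothetical path
BACKEND_HOST = "patch-backend.example.internal"  # hypothetical host
BACKEND_PORT = 443
CONNECT_TIMEOUT_S = 10.0                         # assumed value


def acquire_single_instance_lock(path):
    """Return a locked file handle, or None if another run holds the lock."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def check_backend(host, port, timeout_s):
    """Attempt a TLS handshake with a timeout; return an error string or None."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return None
    except OSError as exc:  # includes timeouts, refusals, and ssl.SSLError
        return f"{type(exc).__name__}: {exc}"


def main():
    lock = acquire_single_instance_lock(LOCK_PATH)
    if lock is None:
        print("previous run still active; exiting", file=sys.stderr)
        return 1
    try:
        error = check_backend(BACKEND_HOST, BACKEND_PORT, CONNECT_TIMEOUT_S)
        if error is not None:
            print(f"backend connection failed: {error}", file=sys.stderr)
            return 2
        # Patching work would go here.
        return 0
    finally:
        lock.close()  # closing the handle releases the lock
```

A scheduler entry point would call `sys.exit(main())`, so a failed run ends with a non-zero exit code that alerting can pick up rather than lingering in memory.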

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted Apr 04, 2023 - 06:30 PDT

Resolved
Our team has worked on the backlog to alleviate the degradation with API Authentication and Box Notes. We are considering this issue to be resolved. If you are still experiencing any issues, please let us know at https://support.box.com.
Posted Mar 21, 2023 - 23:10 PDT
Investigating
On March 21, 2023 at approximately 9:35 PM PDT, we started investigating an ongoing issue affecting API Authentication. Users may also experience issues when trying to create or delete Box Notes. We will provide more information as soon as it is available.
Posted Mar 21, 2023 - 22:48 PDT
This incident affected: Box Notes (Web Application) and Box Platform / API (Authentication (OAuth 2.0 / JWT)).