We recently addressed issues affecting JWT logins and Box Notes. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.
On March 21, 2023, between 5:37 AM and 6:39 AM PT and again between 9:37 PM and 10:39 PM PT, and on March 22, 2023, between 9:12 AM and 9:39 AM PT, some users may have experienced difficulties while working in Box. During these windows, some JWT-based logins failed, and Box Notes disconnected more often than usual while users tried to save or delete their Notes. The issue was caused by a configuration change on one of our caching servers, specifically in one of our maintenance scripts. We resolved the issue by quarantining the affected host and manually replacing it with a healthy one. In addition, we have hardened our automated checks, added better controls to these maintenance scripts, and improved observability on the impacted caching services in order to prevent similar issues from occurring in the future.
Logins at Box are rate limited in order to control the amount of traffic or requests that can be sent to our subsystems. During the time of the issue, the caching server that enables this rate limiting was shut down because of a misconfiguration in one of our maintenance scripts, which is tasked with keeping our caching infrastructure up to date with the latest software patches. The script was trying to connect to its backend using an incorrect TLS certificate, which caused it to hang and consume system memory. Because the script runs on a schedule, new instances kept starting while earlier ones were still hung, and together they consumed more and more system resources. Eventually, too few resources were left for the Memcached service, and it was terminated. Additionally, a bug in our auto-healing system prevented this caching server from being replaced with a healthy one, so the server remained in service even though it was unhealthy.
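The pile-up of hung script instances described above is the kind of failure a single-instance guard is commonly used to prevent. The following is a minimal sketch of that general technique, not a description of the actual controls added to our maintenance scripts. The function names, the lock file path, and the use of an advisory file lock are all assumptions made for illustration, and the sketch assumes a POSIX system where Python's `fcntl` module is available.

```python
import fcntl


def try_acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if it is already held.

    Hypothetical helper: the lock file path is chosen by the caller.
    """
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting,
        # so a new scheduled run never queues up behind a hung one.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(path, task):
    """Run task() only if no other run currently holds the lock at path."""
    lock = try_acquire_lock(path)
    if lock is None:
        return "skipped"
    try:
        task()
        return "ran"
    finally:
        # Closing the file also releases the lock; unlock explicitly for clarity.
        fcntl.flock(lock, fcntl.LOCK_UN)
        lock.close()
```

With a guard like this, a scheduled run that finds the previous run still holding the lock exits right away, so at most one hung instance can hold memory at a time. It does not fix the hang itself; a connection timeout on the backend call would be a complementary control.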
When part of the cache became unavailable, the rate limiting service terminated the requests whose entries were supposed to be cached on that server, causing approximately 30% of JWT logins to fail during that time. The same caching server was also used in generating the OAuth2 tokens used internally within the Box Notes service. While this cache was unavailable, Box Notes could not validate those tokens and would disconnect when trying to persist or delete a Note. We noticed an increase of approximately 10% in the number of disconnects during the period of cache unavailability.
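To illustrate why a missing cache server turns into failed logins, the sketch below shows a cache-backed rate limiter that rejects a request when it cannot reach the cache. This is a simplified, hypothetical model and not our rate limiting service: the class and function names, the in-memory stand-in for the cache, the key format, and the `fail_open` switch are all assumptions made for this example.

```python
class CacheUnavailable(Exception):
    """Raised by the stand-in cache when its server cannot be reached."""


class InMemoryCache:
    """Hypothetical stand-in for a single cache server holding request counters."""

    def __init__(self):
        self.counts = {}
        self.available = True

    def incr(self, key):
        if not self.available:
            raise CacheUnavailable(key)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


def allow_request(cache, client_id, limit, fail_open=False):
    """Return True if the request may proceed under the given limit.

    If the cache is unreachable, the request is rejected unless fail_open is set.
    """
    try:
        count = cache.incr(f"login:{client_id}")
    except CacheUnavailable:
        return fail_open
    return count <= limit


cache = InMemoryCache()
print(allow_request(cache, "app-1", limit=2))  # first request, within the limit
cache.available = False
print(allow_request(cache, "app-1", limit=2))  # cache down, request rejected
```

In this model, only the requests whose counters live on the unavailable server are rejected, which is consistent with a portion of logins failing rather than all of them. Whether a limiter should reject or admit traffic when its cache is unreachable is a design trade-off between protecting downstream systems and preserving availability.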
The following corrective actions have been completed or are planned:
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.
The Box Team