We recently addressed issues affecting the Box API. We would like to take this opportunity to explain these issues in more detail and describe the steps we have taken to keep them from happening again.
Between 10:00 PM PST and 10:23 PM PST on April 24, 2024, some users may have experienced difficulties while working in Box, and some Box customers may have seen elevated error rates or increased latency on Box API requests. The issue was caused by an unexpected failure during a MySQL instance failover that was executed in response to high load on one of the database instances. We resolved the issue by manually completing the instance failover. In addition, we have implemented a new discovery mechanism that reduces the time needed to detect and propagate a MySQL topology change, to help prevent similar issues in the future.
Analysis
During the incident, a failover was executed for a MySQL leader that had high CPU usage. The failover process failed to propagate the topology change, which reduced the Box API success rate. We remediated the issue by executing the remaining failover steps manually, and service recovered once the new topology was successfully propagated to other components. The root cause of the high CPU usage was later determined to be an organic week-over-week increase in traffic, and the impacted instance was upsized to add more CPU cores.
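For readers who want a concrete picture of why an unpropagated topology change affects request success, the following is a minimal sketch in Python. It is purely illustrative and is not Box's implementation: the names Topology, Component, failover, and propagate, as well as the instance names, are hypothetical and chosen only for this example. It assumes a failover consists of a promotion step followed by a separate propagation step, and that a component can route writes correctly only once it has received the new topology.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical record of which instance is the leader."""
    leader: str
    version: int = 0


@dataclass
class Component:
    """Hypothetical service that routes writes using its last known topology."""
    name: str
    known: Topology

    def routes_correctly(self, live: Topology) -> bool:
        # A write reaches the right instance only if this component's
        # view of the leader matches the live leader.
        return self.known.leader == live.leader


def failover(current: Topology, new_leader: str) -> Topology:
    """Promotion step: produce the new authoritative topology."""
    return Topology(leader=new_leader, version=current.version + 1)


def propagate(live: Topology, components: list[Component]) -> int:
    """Propagation step: update every component holding an older version.

    Returns the number of components that were updated.
    """
    updated = 0
    for component in components:
        if component.known.version < live.version:
            component.known = live
            updated += 1
    return updated


# Illustrative run with made-up instance and component names.
old = Topology(leader="db-a")
components = [Component("api-1", old), Component("api-2", old)]

live = failover(old, "db-b")
# Promotion has happened but propagation has not: every component is stale.
stale = [c.routes_correctly(live) for c in components]

propagate(live, components)
# After propagation, every component targets the new leader.
fresh = [c.routes_correctly(live) for c in components]
```

In this sketch, the window between failover and propagate is the period in which requests would be misrouted, which is why shortening the time to detect and propagate a topology change reduces the impact of a failover.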
Corrective Actions
The following corrective actions have been completed or are planned:
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope this has provided some clarity, and we would be happy to answer any questions you may still have regarding this matter.
Sincerely,
The Box Team