We recently addressed issues affecting the Box Webapp, Public API, and Uploads & Downloads. We would like to take this opportunity to explain these issues further, along with the steps we have taken to keep them from happening in the future.
Between 4:30 AM PST and 5:05 AM PST on December 15, 2023, some users may have experienced slowness and failures when interacting with parts of the Box Webapp and public API, including Uploads and Downloads. The problem was caused by the deployment of a new DB Access Control service on our database fleet. We resolved it by temporarily disabling the problematic service. In addition, we are working to improve our testing and rollout process for services deployed to our database fleet to prevent similar issues from occurring in the future.
The DB Access Control service enables new user access controls to be applied automatically and frequently. This allows for faster development of new DB functionality that requires access control changes, resulting in faster team velocity and ultimately more stable and reliable infrastructure.
On the morning of December 13, an operator started the rollout of the DB Access Control service by deploying it to a single database pod. At that time, it was configured to execute its work every 10 minutes. Approximately 10 minutes after the deployment, we saw degradation on that pod, causing the impact seen on December 13. The operator suspected that the service frequency was the cause and remediated by configuring the service to execute its work once a day at a low-traffic time. The next steps should have been to validate the change in a way that would not result in customer impact and then deploy it again on a single pod. However, the standard process was inadvertently not followed in this case and the change was instead deployed across the fleet, leading to the additional impact seen on December 15.
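For readers who want a concrete picture of the single-pod-first process described above, the following is a minimal, purely illustrative sketch of a staged rollout gate. It is not Box's tooling: the function `staged_rollout`, the `deploy` and `is_healthy` callbacks, the pod names, and the stage sizes are all hypothetical and exist only for this example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    pods: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_sizes: Sequence[int] = (1, 5, 25),
) -> List[str]:
    """Deploy to pods in growing stages and stop widening on failure.

    All names here are hypothetical. Returns the pods that were deployed to.
    """
    deployed: List[str] = []
    remaining = list(pods)
    sizes = list(stage_sizes)
    while remaining:
        # Use the next stage size; once sizes run out, take everything left.
        size = sizes.pop(0) if sizes else len(remaining)
        stage, remaining = remaining[:size], remaining[size:]
        for pod in stage:
            deploy(pod)
            deployed.append(pod)
        # Gate: every pod in this stage must pass validation
        # before the rollout is allowed to reach more pods.
        if not all(is_healthy(pod) for pod in stage):
            break
    return deployed
```

With a gate like this, a failed validation on the first pod leaves the rest of the fleet untouched, and a change only reaches every pod after each earlier, smaller stage has passed its check.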
In addition to optimizing the DB Access Control service to be less resource-intensive, we intend to make some process changes in response to this issue. A standard process already exists for deploying services to the database fleet. However, to minimize the likelihood of similar situations occurring again, we are updating our documentation so that all operators have a comprehensive understanding of the standard process and follow it consistently.
The following corrective actions have been completed or are planned:
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.
Sincerely,
The Box Team