We recently addressed issues affecting Search. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.
Between 9:33 pm PDT on June 3, 2024 and 1:28 pm PDT on June 4, 2024, and again between 6:38 pm PDT and 7:52 pm PDT on June 6, 2024, some users may have experienced difficulties while working in Box. During these periods, users were intermittently unable to query Search or receive Search results. This issue impacted a very small number (approximately 2%) of enterprises. During the first period, the issue occurred as a result of a recent code change that was only partially rolled out. This code change was part of our ongoing effort to improve performance and stability. During the second period, the issue occurred as a result of a latent bug in the Search Shadow service, which was being used to investigate the root cause of the initial issue. We were able to resolve the issue by fully rolling back the change in both cases. In addition, we have added alerting to detect partial rollouts of changes, added tests in pre-production environments, and addressed the latent bug in the Search Shadow service to prevent similar issues from occurring in the future.
Analysis
On June 3, 2024, the Search team released a change to how backend Search nodes are queried in order to improve performance and stability. This change had an unintended effect on query patterns: it dramatically increased load for a small number of queries, and the effect only manifested at scale. Although the number of such queries was small, a backend Search node that processed them would sporadically run out of memory, which impacted all traffic to that particular node. The Search release mechanism rolled out this change to a fraction of the fleet, but did not progress further. We initiated standard rollback procedures, but we did not detect that the change remained deployed on one of the nodes. This complicated the process of diagnosing and mitigating the impact. The issue was mitigated when the partial rollback was detected and the change was fully rolled back.
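The kind of check that surfaces an incomplete rollback can be sketched as follows. This is a minimal illustration only, not Box's actual alerting: the function name `find_version_drift`, the node names, and the version labels are all hypothetical, and the sketch assumes each node's deployed version is available as a simple mapping.

```python
from typing import Dict, List


def find_version_drift(node_versions: Dict[str, str], expected_version: str) -> List[str]:
    """Return the sorted names of nodes whose deployed version differs from the expected one."""
    return sorted(
        node for node, version in node_versions.items() if version != expected_version
    )


# Hypothetical fleet snapshot taken after a rollback to "v41".
# One node still runs the rolled-back change ("v42").
fleet = {"search-node-1": "v41", "search-node-2": "v42", "search-node-3": "v41"}

drifted = find_version_drift(fleet, expected_version="v41")
if drifted:
    # In a real system this would raise an alert rather than print.
    print("ALERT: rollback incomplete on", ", ".join(drifted))
```

Running a comparison like this after every rollout or rollback would flag a single node that still carries the change, which is the condition that went undetected here.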
On June 6, 2024, as part of the investigation to identify and verify the root cause of the initial issue, the Search team used a Shadow service in production, which is not intended to serve live traffic. However, this Shadow service contained a latent bug that allowed a small percentage of its queries to be issued against the live backend Search nodes. Because the issue that occurred on June 3, 2024 could be triggered with just a handful of queries, the live serving nodes were inadvertently impacted by this Shadow service. The Shadow service change was rolled back to mitigate the impact.
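One way to keep shadow queries away from live nodes is an explicit allowlist check before each query is sent. The sketch below is an assumption for illustration and does not describe how the Shadow service or its fix actually works: `check_shadow_target`, `LiveBackendAccessError`, and the host names are hypothetical.

```python
from typing import FrozenSet


class LiveBackendAccessError(RuntimeError):
    """Raised when a shadow query is about to be sent to a non-shadow backend."""


def check_shadow_target(target: str, shadow_backends: FrozenSet[str]) -> str:
    """Return the target unchanged if it is a known shadow backend, otherwise raise."""
    if target not in shadow_backends:
        raise LiveBackendAccessError(
            f"refusing to send shadow query to non-shadow backend {target!r}"
        )
    return target


# Hypothetical set of backends reserved for shadow traffic.
SHADOW_BACKENDS = frozenset({"shadow-search-1", "shadow-search-2"})

# A shadow backend passes the check.
check_shadow_target("shadow-search-1", SHADOW_BACKENDS)

# A live backend is rejected before any query is issued.
try:
    check_shadow_target("live-search-7", SHADOW_BACKENDS)
    rejected = False
except LiveBackendAccessError:
    rejected = True
```

Failing closed in this way means a routing bug results in a rejected shadow query instead of additional load on a live node.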
Corrective Actions
The following corrective actions have been completed or are planned:
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.
Sincerely,
The Box Team