We recently addressed issues affecting folder and collection browsing, and related APIs. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.
Between 05:22 PM PDT and 07:04 PM PDT on May 6, 2026, some users may have experienced difficulties while working in Box. During this time, users reported failures and elevated errors when loading folders and collections, intermittent service errors, and degraded AI-related functionality.
The issue was caused by runtime saturation in a backend API service that had been gradually approaching its capacity limits over several weeks. During peak traffic hours, multiple service instances simultaneously became unable to respond to internal health checks within the configured timeout. As those instances were restarted, traffic shifted to the remaining instances, increasing load on them and creating a cascading failure. We mitigated the issue by stopping the restart cycle, routing traffic to alternate paths, and increasing service capacity.
Analysis
This incident revealed several contributing causes:
- Insufficient scaling signals — autoscaling was primarily configured around CPU utilization. For this service, CPU utilization did not accurately reflect runtime saturation, so the fleet did not scale early enough in response to the actual bottleneck.
- Health check sensitivity under load — health checks were configured with timeouts that did not account for the service’s behavior under sustained runtime saturation. Once instances became slow to respond, restarts amplified the problem by shifting more traffic to the remaining instances.
- Gradual capacity erosion without alerting — the service had shown intermittent health check failures during prior peak traffic periods. Those earlier events were narrow enough to self-recover, but they were not surfaced as a clear trend requiring action.
- Retry amplification — as service instances became unavailable, upstream retry behavior added additional load to the remaining healthy instances, accelerating the cascade.
This incident highlighted opportunities for improvement in how we detect and respond to runtime saturation that is not always visible from CPU metrics alone. It also reinforced the need for stronger end-to-end observability, clearer health-check visibility, and validated mitigation paths for service instability.
Corrective Actions
Box has initiated the following corrective actions:
- Improved runtime monitoring and alerting — we added alerts for runtime saturation metrics that detect this class of failure independently of CPU utilization, providing earlier warning before customer impact occurs.
- Revised autoscaling strategy — we reconfigured autoscaling to trigger on signals that reflect actual service capacity and validate that the fleet scales correctly under realistic load conditions.
- Health check and resilience tuning — we are reviewing health check configurations and retry behavior to reduce the risk that transient runtime saturation can turn into cascading restarts.
- Capacity validation and load testing — we are establishing load testing against production-representative traffic patterns and maintaining capacity headroom until the service's efficiency improvements are validated.
- Runtime performance improvements — we identified and are implementing changes to the service framework that significantly reduce per-request overhead, providing more headroom against future saturation.
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.
Sincerely,
The Box Team