Update: Scaleway Object Storage incident across September & October 2024

Incident Overview

Between September 24 and October 23, 2024, Scaleway Object Storage encountered a period of increased instability in the FR-PAR region following architectural upgrades aimed at enhancing performance and scalability. During this time, some customers experienced elevated error rates and increased upload latency.

Regions Impacted: FR-PAR
Duration: 7 days until mitigation, 1 month to return to nominal performance
Primary Impact: Instability and increased latency on S3 uploads, affecting certain users.

Summary of Events

In September 2024, we initiated a series of infrastructure improvements for Scaleway Object Storage, including the migration of PUT methods to our new Object Storage gateway and the deployment of an upgraded load-balancing architecture in the FR-PAR region. These changes were designed to reduce latency and improve scalability, as successfully observed in prior rollouts in other regions (NL-AMS and PL-WAW).
However, after scaling up the deployment in the FR-PAR region on September 23, we observed an unexpected increase in 503 errors, indicating instability. Initial analysis showed that the FR-PAR region's higher load conditions made it particularly susceptible to unforeseen issues, despite thorough monitoring during earlier deployments. The migration could not be reverted due to the architectural complexity of the update, leading to a longer mitigation delay than usual.

Incident Timeline

  • September 18, 2024: Migration of PUT methods to the new Object Storage Gateway completed in FR-PAR
  • September 23, 2024: New load balancer architecture deployed in FR-PAR to handle increased request volumes
  • September 25, 2024: Initial incident opened following increased 503 errors. A patch was deployed but did not fully resolve the issue
  • September 28, 2024: Second incident opened. Another patch deployed but rolled back due to unintended side effects
  • September 30, 2024: Final mitigation patch deployed, temporarily stabilizing the service but causing a slight increase in latency
  • October 4, 2024: Partial hardware mitigation implemented in FR-PAR, yielding significant performance improvements
  • October 7-8, 2024: Additional upgrades were made to the new load balancer servers (upgrading from 64GB to 512GB of RAM) to resolve memory-related issues
  • October 23, 2024: Full deployment of the long-term fix across all impacted regions, restoring performance to optimal levels.

Root Cause Analysis

  1. Increased Load on FR-PAR: The unique conditions in the FR-PAR region, particularly higher request loads, revealed an unexpected sensitivity in our infrastructure that was not observed during earlier regional deployments
  2. Memory Limitations: New load balancer servers in FR-PAR were initially provisioned with 64GB of RAM, which proved insufficient under the suddenly higher traffic, leading to memory exhaustion and early termination of processes
  3. Connection Management: Issues with HTTP Keep-Alive timeout settings between our new Gateway and load balancers led to inefficient handling of some client requests, exacerbating latency issues (see the sketch after this list)
  4. Patch and Rollback Challenges: Although multiple patches were quickly developed, early solutions had to be rolled back due to unintended side effects. In addition, no rollback was possible for the initial architectural upgrades, which prolonged the resolution.
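
To make the connection-management point concrete: the usual rule of thumb with chained keep-alive connections is that the side reusing an idle connection should give up on it before the other side closes it; otherwise a request can race against a server-side close and fail sporadically. The toy check below illustrates that rule only; the hop names and timeout values are hypothetical and do not reflect Scaleway's actual configuration.

```python
# Toy illustration of the keep-alive rule of thumb (hypothetical values,
# not Scaleway's actual settings).
hops = {
    # hop: (idle timeout of the side reusing connections,
    #       keep-alive timeout of the side accepting them), in seconds
    "load balancer -> gateway": (50, 60),
    "gateway -> storage backend": (25, 30),
}

for hop, (reuser_idle, acceptor_keepalive) in hops.items():
    # The reusing side must stop using an idle connection *before* the
    # accepting side closes it; otherwise a request may be sent on a
    # connection that is being torn down, surfacing as sporadic errors.
    safe = reuser_idle < acceptor_keepalive
    print(f"{hop}: {'safe' if safe else 'at risk of racing a server-side close'}")
```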

Impact on Customers

During this period, customers in the FR-PAR region may have observed:

  • Elevated 503 errors and occasional request failures, particularly during peak hours
  • Increased latency on object uploads, with temporary performance degradation.
    Customers were advised to implement retries on failed requests to mitigate the impact while further optimizations were rolled out; a minimal retry sketch follows below.
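
As a concrete starting point, here is a minimal retry sketch using the boto3 SDK against Scaleway's S3-compatible FR-PAR endpoint. The bucket, key, and retry counts are illustrative, and credentials are assumed to be configured in the environment; it is a sketch, not a prescribed integration.

```python
# Minimal client-side retry sketch for uploads (illustrative values;
# assumes credentials are configured via environment or shared config).
import time

import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.fr-par.scw.cloud",  # Scaleway S3-compatible endpoint, FR-PAR
    region_name="fr-par",
    # "standard" retry mode already retries 5xx responses with backoff inside the SDK
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
)

def upload_with_retries(path, bucket, key, attempts=3):
    """Upload a file, retrying with exponential backoff if the SDK-level retries give up."""
    for attempt in range(1, attempts + 1):
        try:
            s3.upload_file(path, bucket, key)
            return
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(2 ** attempt)  # back off: 2s, 4s, ...

# Example (hypothetical names): upload_with_retries("backup.tar.gz", "my-bucket", "backups/backup.tar.gz")
```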

Resolution and Improvements

The resolution involved:

  • Memory Upgrades: The new load balancer servers in FR-PAR were upgraded from 64GB to 512GB of RAM, significantly improving stability under high load
  • Enhanced Connection Management: HTTP Keep-Alive settings were fine-tuned between the Object Storage gateway and load balancers, which improved response times and connection stability
  • Improved Fault Tolerance: A new upload mechanism was developed, enhancing the fault tolerance of PUT operations, particularly in handling intermittent errors.
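
The new upload mechanism itself is internal to the platform, but the same fault-tolerance idea can be sketched on the client side: split a PUT into a multipart upload and retry individual parts on intermittent errors. The snippet below is such an illustration, not Scaleway's internal mechanism; it reuses the `s3` client from the earlier sketch, and names and sizes are illustrative.

```python
# Client-side illustration of fault-tolerant PUTs (not Scaleway's internal
# mechanism): multipart upload with per-part retries, reusing the `s3`
# client from the retry sketch above. Names and sizes are illustrative.
import time

def fault_tolerant_put(s3, path, bucket, key, part_size=8 * 1024 * 1024, attempts=3):
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu["UploadId"]
    parts = []
    try:
        with open(path, "rb") as f:
            part_number = 1
            while True:
                chunk = f.read(part_size)
                if not chunk:
                    break
                for attempt in range(1, attempts + 1):
                    try:
                        resp = s3.upload_part(
                            Bucket=bucket, Key=key, UploadId=upload_id,
                            PartNumber=part_number, Body=chunk,
                        )
                        break  # this part made it; move on to the next one
                    except Exception:
                        if attempt == attempts:
                            raise
                        time.sleep(2 ** attempt)  # retry only the failed part
                parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
                part_number += 1
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so incomplete parts do not linger in the bucket
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```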

These improvements culminated in a full resolution of the incident on October 23, 2024. A performance gain was confirmed in FR-PAR compared with the period before the architectural upgrades that had triggered the incident, and customer feedback quickly confirmed satisfaction with the overall optimization of the service.

Customer Support and SLAs

Despite this incident, we maintained overall SLA compliance for September (99.93% uptime against a 99.0% SLA target for single-zone and 99.90% for multi-AZ configurations). Overall SLA compliance for October was likewise unaffected by the incident.
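
For context, the arithmetic below converts those percentages into minutes of downtime over September's 30 days; it is a back-of-the-envelope calculation, not an official SLA computation.

```python
# Back-of-the-envelope: what each uptime percentage means in minutes over a 30-day month.
minutes_in_month = 30 * 24 * 60  # 43,200 minutes in September

for label, uptime in [
    ("99.0% SLA target (single-zone)", 0.9900),
    ("99.90% SLA target (multi-AZ)", 0.9990),
    ("99.93% measured uptime", 0.9993),
]:
    downtime_minutes = minutes_in_month * (1 - uptime)
    print(f"{label}: {downtime_minutes:.0f} minutes of downtime")  # ~432, ~43, ~30
```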

Next Steps and Continuous Improvement

This incident has highlighted areas where we can enhance both our infrastructure and our processes. As part of our commitment to continuous improvement, we are:

  • Strengthening our monitoring and alerting to detect similar issues earlier in the deployment cycle
  • Implementing a more robust change-management process to improve rollback options for complex architectural upgrades
  • Exploring advanced deployment methods, including blue-green deployments
  • Improving external communication ahead of production deployments with potential impact (maintenance windows).

We remain dedicated to delivering reliable and performant Object Storage services to all our customers and thank you for your understanding as we continue to make improvements based on the lessons learned from this incident.

Scaleway provides real-time status updates for all of its services on its status page. Feel free to contact us via the console with any questions moving forward. Thank you!