Update: Block Storage incident on AMS-1

On December 8th at 09:45 AM, Scaleway encountered an incident in the NL-AMS-1 Availability Zone that impacted customers using products in this AZ. It was resolved by 02:10 PM the same day. Here’s an important update on what happened.

Our Block Storage product faced an issue and, as a result, products built on top of it (e.g. Instances, Kapsule, Load Balancers, Managed Databases, …) experienced either high latency or unavailability.

Global unavailability: 2h40
Impact on the platform (latency, unavailability, etc.): 5h20

Context

Our Block Storage product is built on the software-defined storage solution Ceph, combined with our own APIs to handle all product requests.

These APIs have two main roles: managing our own infrastructure and adding an extra layer of security on top of it.

We were performing a critical security update on our NL-AMS-1 block storage cluster to strengthen it before our “freeze” period.

These updates had already been applied to several of our Ceph clusters (both preproduction and production) without any impact, which led us to perform the same update on the NL-AMS-1 cluster. This update turned out to be the trigger of the incident.

Timeline of the incident

We had planned an intervention on our Block Storage platform in NL-AMS-1 on Thursday, December 7th at 3 PM. This intervention was meant to bring our Ceph version up to date with more recent security patches.

We started by updating a first server, which we monitored for two hours without seeing any error. We then updated all the other servers, which took the whole night. We kept monitoring the cluster in the early morning and saw no issue.
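For readers unfamiliar with this kind of operation, the sketch below illustrates the canary-then-rollout pattern described above. It is an illustration only, not our actual tooling: the `rolling_update`, `wait_for_health_ok`, and `update_host` names are hypothetical, and it assumes the standard `ceph` CLI is available on the machine running it.

```python
import json
import subprocess
import time

def ceph(*args):
    """Run a ceph CLI command and return its stdout (assumes the ceph CLI is installed)."""
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

def wait_for_health_ok(timeout_s=7200, poll_s=60):
    """Poll `ceph health` until the cluster reports HEALTH_OK or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = json.loads(ceph("health", "--format", "json"))["status"]
        if status == "HEALTH_OK":
            return
        time.sleep(poll_s)
    raise TimeoutError("cluster did not return to HEALTH_OK in time")

def rolling_update(hosts, update_host, soak_s=2 * 3600):
    """Canary-style rolling update: patch one host, observe it, then roll out to the rest."""
    # `noout` prevents Ceph from rebalancing data every time a daemon restarts during the update.
    ceph("osd", "set", "noout")
    try:
        canary, rest = hosts[0], hosts[1:]
        update_host(canary)           # hypothetical callback: apply packages and restart daemons
        wait_for_health_ok()
        time.sleep(soak_s)            # observation window before touching the other hosts
        for host in rest:
            update_host(host)
            wait_for_health_ok()      # never move on while the cluster is degraded
    finally:
        ceph("osd", "unset", "noout")
```

Setting the `noout` flag for the duration of the rollout avoids unnecessary data movement while daemons are briefly down, which is why it is only unset once the last host is done.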

On Friday, December 8th at 9:40 AM, we started seeing an increase in cluster load, with minor impact on response times from our point of view. The situation appeared stable and the impact seemed to be decreasing.

At 11 AM, we were alerted to high latencies on our Block Storage and immediately declared a major incident. The public status page entry was only created at 11:36 AM, due to a delay in internal communications.

From then on, we faced several crashes on our servers. All of them were going Out Of Memory (OOM), even though the global load on the platform was the same as in the previous days.

Our experts identified the issue at 11:45 AM: tuning settings had been applied to this cluster that differed from our defaults.
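To make "tuning settings that differed from our defaults" concrete, here is a minimal sketch (again, not our actual tooling) that asks a Ceph daemon, through the standard `ceph daemon <name> config diff` admin-socket command, which options currently differ from their built-in defaults. The exact JSON layout of the output varies between Ceph releases, so the parsing is deliberately minimal:

```python
import json
import subprocess

def non_default_settings(daemon="osd.0"):
    """List the options a Ceph daemon reports as changed from its built-in defaults.

    Uses `ceph daemon <name> config diff`, which must run on the host where the
    daemon's admin socket lives. The JSON layout differs between Ceph releases;
    we only assume a top-level "diff" mapping of option name to details.
    """
    out = subprocess.run(["ceph", "daemon", daemon, "config", "diff"],
                         check=True, capture_output=True, text=True).stdout
    diff = json.loads(out).get("diff", {})
    return sorted(diff.keys())

if __name__ == "__main__":
    for option in non_default_settings("osd.0"):
        print(option)   # e.g. memory or cache tuning options set away from defaults
```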

Applying the fix took time, as it had to be rolled out progressively across all servers.

At 1:40 PM, Block Storage was back up and stable. There were some minor performance impacts due to load rebalancing while the updated settings were being applied.

All our teams (Instance, DB, K8S, etc.) then worked on bringing their services back up.

They also kept monitoring our infrastructure until the end of the day, performing actions to ensure it functioned properly.

We kept a close eye on our Block Storage infrastructure during the whole weekend to ensure there were no further issues.

Root cause and resolution of the issue

During our investigation, we quickly concluded that the issue was not caused by the update itself: the procedure had already been applied on our staging cluster and on other production AZs with no side effects.

We found a misconfiguration on this Ceph cluster that was not present in any other AZ.

The teams responsible for this operation were also not aware of these changes, which should have been made through our automated configuration tool. This topic is still under investigation and will lead to many improvements in our management processes.
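As an illustration of the kind of check this can lead to, the sketch below compares the options set in a cluster's central configuration database (`ceph config dump`) against a baseline that a configuration management tool is expected to enforce. The baseline file format and function names are hypothetical, and the JSON field names (`section`, `name`, `value`) are assumed from recent Ceph releases:

```python
import json
import subprocess

def live_config():
    """Fetch the options set in the cluster's central config database.

    `ceph config dump --format json` returns a list of entries; we only rely on
    the assumed `section`, `name` and `value` fields.
    """
    out = subprocess.run(["ceph", "config", "dump", "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return {(e["section"], e["name"]): str(e["value"]) for e in json.loads(out)}

def config_drift(baseline_path):
    """Compare the live config against a baseline (hypothetical format).

    The baseline is assumed to be a JSON object mapping "section/option" to the
    expected value, e.g. {"osd/osd_memory_target": "4294967296"}.
    """
    with open(baseline_path) as f:
        expected = {tuple(k.split("/", 1)): str(v) for k, v in json.load(f).items()}
    live = live_config()
    unexpected = {k: v for k, v in live.items() if k not in expected}
    mismatched = {k: (expected[k], live[k]) for k in expected if k in live and live[k] != expected[k]}
    missing = {k: v for k, v in expected.items() if k not in live}
    return unexpected, mismatched, missing
```

Running a check like this regularly, and alerting on any "unexpected" or "mismatched" entry, is one way to catch manual changes that bypass the automated tooling before they cause an incident.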

In addition, some hardware-related issues appeared on this generation of Block Storage during the incident, which slowed down the resolution.

Our new ‘Low Latency’ offers, based on a new generation of hardware, were not impacted during this incident and showed no downtime.

Conclusion

Block Storage is a key product of our ecosystem and must be resilient. We are working to improve its resiliency and will keep doing so, by further automating the way we manage our platforms and by keeping our infrastructure up to date, but also by improving our communication processes during incidents. This incident will help us enhance all of these.

Please also note that we have new Low Latency offers, designed on a new generation of hardware with even better resiliency. They are currently in Public Beta.

These Low Latency offers provide two performance levels (5K and 15K IOPS) and improved response times.

They are available through our brand-new API, user journey, and devtools, and are already compatible with Instances, Kapsule (on new clusters only, with a specific CSI version; more information on our Slack Community), and DBaaS (cost-optimized offers). The available AZs are limited right now, but more will come in the coming months: FR-PAR-1, FR-PAR-2, NL-AMS-1, NL-AMS-3, PL-WAW-3.
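As a quick illustration of the API-based user journey, the sketch below creates a Low Latency volume over HTTP. The endpoint path and body fields (`perf_iops`, `from_empty`) are assumptions based on the Public Beta and should be checked against the current API documentation before use:

```python
import os
import requests

# Illustrative only: the endpoint path and body fields are assumptions and may
# differ from the current Block Storage API; check the official documentation.
API_URL = "https://api.scaleway.com/block/v1alpha1/zones/{zone}/volumes"

def create_low_latency_volume(zone="nl-ams-1", name="my-volume",
                              size_bytes=20 * 10**9, iops=15000):
    """Create a Low Latency volume at one of the two performance levels (5K or 15K IOPS)."""
    resp = requests.post(
        API_URL.format(zone=zone),
        headers={"X-Auth-Token": os.environ["SCW_SECRET_KEY"]},  # standard Scaleway auth header
        json={
            "project_id": os.environ["SCW_DEFAULT_PROJECT_ID"],
            "name": name,
            "perf_iops": iops,                  # assumed field name for the performance level
            "from_empty": {"size": size_bytes}, # assumed field name for an empty volume of a given size
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```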

You can already try them and enjoy the 50% discount during the public beta (already applied, until the 1st of February).