Update: Details of fr-par VPC incident & response
On November 7 at 05:25 AM, Scaleway encountered an incident in the fr-par region that impacted customers using products in Private Networks with DHCP network configuration (i.e. Instances, Kapsule, Load Balancer, Public Gateway and Managed Database for PostgreSQL and MySQL). It was resolved by midday the same day. Here’s an important update on what happened.
The regional IPAM service (responsible for public and private IP Address Management) was unavailable for the duration of the incident. As a result, other products and services relying on IPAM were unable to reserve or retrieve IP addresses, and gradually lost their Private Network configuration. This led to a loss of communication between these resources.
Timeline of the incident
A first internal alert was triggered at 05:28 CET, when one of the three IPAM nodes stopped responding to calls.
As calls stacked up, the other IPAM nodes got stuck one after the other. By 06:24, all fr-par nodes were down, despite the Multi-AZ design intended to prevent exactly this kind of cascading failure.
These alerts were not treated with a sufficient level of urgency and did not trigger our internal incident process, which slowed down our communication and response time.
Because of the IPAM outage, resources attached to Private Networks were unable to renew their DHCP leases, and therefore progressively lost their network configuration during the hour that followed the loss of all IPAM nodes.
These cascading events eventually triggered the incident process, but late relative to the original root issue.
The IPAM healing process was launched shortly after the complete shutdown, and by 08:48 all IPAM nodes were up and running again. Most of the impact was resolved by then, and teams addressed the remaining edge cases in downstream products in the minutes following the end of the global incident.
Root cause and resolution of the issue
Early investigation found that IPAM threads were stuck because of a periodic task responsible for updating metrics. This task runs SQL queries, which ended up taking too long to complete. Because these queries were executed directly on the scheduler's threads, they blocked the scheduler and, in turn, the whole application, thereby freezing the API.
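The exact IPAM stack is not covered in this post, so the snippet below is only a minimal sketch using Python's asyncio as a stand-in for our runtime. It shows the failure mode: a periodic task that performs a blocking query directly on the event loop stalls every other coroutine, API handlers included. The names (slow_metrics_query, api_handler) are purely illustrative.

```python
import asyncio
import time

def slow_metrics_query():
    """Hypothetical stand-in for the slow SQL queries behind the metrics task."""
    time.sleep(3)  # blocks the calling thread, like a long synchronous query

async def periodic_metrics():
    while True:
        # Anti-pattern: the blocking call runs on the event loop thread itself.
        # While it runs, no other coroutine (API handlers included) can be scheduled.
        slow_metrics_query()
        await asyncio.sleep(10)

async def api_handler(i: int):
    t0 = time.monotonic()
    await asyncio.sleep(0.1)  # simulated I/O for one API request
    print(f"request {i} served after {time.monotonic() - t0:.2f}s")  # ~3s instead of ~0.1s

async def main():
    requests = [asyncio.create_task(api_handler(i)) for i in range(3)]
    asyncio.create_task(periodic_metrics())
    await asyncio.gather(*requests)

asyncio.run(main())
```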
To fix the problem, we changed the code so that this task runs in a thread pool dedicated to blocking work, without blocking the runtime.
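Continuing the illustrative asyncio sketch above (not our actual code), the equivalent of this fix is to hand the blocking query to a worker-thread pool, here via asyncio.to_thread, so the event loop keeps serving API requests while the query runs.

```python
import asyncio
import time

def slow_metrics_query():
    """Same hypothetical stand-in for the slow SQL queries."""
    time.sleep(3)

async def periodic_metrics():
    while True:
        # Fix: the blocking query runs on a worker thread from a pool reserved
        # for blocking work, so the event loop keeps scheduling coroutines.
        await asyncio.to_thread(slow_metrics_query)
        await asyncio.sleep(10)

async def api_handler(i: int):
    t0 = time.monotonic()
    await asyncio.sleep(0.1)  # simulated I/O for one API request
    print(f"request {i} served after {time.monotonic() - t0:.2f}s")  # back to ~0.1s

async def main():
    requests = [asyncio.create_task(api_handler(i)) for i in range(3)]
    asyncio.create_task(periodic_metrics())
    await asyncio.gather(*requests)

asyncio.run(main())
```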
The fix was tested and deployed by 12:00 PM the same day. It dramatically improved Postgres connection pool management and API tail latencies, and was closely monitored over the hours and days that followed. The behavior that led to the incident has not been observed since.
We are already working on longer-term actions to improve reliability and to prevent such an incident from cascading again, in particular enhanced auto-healing capabilities and improved DHCP resilience.
Conclusion
IPAM is a brand-new service, but it has quickly become a critical one because of the range of products it integrates with. This incident demonstrated that such services need finer-grained observability, so we can debug faster, and that we should proactively test the incident processes of new products.
We hope this communication was useful and helped you understand what happened. We are continuously working on improving our services and sincerely apologize for any inconvenience you may have experienced.
Scaleway provides real-time status updates for all of its services here. Feel free to contact us via the console or social media with any questions moving forward. Thank you!