Update: Details of “Important loss of connectivity in VPC on fr-par region” incident & response
On September 1st, 2024, at 08:27 UTC, Scaleway experienced a major VPC incident affecting the fr-par-1 and fr-par-2 availability zones. The impact ranged from general networking instability to DNS and DHCP failures. It was resolved by 12:46 UTC the same day.
This is a story of snowball effects, and of how BGP (Border Gateway Protocol), a protocol built for reliability, can let you down after all.
A primer on Scaleway's infrastructure
Our infrastructure is built on a set of standard technologies, using off-the-shelf software. However, we don't use any turnkey solutions. This includes our virtual networking stack.
We already communicated about it a while back: the core of a Scaleway AZ (Availability Zone) is an IP fabric, built using VXLAN and BGP-EVPN. VPC is simply another overlay network atop the same IP fabric underlay, also leveraging VXLAN and BGP-EVPN. There is, however, one key difference: VPC runs BGP all the way down to the hypervisor.
Basic VPC architecture. Note that the IP fabric itself has its own set of route-reflectors, and runs BGP between the leaves, the spines and the backbone. It’s BGP all the way down.
BGP is used to propagate layer 2 reachability information from one hypervisor to the other. When you send a packet from an instance to another one, it is thanks to BGP that the hypervisor knows where to send it.
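To make this more concrete, here is a minimal sketch, in Python and purely illustrative (this is not our actual implementation, and every name is made up), of the state a hypervisor derives from those announcements: a table mapping a (VNI, MAC address) pair to the VTEP, i.e. the remote hypervisor, that announced it.

```python
# Illustrative model of the per-hypervisor forwarding state built from
# BGP-EVPN announcements. Not our actual code; names are made up.

# (VNI, MAC address) -> IP of the VTEP (remote hypervisor) hosting that MAC
forwarding_table: dict[tuple[int, str], str] = {}

def on_bgp_update(vni: int, mac: str, vtep_ip: str) -> None:
    """A route announcement: `mac` in VXLAN network `vni` lives behind `vtep_ip`."""
    forwarding_table[(vni, mac)] = vtep_ip

def on_bgp_withdraw(vni: int, mac: str) -> None:
    """A route withdrawal: the destination is no longer reachable."""
    forwarding_table.pop((vni, mac), None)

def next_hop(vni: int, dst_mac: str) -> str | None:
    """To which VTEP should a packet for `dst_mac` be VXLAN-encapsulated and sent?"""
    return forwarding_table.get((vni, dst_mac))
```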
In the above schematic, something already stands out: VPC's route-reflectors have an order of magnitude more BGP peers than those of the IP fabric, as a single leaf pair can hold tens of hypervisors. This obviously leads to scaling and reliability issues on route-reflectors that would need to handle thousands of BGP sessions in large zones:
- Having that much responsibility makes the route-reflector highly critical. Too critical. Its blast radius in case of failure is too high.
- Performance problems can quickly arise, as a single route needs to be propagated to all the other peers. Basically, every byte in can lead to kilobytes out.
The latter was the main technical hurdle that led us to shard our hypervisors into groups, called "fabrics", to dedicate a route-reflector to each fabric, and to make those route-reflectors communicate with each other. This way, each one only handles half of the hypervisors, while still sharing routes.
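A back-of-the-envelope sketch of that fan-out (with made-up numbers) shows why sharding helps: a route-reflector has to re-advertise every update it receives to all of its other clients, so the outgoing volume grows with the number of peers each reflector carries.

```python
# Rough fan-out arithmetic (illustrative numbers only).
# A route-reflector re-advertises each update it learns to all of its other clients.

def bytes_out(update_size: int, clients: int) -> int:
    """Downstream bytes generated by one incoming update of `update_size` bytes."""
    return update_size * (clients - 1)

# One route-reflector for 1000 hypervisors: 100 bytes in, ~100 kB out.
unsharded = bytes_out(update_size=100, clients=1000)

# Two fabrics of 500 hypervisors each: per reflector, roughly half the fan-out,
# plus one copy sent to the other shard's route-reflector.
sharded = bytes_out(update_size=100, clients=500) + 100

print(unsharded, sharded)  # 99900 vs 50000
```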
Each shard can (and does) scale to several hundred hypervisors. Please note that from this point onwards, we won't talk about the IP fabric anymore.
And then, even two pairs were not enough. We added a third pair. We added a fourth pair, extending the mesh between the route-reflector clusters. Ultimately, when fr-par-1 grew its 6th route-reflector cluster, we had to build a Clos topology out of our route-reflectors in order to avoid further amplification. We now have "spine" and "leaf" route-reflectors.
fr-par-1 is now pretty large, and we have a Clos topology at every level. As before, each shard can host hundreds of hypervisors, the biggest reaching 800 at its peak.
We'll come back to the hand-waved "rest of the region" part of the schematic.
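The need for that extra layer of route-reflectors can be summed up with a small counting exercise. Below is a purely illustrative sketch (the numbers are made up): a full mesh between route-reflector clusters grows quadratically in inter-cluster sessions, and every route learned by one cluster gets re-sent towards every other cluster, whereas a leaf/spine layout keeps both the session count and the fan-out linear in the number of clusters.

```python
# Why a full mesh of route-reflector clusters stops scaling (illustrative only).

def full_mesh_sessions(clusters: int) -> int:
    """Inter-cluster sessions in a full mesh: grows quadratically."""
    return clusters * (clusters - 1) // 2

def leaf_spine_sessions(clusters: int, spines: int) -> int:
    """Each leaf cluster only peers with the spine layer: grows linearly."""
    return clusters * spines

for clusters in (4, 6, 10):
    print(clusters, full_mesh_sessions(clusters), leaf_spine_sessions(clusters, spines=2))
# 4  ->  6 vs  8
# 6  -> 15 vs 12
# 10 -> 45 vs 20
```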
A detour on BGP software stacks
For a while, we've used multiple BGP implementations, both open-source and proprietary. We started out VPC with open-source BGP software on the hypervisors, and proprietary BGP virtual appliances (basically a backbone router's stack in a VM) for the route-reflectors. This choice was driven by the robustness of that BGP implementation: it powers large swaths of the internet on the vendor's platforms, it is well known and battle-tested, and it already powers our IP fabric. Yet, we quickly found out it was not robust enough for our use case.
Enter fr-par-1. Our oldest, biggest availability zone. A few years ago, when we hit the 500-hypervisor mark there, we had issues. The zone already had a fairly large number of VPC routes, but nothing too dramatic. If it were not for the 500 BGP sessions.
We started to see flaps (1) left and right.
The vendor's BGP and TCP stacks could not keep up: peers had to wait too long for information to propagate, gave up, and reset their sessions to start from scratch.
Restart from scratch? Well, yes. Tearing down a session and starting from scratch is BGP's main (if not only) error handling mechanism. One of the cases where a BGP speaker will reset a session is when its peer does not respect the BGP protocol. And handling messages fast enough is part of the protocol: peers send periodic keepalive messages, and expect to receive them from the other side in a timely manner. Speakers also expect their peers to read their incoming message queue often enough.
In BGP terms, this is the "BGP hold timer", and its main purpose is to detect crashed peers, and to not waste time with them.
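As an illustration, here is a toy model of that mechanism, heavily simplified (the real behavior is specified in RFC 4271): a peer that stays silent past the hold time is considered dead, its session is reset, and every route learned from it is withdrawn.

```python
# Toy model of the BGP hold timer (heavily simplified, for illustration only).
import time

HOLD_TIME = 90.0                     # seconds; keepalives are typically sent every HOLD_TIME / 3
last_heard: dict[str, float] = {}    # peer -> timestamp of the last message received from it

def on_message_received(peer: str) -> None:
    """Any KEEPALIVE or UPDATE from a peer resets its hold timer."""
    last_heard[peer] = time.monotonic()

def expired_peers() -> list[str]:
    """Peers that have been silent for too long: their sessions must be torn
    down, and all routes learned from them withdrawn."""
    now = time.monotonic()
    return [peer for peer, seen in last_heard.items() if now - seen > HOLD_TIME]
```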
It turns out our route-reflector was overloaded and could not handle messages quickly enough, letting hold timers expire left and right and causing session flaps. This was our first, historic, large-scale incident on VPC.
From this day onwards, we knew the appliance could not handle the load, and we sharded it. And gave it massive resources to help it keep up. And when scaling up VPC, we kept looking out for better solutions, and got very interested in Free Range Routing, or FRR. FRR is an open-source suite of networking daemons we already use a lot at Scaleway for its integration with the Linux kernel.
We eventually made the switch to FRR for our route-reflectors in early 2024, for new deployments. At the time of the incident, however, the proprietary software was still running almost everywhere, as we planned a progressive migration leveraging the natural lifecycle of hypervisors. Only a few small clusters were running FRR, as well as our spine route-reflectors, shown in yellow in the previous schematic.
It all begins with a blip
On Sunday, September the 1st, at 08:26:02 UTC, a few sessions between spine and leaf route-reflectors flapped. While a flap is not a normal event and should be investigated, it should not have any impact on its own. And at first, these flaps had none, as they were between FRR route-reflectors.
But they generated a lot of BGP traffic, due to the withdrawals being immediately followed by updates of the whole RIB (2).
A minute later, at 08:27:41 UTC, chaos ensued. The influx of noise reached the biggest appliance-based cluster, with 800 hypervisors. Its sessions with spine route-reflectors went down. In less than 5 seconds, we had lost our three appliance-based clusters, powering 90% of the zone.
From this moment onwards, fr-par-1’s private networking became highly unstable. Route-reflectors kept flapping indefinitely, never being able to fully restore a session before being shut down by their peers (remember that the appliances are very slow at de-queuing updates?).
State of the various sessions between leaf and spine route-reflectors. In reality, there were far more flaps than indicated here due to scraping resolution.
Message rate per session (left) and overall per spine route-reflector (right).
During the incident, we went from our usual rate of a handful of routes per second per peer to several thousand. The spikes are when route-reflectors try to restore sessions and receive the initial state. Route-reflectors across the board were handling three orders of magnitude more than their usual load.
Clients noticed, Scaleway products relying on VPC started to notice, and the situation snowballed across the whole ecosystem.
Identifying the source of the incident took a while, due to a combination of factors. A large-scale rework of our alerting and dashboards is currently in progress (to allow faster response times and better communication in the long run), and some confusion about which dashboards to use slowed down the root cause analysis. Turns out, it was BGP. It had been BGP all along.
At 12:37 UTC, we shut down the sessions between the leaf route-reflectors and all spines except one, hoping it would reduce the load enough for this single session to come back online and restore stability, albeit without redundancy.
It was not enough, due to our old friend, the hold timer. Even with a single session to load, the route-reflector could not do it fast enough, and was shut down by the remaining spine. After increasing the timers a lot, the session finally came back up, and stayed up for more than two minutes.
Service was restored at 12:45 UTC.
In the following half hour, sessions were brought back up one by one, fully restoring redundancy. This can be seen in the above graph (left) with the last distinct spikes.
Snowballing to fr-par-2
Now, it is time to talk about the impact on the rest of the region: the part we skipped earlier on.
Regional VPC is also built atop BGP. The only difference is that both BGP and customer traffic go through a set of gateway devices at the edge of the zone, mainly to reduce route cardinality between zones. Those gateways peer with the route-reflectors over BGP, and propagate routes between zones through a set of route-servers.
BGP message rate on the route-servers, per session (left), per route-server (right).
In fr-par-2, the route-reflectors are also proprietary appliances, and they also have a lot of peers. You guessed it: those BGP sessions flapped too.
Due to those flaps, regional VPC was broken for fr-par-1 and fr-par-2. VPC Managed Services such as DHCP, DNS, and others were also spotty at best, as they are hosted on those gateway devices (3). The net effect: hypervisors kept losing and regaining the routes to those services.
What did we do?
In the days following the incident, we launched a worldwide campaign to replace all of the appliance route-reflectors with FRR route-reflectors. We know from benchmarking that FRR is much better at handling our kind of load, and from experience that it is much better at signaling to its peers that, yes, it is alive, please don't cut the session.
VPC no longer has any appliance-based route-reflectors in production.
BGP timers were relaxed to be more in line with our usage.
We now have much stricter alerting and monitoring of our BGP sessions, for much quicker detection.
Our dashboards gained some level of indexing to better navigate this deep infrastructure.
And finally, we had a wild idea: getting rid of BGP. But first...
Reflections on BGP
When do features become anti-features? When do safety mechanisms become harmful? After all, this is not the only distributed-systems horror story in which circuit breakers, protections, and the like prevented a system from getting back online, in which the additional noise caused by instability overloaded the system and hindered recovery. In hindsight, this story is mundane and boring.
Both data-plane instability experienced by the customers and control-plane instability experienced by the provider (us) were caused by the same BGP mechanism: when a BGP session is down, all routes received from this session are considered unreachable, and are withdrawn.
- On the hypervisor, this means the route is removed and the destination cannot be routed to.
- On a route-reflector, this state (unreachable) must be propagated to all neighbors.
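A toy calculation (with made-up numbers) shows how quickly this snowballs on a route-reflector with many peers: every route learned over the lost session has to be withdrawn towards every remaining peer, and then re-advertised once the session comes back.

```python
# Toy estimate of the churn caused by a single session loss on a route-reflector
# (numbers are made up for illustration; real BGP packs several routes per message).

def churn(routes_over_session: int, other_peers: int) -> int:
    """Route announcements to emit, worst case: withdraw every affected route
    towards every other peer, then advertise them all again once the session
    is re-established and the whole RIB is re-sent."""
    withdrawals = routes_over_session * other_peers
    re_advertisements = routes_over_session * other_peers
    return withdrawals + re_advertisements

# e.g. a spine session carrying 50,000 routes on a leaf route-reflector serving
# 800 hypervisors: a single flap means tens of millions of announcements to churn through.
print(churn(routes_over_session=50_000, other_peers=800))  # 80,000,000
```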
BGP links the reachability state to the session state. And it makes sense! For what BGP was designed for, at least: peer-to-peer signaling and best-path selection over router interconnects. In that context, this is a brilliant idea, as the session goes over a single piece of wire or fiber, straight between two routers. If for whatever reason the link is down or the peer router is down, well yes, this path is not valid anymore, because it goes through the peer.
If the BGP session between routers B and D breaks, it is safe to assume that this path is broken, and we should not go through there anymore. From the point of view of router D, B is as good as dead.
This definitely makes sense for eBGP. For overlay networks, however, we are using iBGP. The session goes over multiple devices, routers and switches. Heck, the route-reflectors themselves run on hypervisors. This is aggravated by the middleman, the route-reflector (or, in some cases, several middlemen), because the peer is no longer on the path to the destination.
Suddenly, one of the core premises of BGP disappears: the session state does not reflect the reachability of the destination.
This is why all of our route-reflectors are in clusters: if one breaks, we don't lose all routes.
When using iBGP this way, withdrawing routes when a session goes down is an anti-feature that actively harms the system. A flapping session between large peers will flood the BGP instances with what is basically noise, and may lead to catastrophic failure.
Is it BGP's fault, or are we just holding it wrong?
The future
Let’s not answer that question, because it's neither. Yes, BGP could have yet another extension to better handle the situation. Yes, we could have done things differently.
But the cold, hard truth is: we outgrew BGP as the control plane for VPC. The stakes are simply too high and it is not the right tool for the job anymore.
We're launching internal R&D work to evaluate replacing BGP and completely overhauling VPC. And in pure Scaleway tradition, this will be a home-grown solution, certified Made in France. To get rid of BGP. Not because it's a bad technology, but because it may not be the right one anymore.
In the meantime, we'll keep improving the BGP infrastructure, and make sure such an incident never happens again.
Conclusion
We hope this communication was useful and helped you understand what happened. We are continuously working on improving our services and sincerely apologize for any inconvenience you may have experienced.
Footnotes
(1): A "flap" is when a connection, be it physical or logical, goes down and then comes back up quickly. Those may generate "blips" in the network, where the service is flaky, but not fully broken (well, usually).
(2): The RIB is the full routing table known to the BGP instance. It holds all the routes and paths to destinations. It is the content of the distributed database that BGP is.
(3): By design, those gateway devices have full access to all VPCs and Private Networks, and are thus the perfect place to host such services.
Scaleway provides real-time status updates for all of its services here. Feel free to contact us via the console with any questions moving forward. Thank you!