Tripling the lifespan of servers: why we retrofitted 14,000 servers

Over the last few decades, repairing electronic devices — something that used to be an absolute standard — has become pretty rare and often impossible. These days, we even have to politically campaign for a “right to repair” items that we own.

So when electronics stop working, throwing them out often seems easier, creating mountains of trash, much of it toxic. That goes for your old MacBook and phone, but it also goes for much bigger machines: the servers that run the cloud.

The carbon footprint of new servers is huge

Server manufacturing represents 15–30% of each machine’s carbon impact. Reusing existing machines, rather than buying new ones, reduces e-waste — the fastest-growing waste category today — and lowers greenhouse gas emissions. So when we noticed that a bunch of our old servers had a high failure rate but were otherwise performing well, we decided not to throw them out but to retrofit them — all 14,000 of them.

We adapted and optimized our workshop at our DC5 data center to handle this new project and extend our hardware lifespan. Today, our servers are used for up to 10 years — versus the industry average of three to four years — and nearly 80% of components are recycled.

A new life for old servers

We deal with thousands of machines on a daily basis, and one issue kept popping up in our metrics: our older servers were performing well, but they had the highest failure rate among our machines.

Our investigations showed that the issue came from the RAID controllers: each one has a battery, which increases the failure rate. Because the failures were battery-related, simply replacing the controllers would not have delivered long-term, reliable performance.

Most modern servers are not equipped with hardware RAID cards, so removing the physical RAID controllers should not significantly impact performance. These days, RAID arrays are mainly software RAIDs created with mdadm, which are more reliable in the long term.
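As a rough illustration, here is a minimal Python sketch of what replacing a hardware controller with an mdadm software RAID can look like. The device names, array path, and wrapper function are hypothetical examples, not our actual provisioning tooling.

```python
# Minimal sketch: building a software RAID 1 mirror with mdadm instead of a
# hardware RAID card. /dev/md0, /dev/sdb and /dev/sdc are placeholder names.
import subprocess
from pathlib import Path


def create_soft_raid1(array: str, disks: list[str]) -> None:
    """Create a RAID 1 array with mdadm and print the resulting md state."""
    subprocess.run(
        ["mdadm", "--create", array, "--run",  # --run skips the confirmation prompt
         "--level=1", f"--raid-devices={len(disks)}", *disks],
        check=True,
    )
    # /proc/mdstat exposes the array state and the initial resync progress.
    print(Path("/proc/mdstat").read_text())


if __name__ == "__main__":
    create_soft_raid1("/dev/md0", ["/dev/sdb", "/dev/sdc"])
```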

Our decision to remove the RAID controllers led to a large-scale retrofitting project: after ensuring that the servers were still compatible with recent technologies and would fit clients’ needs, we decided to retrofit them.

All of them.

Our goal? To achieve a high level of reliability and performance on these servers through a three-step qualification, testing, and validation process.

We began by setting performance objectives for the finished product and taking a more detailed inventory of the underperforming servers. We grouped the servers by a specific set of criteria, including their physical location (which data center they were in), the chassis, the CPU, the catalog they were currently sold in, the catalog they could be sold in post-retrofit, etc.
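For illustration, that grouping step boils down to bucketing an inventory export by a few keys. The field names below are hypothetical and do not reflect our actual inventory schema.

```python
# Illustrative sketch of grouping servers by shared criteria so that each
# group can get its own retrofit procedure. Field names are placeholders.
from collections import defaultdict
from typing import NamedTuple


class Server(NamedTuple):
    serial: str
    datacenter: str
    chassis: str
    cpu: str
    current_catalog: str
    target_catalog: str


def group_servers(inventory: list[Server]) -> dict[tuple, list[Server]]:
    """Bucket servers by (datacenter, chassis, CPU, post-retrofit catalog)."""
    groups: dict[tuple, list[Server]] = defaultdict(list)
    for server in inventory:
        key = (server.datacenter, server.chassis, server.cpu, server.target_catalog)
        groups[key].append(server)
    return groups
```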

Once we had set out our performance objectives and identified our groups of servers, we needed to test whether those transformations would be possible through a Proof of Concept (POC) for each server group.

Proof of Concept

We put together a checklist for our hardware engineering team to refer to as they completed the POC process so we could determine the constraints and requirements for each lot of servers. We had 24 POCs, each with its own step-by-step procedure and qualifications.

We did as many remote checks as possible, such as checking whether DIMM slots were available, which helped us understand which parts might be required for the physical checks and whether we needed new cables or a Host Bus Adapter (HBA).
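As an example of such a remote check, free DIMM slots can be counted by parsing dmidecode output gathered over SSH or a remote console. The snippet below is a simplified sketch of that idea, not our actual tooling.

```python
# Rough sketch of a remote DIMM check: count memory slots that dmidecode
# reports as empty. Assumes standard dmidecode formatting and root access.
import subprocess


def free_dimm_slots() -> int:
    """Count memory slots that `dmidecode -t memory` reports as empty."""
    output = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Empty slots appear as "Size: No Module Installed" in each Memory Device block.
    return output.count("Size: No Module Installed")


if __name__ == "__main__":
    print(f"Free DIMM slots: {free_dimm_slots()}")
```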

Once we progressed to physical checks, we completed them in situ in data centers to see how the servers performed in a production environment. One of the items on the checklist was making sure the RAID card could be physically removed, or that we could set up a "pass-through" that would bypass the RAID and allow direct access to the disks.

We then tested the write and read speeds in all SATA modes (ATA, RAID…) to ensure the performance was as good as, if not better than, before.

The next step was checking disk compatibility and performance. We needed to ensure the performance met our threshold values for SATA, AHCI, and software RAID modes.
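A simplified sketch of that kind of threshold check could look like the following: time a direct-I/O sequential read with dd and compare the throughput against a per-mode minimum. The device path, read size, and threshold values are hypothetical.

```python
# Sketch of a disk performance threshold check. The thresholds below are
# illustrative placeholders, not our real qualification values.
import subprocess
import time

MIN_MB_S = {"sata": 100, "ahci": 120, "soft_raid": 150}  # hypothetical minimums


def sequential_read_mb_s(device: str, megabytes: int = 1024) -> float:
    """Measure approximate sequential read throughput with a direct-I/O dd run."""
    start = time.monotonic()
    subprocess.run(
        ["dd", f"if={device}", "of=/dev/null", "bs=1M",
         f"count={megabytes}", "iflag=direct"],
        check=True, capture_output=True,
    )
    return megabytes / (time.monotonic() - start)


if __name__ == "__main__":
    speed = sequential_read_mb_s("/dev/md0")
    assert speed >= MIN_MB_S["soft_raid"], f"Below threshold: {speed:.0f} MB/s"
```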

We then performed a RAM upgrade and validated the performance, making sure all the supported memory frequencies (MHz) were correctly detected and functional. The same was then done for the CPUs. And finally, we validated the firmware versions of the different components (BIOS, BMC, etc.).
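To illustrate, detection checks like these can be scripted around dmidecode and ipmitool: read the DIMM speeds and firmware versions the machine actually reports, then compare them with what the lot is supposed to have. The parsing below is a simplified assumption, not our production validation suite.

```python
# Sketch of detection checks: reported DIMM speeds plus BIOS and BMC versions.
import re
import subprocess


def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def detected_dimm_speeds() -> list[str]:
    """Return the 'Speed:' values dmidecode reports for populated slots."""
    out = run(["dmidecode", "-t", "memory"])
    speeds = re.findall(r"^\s*Speed:\s*(.+)$", out, re.MULTILINE)
    return [s for s in speeds if s != "Unknown"]  # empty slots report Unknown


def firmware_versions() -> dict[str, str]:
    """Collect the BIOS version (dmidecode) and BMC firmware revision (ipmitool)."""
    bios = run(["dmidecode", "-s", "bios-version"]).strip()
    bmc_info = run(["ipmitool", "mc", "info"])
    match = re.search(r"Firmware Revision\s*:\s*(\S+)", bmc_info)
    return {"bios": bios, "bmc": match.group(1) if match else "unknown"}


if __name__ == "__main__":
    print(detected_dimm_speeds())
    print(firmware_versions())
```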

Once all those steps were completed, we installed several different operating systems to ensure the servers were fully compatible and functional. Once they had gone through the POC checklist, the hardware engineering team would give a go or no-go for the retrofitting of each server group.

Unracking and retrofitting 14,000 servers

Our 20+ years of hardware expertise helped us diagnose thousands of servers to determine the exact point of failure on every single server we had previously isolated. We then made an extensive list of the state of each server for our technicians.

Next, we had to check that all servers were fully functional with (another) checklist, which ran through a series of tasks covering everything from whether the server booted up to its overall hardware condition. We also upgraded the firmware stack.
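As a rough sketch, such a checklist can be driven by a small runner that executes each check and reports pass or fail per item. The two checks shown below (SMART health and software RAID state) are only examples of the kind of tasks involved, not the actual production list.

```python
# Illustrative checklist runner: each entry is a named check that returns a
# boolean, and the runner prints a PASS/FAIL summary for the technicians.
import subprocess
from pathlib import Path


def smart_healthy(disk: str) -> bool:
    """smartctl -H exits 0 when the drive reports a passing health status."""
    return subprocess.run(["smartctl", "-H", disk],
                          capture_output=True).returncode == 0


def soft_raid_clean() -> bool:
    """A degraded md array shows an underscore in its [UU] status line."""
    return "_" not in Path("/proc/mdstat").read_text()


def run_checklist() -> dict[str, bool]:
    results = {
        "disk_sda_smart": smart_healthy("/dev/sda"),
        "soft_raid_clean": soft_raid_clean(),
    }
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    return results


if __name__ == "__main__":
    run_checklist()
```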

Once the data center team received the definitive list, they started to unrack the servers. Transferring 14,000 servers is a bit intense, so we hired a special team to unrack and move them at a pace of hundreds of servers per week. Special thanks go to our logistics team, who helped us move all those servers from one data center to another, which was an important step in this project.

Unracking the last server

When the servers arrived at DC5, our team of technicians performed the upgrades and reconditioned the servers based on the step-by-step procedure that the hardware engineering team had laid out during the qualification phase.

Ultimately, our teams successfully processed and retrofitted thousands of servers.

They reassembled and racked the servers, leaving just one more thing to do: deploy! Our Bare Metal team installed the tools needed for our Dedibox offers and put the servers online for client use.

Trying to get ahead of the worldwide chip shortage

Not all servers qualified for the retrofit process, but we didn't just throw them out. Servers that couldn't reach the necessary performance levels were stripped for parts that we will reuse for repairs. We harvested as many parts as possible to build up a stock of spares. After all, these servers are relatively old and will need maintenance in the future.

Maintaining servers is complicated. In some cases, manufacturers have stopped making the needed parts. Additionally, the pandemic has slowed down a lot of production, and there’s a worldwide shortage of electronic components to add to all that. It will only get more challenging to provide regular maintenance on old servers.

But the decision was still a no-brainer for us because we believe in sustainability. In the future, we will repeat this process when necessary. We will create new POCs to account for new servers and Scaleway product offers.

In the long run, these are investments that are good both for business and the environment. Resources needed to build new servers are not unlimited, and we’re proud to have developed a system to retrofit our old ones with minimal waste.

We can only hope that, in the future, servers will be designed to be more resilient by having generic parts that facilitate maintenance.
