How to build a Hybrid Cloud Infrastructure with Elastic Metal
How to leverage the benefits of both the public and the private cloud? How to build an IaaS infrastructure on top of dedicated servers that fits your cloud ecosystem and processes?
The untold story of Elastic Metal began in January 2018 at Scaleway, when our team of six devoted individuals embarked on an ambitious mission: to position Scaleway at the forefront of the cloud industry.
The Dedibox stack, which had served faithfully for 13 years, was starting to show its limits compared to the advanced solutions available on the market. We also realized that it was isolated from the innovative cloud products that Scaleway was actively developing. Driven by the desire to innovate, we set out to create Elastic Metal, a solution that would redefine the boundaries of Bare Metal hosting.
Elastic Metal was our answer: a way to overcome these obstacles and seize new opportunities in the ever-evolving Bare Metal landscape. We knew our competitors were already on their way, so we had to move quickly.
We dedicated several days to exploring various solutions, ultimately leading us to two promising options.
The first solution prioritized speedy delivery. To achieve this, we proposed creating a new service within Scaleway's infrastructure, responsible for managing all the dependencies of our information system (IS). This encompassed trust management, quotas, CLI integration, and Terraform implementation. Additionally, we planned to connect it to the Dedibox backend to handle server-related tasks such as ordering, maintenance, and installation. However, we were mindful of the drawbacks of this approach: the absence of worldwide services, limited isolation by availability zone (AZ), lengthy installation times, a network infrastructure distinct from Scaleway's, no cloud-init support, fixed features, and no planned future enhancements.
The second option was more comprehensive and ambitious. It involved integrating the service from the first solution into Scaleway's ecosystem and developing an entire backend system for server management. While this approach offered greater control and customization possibilities, it entailed a significant investment in development resources to accommodate the new network topology, implement a fresh installation stack, and handle other associated challenges.
In view of the workload and the delays required by the second solution, we started with the first solution, with the aim of gradually migrating to the second.
Next, we had to choose the most suitable language. We had no particular technical constraints, except that we had to connect to the Scaleway gateway, which transforms calls from all external sources (CLI, API, and Terraform) into internal gRPC calls. After an initial study of whether this was feasible in PHP / Symfony, we realized that, as things stood, implementing the gRPC server would be very complicated. One pair of developers therefore went with Go, while a second pair, working in PHP / Symfony, took care of Dedibox and helped by creating the few new API routes essential to the new product.
Within a span of just six months, we developed a groundbreaking service known as BMaaS (Bare Metal As A Service). This offering leverages Scaleway's robust authentication system, quota management, and Trust infrastructure. To establish a connection with the Dedibox component, we adopted a simple approach: each account within the new service was linked to a corresponding "phantom" account in the Dedibox IS. To facilitate this integration, we introduced a new admin endpoint in Dedibox. This endpoint allowed us to retrieve existing Dedibox accounts associated with Scaleway accounts, or create new ones when necessary. For the operational aspects, BMaaS interacts with the Dedibox API through conventional means. This includes executing server commands, managing installations, and enabling remote access. By utilizing the established Dedibox API, we ensured compatibility and seamless functionality while delivering an enhanced experience through BMaaS.
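The account-linking step described above amounts to a "get or create" lookup. The Go sketch below illustrates that idea only: the `PhantomDirectory` type, its method names, and the ID format are hypothetical stand-ins, since the real Dedibox admin endpoint and its payloads are not shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// PhantomDirectory is a hypothetical in-memory stand-in for the Dedibox
// admin endpoint; it maps a Scaleway account ID to its phantom Dedibox account.
type PhantomDirectory struct {
	mu       sync.Mutex
	next     int
	accounts map[string]string
}

// NewPhantomDirectory returns an empty directory.
func NewPhantomDirectory() *PhantomDirectory {
	return &PhantomDirectory{accounts: make(map[string]string)}
}

// GetOrCreate returns the phantom account linked to a Scaleway account,
// creating one on first use. The boolean reports whether it was created.
func (d *PhantomDirectory) GetOrCreate(scalewayID string) (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if id, ok := d.accounts[scalewayID]; ok {
		return id, false
	}
	d.next++
	id := fmt.Sprintf("phantom-%d", d.next) // made-up ID format
	d.accounts[scalewayID] = id
	return id, true
}

func main() {
	dir := NewPhantomDirectory()
	first, created := dir.GetOrCreate("scw-account-a")
	fmt.Println(first, created)
	again, created := dir.GetOrCreate("scw-account-a")
	fmt.Println(again, created)
}
```

Making the call idempotent means the new service can invoke it on every request without tracking whether the phantom account already exists.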
One particularity of these new accounts was that they should not be billed by Dedibox, as Scaleway assumed responsibility for them. Additionally, their quotas were intended to be unlimited, as they were managed within Scaleway's new service. To avoid any potential vulnerabilities, safeguards were implemented. These included restrictions such as accounts being unable to directly access the Dedibox console, with only API access being permitted. Consumption of BMaaS within Dedibox was monitored to ensure alignment between the two systems.
The initial version of the service progressed to a private beta phase, catering to a select group of clients. This initial release, referred to as v1, allowed users to consume Bare Metal on an hourly basis. It supported the installation of three operating systems (Ubuntu, Debian, and CentOS) with predefined partitioning, as well as booting from a rescue image. Continual iterations followed, introducing features such as remote access management, including KVM for custom installations, reverse management for both IPv4 and IPv6, and server event listing. This progress led to the opening of the product for public beta testing from July to October 2018.
Further enhancements were made to support IP Failover (IPFO), enabling the ordering of additional IPv4 addresses that could be assigned and moved between machines.
Efforts were directed towards preparing for the introduction of new offerings to create a more comprehensive catalog. Within these new offerings, two distinct configurations were specifically designed to deliver the best possible performance available in the market. These configurations can now be found in the Titanium range.
The tooling team took charge of integrating our service into Scaleway's CLI and Terraform, thereby completing the cloud integration of BMaaS.
It was time to address the next phase of our initial objective: introducing a new backend stack to manage Bare Metal servers and modernize the installation process. We began exploring various server installation solutions as proofs of concept to replace the Dedibox backend and bring additional features and speed to our installation system.
We identified around twenty solutions available in the market. After attempting to integrate OpenStack's Ironic for three months, it became evident that we were unraveling a complex web of dependencies that needed to be implemented to make it work: Neutron, Glance, Keystone, and more. Ultimately, a minimum of twelve third-party services were required to run Ironic, and extensive integration work with our networking tools and authentication systems was necessary. To avoid spending excessive time on integrating OpenStack into our infrastructure while managing our constraints, we decided to halt the POC at that point.
We shifted our focus to two new options: developing an in-house solution, or testing Equinix's Tinkerbell.
After two months of studying both options, we gathered in early July 2020 to discuss them and start integrating the one we chose. During the discussion, the team weighed two perspectives: Tinkerbell provided a quick, ready-to-use solution with an active community, while the "homegrown" solution would require development time but offer flexibility in integration and in handling specific cases related to our infrastructure.
To mitigate the risk of experiencing several months of integration challenges as we did with Ironic or encountering major roadblocks with Tinkerbell, we decided to pursue the in-house solution, internally codenamed BATMAN (Baremetal Assets meTadata Management and AutomatioN). Rumor has it that the acronym was created after the name was established.
At the beginning of the Covid-19 pandemic, we found ourselves inundated with requests from our clients, seeking assistance in preparing for the transition to remote work. Our top priority was to rapidly deploy as many machines as possible to meet the increasing demand.
Simultaneously, we were working on integrating new operating systems and implementing a video conferencing solution called BigBlueButton. An internal team was dedicated to providing two free video conferencing platforms: Jitsi on our Instances and BigBlueButton on our Bare Metal servers. (Learn more about our BigBlueButton solution powered by Scaleway)
To progress towards service autonomy and its integration into the Scaleway ecosystem, we collaborated with the Network Product team to transition the IP Failover service from Dedibox to a new Scaleway service: Flexible IP. Initially, the service encompassed the basic scope of functionalities, such as ordering an IPv4 address, assigning it to a server in a specific Availability Zone (AZ), and creating a virtual MAC address. Over time, it expanded to include IPv6 addresses, and soon users will be able to order IP blocks, with mobility extending from AZ to Region level.
As Scaleway evolved, the concept of projects for resource management was introduced. This empowers users to organize their portfolio of services and resources into distinct projects within the console. Eventually, there will be the capability to fine-tune rights and permissions between projects. In line with this vision, we ensured compatibility between the new Bare Metal service and this new feature by incorporating the notion of projectID into the service's API.
Until then, our offerings had been managed directly in the service’s database, which created a strong dependency on the Engineering team for updating offer information such as descriptions, configurations, and prices. To address this issue, we conducted a proof of concept (POC) for a product catalog solution to regain control over this aspect. Given the internal expertise and experience of our team, we tested Akeneo in early June. After several tests, we successfully integrated the concept of Availability Zones (AZ) and customized certain fields based on them directly in the tool. Within two weeks, the team completed the POC and moved it to production.
This allowed us to quickly add the new generation of offerings to the Scaleway Bare Metal product by the beginning of July 2020.
The next feature we extracted from Dedibox was the BMC/IPMI functionality. We created a new service that enables us to perform actions such as stop, start, and reboot on servers and manage credentials for remote access, such as KVM. This service controls remote access by assigning a port/IP combination and allows users to connect to their machines via KVM. We started with an open-source solution and enhanced it with specific commands related to our hardware that were not initially supported.
Two significant features were then delivered by the Billing and Network Product teams.
Monthly payment and commitment fees were ready, which allowed us to lower the server rental price by offering it on a monthly basis. Additionally, Scaleway's private network now had an API compatible with our product, provided that the servers were on the same network topology. This meant deploying new offerings on Scaleway's Fabric network to offer this feature.
To quickly launch new offerings on this new network stack, the product team initiated the Transformers project. It involved taking servers from old offers and putting them through a complete hardware check-up and upgrade, including adding more RAM and replacing outdated disks. Once the machines had passed all the checks and received a performance boost, they were racked to form the new generation of products on this new network, compatible with all other Scaleway products.
To mark the changes brought by this new range of products, we decided to remove all old offers from the catalog and keep only those that were 100% compatible with Scaleway. The product’s name was also changed from "Bare Metal" to "Elastic Metal" to highlight that the product combined the traditional dedicated server with all the modern features of our cloud ecosystem. We now had a unified network, the ability to establish a private network between a dedicated server and other Scaleway products compatible with the private network, and the option to choose monthly billing for servers. The functional coverage of the product was getting closer to that of Dedibox, and the autonomous installation stack remained a fundamental topic, with three people working on it full-time. On the Elastic Metal side, the team had expanded from one person to three, with two colleagues joining through internal mobility to strengthen the team and contribute to the product's progress.
Following the product launch, the team worked on integrating new operating systems such as Windows, ESXi, and Proxmox virtualization OS. We also added user/password management for connecting to the rescue environment, in addition to SSH key authentication.
The Network Product team expressed concerns about the reputation of our IPs, as fraudsters using our services for various types of mailing activities tended to harm the reputation of our IPs during the brief periods they accessed our services. To address this issue, we implemented a default block on the SMTP port and introduced a free option linked to a higher Trust Score to unblock the port. As a result, the IPs we provide have a better chance of remaining clean for our legitimate clients who wish to engage in mailing activities.
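The default-deny rule described above can be sketched in a few lines of Go. Everything specific here is an assumption made for illustration: the `Account` fields, the `smtpUnblockMinTrust` threshold, and the integer Trust Score scale are hypothetical, not Scaleway's actual policy values.

```go
package main

import "fmt"

// smtpUnblockMinTrust is a hypothetical threshold; the real Trust Score
// scale and cut-off are not given in this article.
const smtpUnblockMinTrust = 2

// Account holds only the fields this sketch needs (both are assumptions).
type Account struct {
	TrustScore           int
	SMTPUnblockRequested bool
}

// SMTPAllowed reports whether outbound SMTP should be open for an account.
// The port stays blocked unless the account has opted in AND has a high
// enough Trust Score.
func SMTPAllowed(a Account) bool {
	return a.SMTPUnblockRequested && a.TrustScore >= smtpUnblockMinTrust
}

func main() {
	fmt.Println(SMTPAllowed(Account{TrustScore: 3}))                             // blocked: no opt-in
	fmt.Println(SMTPAllowed(Account{TrustScore: 1, SMTPUnblockRequested: true})) // blocked: low trust
	fmt.Println(SMTPAllowed(Account{TrustScore: 3, SMTPUnblockRequested: true})) // allowed
}
```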
Some clients also expressed specific needs for their operating systems. Consequently, we worked on providing remote access via KVM for these new offers to offer more flexibility for advanced users.
As we approached the end of the first half of 2022, with the expanded team, we tackled the issue of autonomy in initializing new racks arriving at the data center. The objective was to perform a complete remote check-up of the machines: verifying the operational status of the disks, initializing the network, updating BMC and BIOS/UEFI versions, pushing all machine information into the information system for management, and more.
Simultaneously, we started integrating the v0 of the new installation stack into BMaaS for initial testing. An internal v2 API was created to test the new installation system without impacting the v1 API that was in production. The main challenge lay in successfully transitioning a machine from one installation network to another without affecting production.
By the beginning of the fourth quarter, a team reorganization took place, and the team dedicated to the new installation stack joined the BMaaS team to form a single team. This consolidation aimed to accelerate progress on the remaining blockers. With a few months’ head start in the new installation stack team, we now had a team of 5 DevOps professionals working on the entire scope of Elastic Metal and its autonomy.
As we approached the end of 2022, we continued to improve the integration of the new installation stack to ensure backward compatibility with the existing API v1. This would allow us to migrate all machines without impacting clients in the long run.
The development of the backend control for the new stack had progressed well. The team had covered the minimum viable product (MVP) defined by all the teams that would use it. This new control lists all machines in the fleet and filters them by the various statuses used by each team. For example, the Hardware team can list faulty machines for repair, and the Anti-DDoS team can list locked machines for network management. The remaining task, and the next step, was the integration into the admin console.
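The status-based listing described above can be illustrated with a small Go sketch. The `Machine` type, the status names, and the sample IDs are hypothetical; the article does not show the actual data model of the backend.

```go
package main

import "fmt"

// Status is a hypothetical machine status; the real status names are assumptions.
type Status string

const (
	StatusAvailable Status = "available"
	StatusFaulty    Status = "faulty"
	StatusLocked    Status = "locked"
)

// Machine is a minimal, made-up view of a server in the fleet.
type Machine struct {
	ID     string
	Status Status
}

// FilterByStatus returns the machines whose status is one of wanted,
// preserving the order of the fleet.
func FilterByStatus(fleet []Machine, wanted ...Status) []Machine {
	set := make(map[Status]struct{}, len(wanted))
	for _, s := range wanted {
		set[s] = struct{}{}
	}
	var out []Machine
	for _, m := range fleet {
		if _, ok := set[m.Status]; ok {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	fleet := []Machine{
		{ID: "srv-1", Status: StatusAvailable},
		{ID: "srv-2", Status: StatusFaulty},
		{ID: "srv-3", Status: StatusLocked},
		{ID: "srv-4", Status: StatusFaulty},
	}
	// What a hardware team might ask for: machines awaiting repair.
	for _, m := range FilterByStatus(fleet, StatusFaulty) {
		fmt.Println(m.ID)
	}
}
```

Accepting several statuses in one call lets each team build its own view (faulty machines, locked machines, or both) from the same listing function.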
In early December, we opened Elastic Metal in the FR-PAR-1 region with eight offerings, providing the service in multiple availability zones (AZ). Others followed quickly.
The beginning of 2023 marked the final sprint of our initial project to achieve a fully autonomous product with a new installation stack. This milestone will unlock a plethora of new features, including cloud-init, enhanced custom partitioning, lightning-fast installations, and rapid expansion into additional availability zones (AZ).
Our recent achievements include the launch of a new production offer (EM-A210R-HDD) on the updated installation stack, and we will soon iterate across the entire catalog. The migration process will be seamless and have no impact on our clients. Furthermore, our team is currently validating four new offers, set to be introduced at the beginning of Q4. Lastly, we are preparing to open two new Availability Zones (AZ), namely WAW-2 and WAW-3, starting from Q4. Exciting developments lie ahead as we continue to expand and enhance our offerings, ensuring the best possible experience for our customers.