Infrastructure as Code @Scaleway (2/3) - Internal usage

Infrastructure as Code (IaC) is a way to describe infrastructure in a language that is stored as a text file, just like code.

Scaleway manages a lot of resources. Efficient tools are essential to manage all these assets at scale. In this article, we will review the different tools we use and how our product teams are using them to deliver your cloud resources as efficiently as possible.

We want this article to be easy to follow and explicit enough so that you fully understand what happens when you boot up an instance. As a result, the list of tools presented is not exhaustive.

Note also that not all of these tools can be considered "Infrastructure as Code" per se, because some do not deploy anything. Nevertheless, they help bootstrap an environment in which Infrastructure as Code becomes possible.

As a service provider, we start with basic and unconfigured hardware: pristine servers (from different manufacturers), unconfigured routers and switches. We have to install these properly to obtain an environment that can support a higher level of abstraction. This is a classic bootstrapping problem.

Let's start with the tool we use to store the inventory of all the raw devices: Netbox.

Netbox

Netbox is an open-source web application designed to help manage and document computer networks.
It is both a Data Center Infrastructure Management (DCIM) tool and an IP Address Management (IPAM) system.
Netbox can be seen as our single source of truth for devices and network infrastructure.

Each Scaleway product team is a tenant of a set of resources, so we have an inventory of machines per team. We can also search per location, per model, per installation date, and so on.
For example, we can quickly check whether an IP address belongs to Scaleway's ranges by looking it up in Netbox.

Customer data is not in Netbox! If you have an instance at Scaleway, it will not show up in Netbox; only the hypervisor on which your instance runs will.

Netbox integrates well with other infrastructure-as-code tools: it exposes a REST API that many of them can consume. One example of such an integration is Ansible, through its Netbox dynamic inventory plugin.
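
To give a concrete idea of that integration, here is a minimal sketch of a dynamic inventory configuration for the netbox.netbox.nb_inventory plugin; the endpoint, grouping, and tenant filter are hypothetical, not our actual setup:

```yaml
# netbox_inventory.yml - minimal sketch of the Netbox dynamic inventory
# plugin for Ansible; the endpoint and tenant slug are hypothetical.
plugin: netbox.netbox.nb_inventory
api_endpoint: https://netbox.internal.example
# The API token is read from the NETBOX_TOKEN environment variable.
validate_certs: true
group_by:                 # build Ansible groups from Netbox metadata
  - device_roles
  - sites
query_filters:            # only pull devices belonging to one tenant
  - tenant: instance-team
```

Running `ansible-inventory -i netbox_inventory.yml --list` would then list the machines Netbox knows about for that tenant, grouped by role and site.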

Now that we have seen where all the physical devices are inventoried, let's see where all the software is stored: GitLab.

GitLab & GitLab CI (CI/CD)

All our internal code is stored on GitLab; therefore, GitLab is our single source of truth for code.
Once new code is merged into a given repository, a build can be triggered.

GitLab CI/CD is a tool built into GitLab that automates the software delivery pipeline using continuous methodologies.

The build runs integration tests as part of Continuous Integration (CI); these tests ensure that we do not introduce regressions into a given project.
The build also creates artifacts, which the Continuous Delivery (CD) part then pushes to the relevant environment (production, pre-production, or development).

Typically, we use this tool to build Docker images, push those images to internal registries and, more generally, deploy complete servers automatically from a single commit (see the pipeline sketch after this list). Examples of this include:

  • cleaning obsolete images out of registries,
  • deploying cronjobs,
  • deploying a specific version of a given application.
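
A minimal .gitlab-ci.yml covering those CI and CD steps could look like the sketch below; the image names, registry address, and deployment script are hypothetical:

```yaml
# .gitlab-ci.yml - minimal sketch; registry and script names are hypothetical.
stages:
  - test
  - build
  - deploy

integration-tests:              # CI: fail the pipeline on any regression
  stage: test
  image: golang:1.21
  script:
    - go test ./...

build-image:                    # build a Docker image, push it to an internal registry
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t registry.internal.example/team/app:$CI_COMMIT_SHORT_SHA .
    - docker push registry.internal.example/team/app:$CI_COMMIT_SHORT_SHA

deploy-production:              # CD: push the artifacts to the relevant environment
  stage: deploy
  environment: production
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  script:
    - ./deploy.sh production $CI_COMMIT_SHORT_SHA   # hypothetical deployment script
```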

Now that we have seen where all the software is tracked and where the CI/CD jobs are launched, let's see how a typical machine gets booted up and controlled with a classic infrastructure-as-code tool such as Saltstack.

Saltstack

Overview

Saltstack is an automation tool we use internally to provision and configure multiple devices using an event-based model.
Salt is built around the concept of states: a state describes how a typical machine should be installed and configured. States are stored on a master node, which acts as a single source of truth for provisioning and configuration.
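
For example, a state ensuring that a machine runs a properly configured ssh daemon could look like this minimal sketch (file paths and the config source are hypothetical):

```yaml
# ssh/init.sls - minimal sketch of a Salt state; the config source is hypothetical.
openssh-server:
  pkg.installed: []             # the package must be present

sshd_config:
  file.managed:                 # the configuration file comes from the master
    - name: /etc/ssh/sshd_config
    - source: salt://ssh/files/sshd_config
    - user: root
    - mode: '0600'

sshd:
  service.running:              # the daemon must be running and enabled at boot...
    - enable: True
    - watch:                    # ...and restart whenever its configuration changes
      - file: sshd_config
```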

Saltstack follows a client-server workflow: a client called a "minion" connects to a master node. The minion identifies the master by its name or IP address, while the master identifies the minion by its hostname. Communication between the master and a minion starts once the master has accepted the minion and the two have exchanged encryption keys. Minions can then be targeted in batches using criteria such as operating system, regular expressions on hostnames, architecture types, and so on.
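
That batching is typically expressed in the top file, which maps targets to states. A hedged sketch, with hypothetical state names and hostname patterns:

```yaml
# top.sls - minimal targeting sketch; state names and patterns are hypothetical.
base:
  '*':                  # every minion gets the common baseline
    - common
  'os:Debian':          # target by operating-system grain
    - match: grain
    - debian
  'hv-par*':            # target hypervisors by hostname glob
    - hypervisor
```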

One of Saltstack's core features is its dynamic, event-based workflow. The state files (aka the "states") describe the state a server should be in, and minions keep a watch on a set of values. As soon as one of these values changes, an event triggers a refresh on all the concerned minions, which regenerate their pillars based on the new value. This allows us to deploy large numbers of updates quickly on all our machines managed with salt minions.
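
On the master side, this event-driven behaviour is wired up through Salt's reactor system, which maps event tags to states to execute. A minimal hypothetical sketch:

```yaml
# /etc/salt/master.d/reactor.conf - hypothetical mapping of events to reactor states.
reactor:
  - 'salt/minion/*/start':              # when any minion (re)starts...
    - /srv/reactor/refresh_pillar.sls   # ...render this reactor state
```

The reactor state itself can then ask the emitting minion to regenerate its pillar:

```yaml
# /srv/reactor/refresh_pillar.sls - refresh the pillar of the minion that fired the event.
refresh_pillar:
  local.saltutil.refresh_pillar:
    - tgt: {{ data['id'] }}
```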

When we want to provision a new machine, we install a salt minion on it. But one could ask:

How does a machine go from pristine to connected to our Saltstack cluster?

Once a machine is racked, the MAC address of its NIC is provided to our instance team.
An image is then assigned to this MAC address. Once the machine boots up and shows up on the network, it gets an address from our DHCP server. With an IP address, the server boots over PXE into the configured image. This first installation builds the basics of the machine: it installs an operating system and the required dependencies (ssh, authentication mechanisms, certificates, ...). One of these dependencies is precisely the salt minion. The machine is ready once the minion is installed and configured to talk to the master: it can then fetch its configuration and be managed with code just like any other.
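
That last step essentially boils down to dropping a minion configuration; a minimal sketch, with a hypothetical master address and minion id:

```yaml
# /etc/salt/minion - minimal sketch; the master address and id are hypothetical.
master: salt-master.internal.example   # where to find the master
id: hv-par1-0042                       # how the master identifies this minion
```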

Our usage

Saltstack comes as open-source software with community-maintained formulas.
We wanted full ownership of the formulas used inside our infrastructure. As a result, we implemented our own formulas, adapted precisely to our needs.

Another nice feature of Saltstack is the complete log of all actions performed on the servers. This kind of feature is missing from tools such as Ansible, which aims to leave no traces on a server. With Saltstack, we have a guarantee that every action leaves an auditable trail, which is useful to understand what happened on a machine or to debug a system.

We can also send commands to a system even if the ssh daemon encounters a problem. This is particularly useful when ssh experiences configuration issues: if ssh were our only access path, we would risk locking ourselves out.

Recurring operations are also easy to implement natively, as the sketch below shows. This is particularly useful when we want to investigate and report on a particular issue.
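
As an illustration, Salt's built-in scheduler lets us declare such a recurring operation directly in configuration; the job name and interval below are hypothetical:

```yaml
# Minion schedule (e.g. in pillar or /etc/salt/minion.d/) - hypothetical job.
schedule:
  hourly_highstate:
    function: state.apply    # re-apply the configured states...
    minutes: 60              # ...every hour
```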

We now have a set of machines tracked through Netbox, the code to manage them stored on GitLab, and a salt minion installed to control them remotely. We have the basics of a running, automated IaaS.
After this point, the product teams are free to use the automation as they want. Many of them use Ansible.

Ansible & AWX

Ansible

Ansible is an automation tool used to provision and configure multiple devices.

Product teams are autonomous in their provisioning. Ansible is used by product teams when they want to provision a given server, but also by the networking teams to configure switches and other networking devices.

Many internal roles are also available to help teams set up common configuration, such as monitoring.
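
As a hedged sketch (the group, role, and variable names are hypothetical), a playbook applying such a shared monitoring role could look like this:

```yaml
# site.yml - minimal sketch; group, role, and variable names are hypothetical.
- name: Configure monitoring on all hypervisors
  hosts: hypervisors
  become: true                # escalate privileges with sudo
  roles:
    - role: monitoring
      vars:
        monitoring_endpoint: https://metrics.internal.example   # hypothetical
```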

Authentication is based on a PAM stack plugged into LDAP: when Ansible connects to a given machine or runs commands with sudo, access is managed through sssd, which queries our internal LDAP.

At Scaleway, developers can run playbooks from their laptops to deploy to production, but they can also use a job system such as AWX for recurring jobs.

AWX

AWX provides a web-based user interface, a REST API, and a task engine built on top of Ansible. It is the upstream project for Tower, a commercial derivative of AWX.

AWX is mainly used by our teams to launch recurring jobs and to deploy the workers that manage our hypervisors.
One advantage of AWX is the report available every time a job is launched; such reports are not available inside Saltstack.

Finally, once all the required tools are available, some teams use Kubernetes to have their full deployment managed with code.

Kubernetes

Teams can run directly on top of our managed Kubernetes. By doing so, they get the full scaling features provided by Kubernetes for their product.

Resources are encoded as Kubernetes objects, tracked inside a git repository, and applied as the code gets pushed to production.
As for packaging, we use Helm as a package manager to install all our services.
We write Helm charts and try to upstream the ones we use as much as possible.
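
As a hedged illustration of such a Kubernetes object (all names, the image, and the replica count are hypothetical, and in practice this would usually be templated inside a Helm chart):

```yaml
# deployment.yaml - minimal sketch; every name here is hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
  labels:
    app: example-api
spec:
  replicas: 3                  # Kubernetes keeps three instances running
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.internal.example/team/example-api:1.2.3
          ports:
            - containerPort: 8080
```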

Conclusion

As detailed, no single tool can solve all the infrastructure management challenges that a platform as large as Scaleway encounters. Each of the tools we presented solves a specific part of the problem, in a specific context.
That is why different teams use different tools: each picks the one most useful for its specific context.

As the feature set of these tools changes, so does their usage. A tool can be a good fit at a particular moment but will not necessarily scale as well as another as time goes on. That is why it is important to keep an eye on how to solve problems more efficiently while keeping the platform stable.

In the next article, we will see all the tools you can start using today with your Scaleway infrastructure, to keep track of your resources and be more efficient in your deployments.
