Multi-Cloud Kubernetes best practices

Scaleway Kubernetes Kosmos is the first Multi-Cloud Kubernetes engine on the market. It makes it possible for nodes from multiple Cloud providers to coexist within the same Kubernetes cluster.

Using Kubernetes in a Multi-Cloud environment can be challenging and requires the implementation of best practices.

Labels

Labelling resources significantly helps you manage your configuration, as labels can be used in selectors and affinity rules. When working in a Multi-Cloud Kubernetes cluster, this step is therefore strongly encouraged, if not mandatory.

It is highly recommended to at least label nodes with information regarding their specificities, such as the provider managing them.

For example, each of our cluster nodes can be given a provider label, such as provider=scaleway.
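Assuming a node named node-a (a placeholder, not a name from any real cluster), the label can be applied with kubectl:

```bash
# Label a node with the provider hosting it ("node-a" is a placeholder)
kubectl label nodes node-a provider=scaleway

# Display the label value across all nodes
kubectl get nodes -L provider
```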

Isolating workload across providers

Some workloads might require specific hardware (such as GPUs), but it can also be preferable for an application to run on a single, dedicated Cloud provider, be it for ownership, legal, or technical reasons.

In such cases, Kubernetes taints and tolerations allow you to add very specific scheduling rules to nodes and applications.
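As a minimal sketch (the dedicated=gpu key and value are illustrative, not taken from a real cluster): a node tainted with kubectl taint nodes node-a dedicated=gpu:NoSchedule will only accept pods carrying the matching toleration:

```yaml
# Illustrative pod spec: it tolerates the hypothetical "dedicated=gpu" taint,
# so it can be scheduled on the tainted nodes that other pods are kept off.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```

Note that a toleration only allows scheduling on tainted nodes; to actively pin the pod to them, combine it with a nodeSelector on the labels described above.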

One replica per provider

If an application needs to run on each cloud provider’s network, anti-affinity rules can be set as follows.

antiaffinity.yml
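The manifest itself is embedded as an image in the original article; the following is a minimal reconstruction of what such a rule can look like, using hypothetical names (one-per-provider) and the provider node label described earlier:

```yaml
# antiaffinity.yml - illustrative reconstruction, names are hypothetical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: one-per-provider
spec:
  replicas: 3   # set this to the number of providers in the cluster
  selector:
    matchLabels:
      app: one-per-provider
  template:
    metadata:
      labels:
        app: one-per-provider
    spec:
      affinity:
        podAntiAffinity:
          # Forbid two replicas from running on nodes sharing the same
          # value of the "provider" label.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: one-per-provider
              topologyKey: provider
      containers:
        - name: app
          image: busybox
          command: ["sleep", "3600"]
```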

This way, each cloud provider used within the cluster will have an instance running one replica of the deployment.

Distributing workload across providers

When a workload needs to be spread across multiple Cloud providers to ensure very high availability of services, pod topology spread constraints are the relevant tool.

topologyspread.yml
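The file is embedded as an image in the original article; based on the description below, a sketch of its content could look like this:

```yaml
# topologyspread.yml - illustrative reconstruction based on the description below
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-everywhere
spec:
  replicas: 6
  selector:
    matchLabels:
      app: aroundtheworld
  template:
    metadata:
      labels:
        app: aroundtheworld
    spec:
      topologySpreadConstraints:
        # With maxSkew: 1, the pod count may differ by at most one
        # between any two values of the "provider" node label.
        - maxSkew: 1
          topologyKey: provider
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: aroundtheworld
      containers:
        - name: busybox
          image: busybox
          command: ["sleep", "3600"]
```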

This YAML file describes a balanced distribution of busybox-everywhere pods carrying the label app=aroundtheworld across nodes, based on the value of their provider label.

Using Scaleway pools to buffer workload

It is not simple to benefit from the Kubernetes node auto-scaling feature when running a Multi-Cloud Kubernetes cluster. Nonetheless, when using Scaleway Kubernetes Kosmos, this feature is available on Scaleway node pools.

With a minimum size of zero nodes, Scaleway node pools are an ideal way to absorb any unexpected workload at minimal cost.

It is highly recommended to use such pools with the node auto-scaling feature activated, to ensure the highest availability of any production system and to mitigate potential issues on any of the Cloud providers involved.
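As a sketch, such a buffer pool can be created with the Scaleway CLI. The argument names below are assumptions based on the scw k8s pool create command and may differ between CLI versions; verify them with scw k8s pool create --help:

```bash
# Hypothetical example: create an autoscaled buffer pool that can scale down to zero.
# Argument names are assumptions; check `scw k8s pool create --help`.
scw k8s pool create \
  cluster-id=11111111-1111-1111-1111-111111111111 \
  name=buffer-pool \
  node-type=DEV1-M \
  size=0 \
  min-size=0 \
  max-size=5 \
  autoscaling=true \
  autohealing=true
```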

Node auto-scaling feature in Scaleway Console

Services exposure

Exposing HTTP services within a Multi-Cloud cluster is no different from doing so in a regular cluster. In a managed Kubernetes cluster such as Kubernetes Kosmos, the Cloud Controller Manager will create a Scaleway Multi-Cloud Load Balancer.
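For example, a standard Service of type LoadBalancer is all that is needed; in a Kubernetes Kosmos cluster it results in a Scaleway Multi-Cloud Load Balancer (the web name and the ports below are illustrative):

```yaml
# Standard Kubernetes Service; the Cloud Controller Manager provisions
# the actual Load Balancer on the provider side.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web          # hypothetical label of the pods to expose
  ports:
    - port: 80        # port exposed by the Load Balancer
      targetPort: 8080 # port the pods listen on
```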

Storage - Deploying CSI

Cloud providers offer storage solutions that can only be attached to their own infrastructure services. In Multi-Cloud environments, this is a real constraint that needs to be taken into account when designing a production architecture as well as the software itself. While using an external database solution is recommended and compatible with any Multi-Cloud setup, Kubernetes users sometimes require persistent storage within their Kubernetes clusters.

Persistent volumes are managed by the Container Storage Interface (CSI) component and, fortunately, almost all Cloud providers have made their CSI drivers open source, allowing customers to deploy them within their clusters.

The best practice remains to use the labels set on the nodes to target each provider's instances and deploy the corresponding CSI driver on them.
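In practice, this usually means constraining each provider's CSI node plugin (typically a DaemonSet) with a nodeSelector on the provider label. A sketch of the relevant fragment, with placeholder names and image; the full spec depends on the driver being deployed:

```yaml
# Fragment of a CSI node-plugin DaemonSet: run it only on Scaleway nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: csi-node-scaleway   # placeholder name
spec:
  selector:
    matchLabels:
      app: csi-node-scaleway
  template:
    metadata:
      labels:
        app: csi-node-scaleway
    spec:
      nodeSelector:
        provider: scaleway   # only schedule on nodes labelled as Scaleway
      containers:
        - name: csi-driver
          image: csi-driver-image:latest   # placeholder image
```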

Points of concern when going Multi-Cloud with Kubernetes

As Multi-Cloud flexibility comes along with complexity, there are a few topics and behaviors to keep in mind when implementing a Cross-Cloud Kubernetes cluster.

Managed Kubernetes engines implement many components whose behavior differs depending on the Cloud provider supplying them. Each of these can impact the behavior of the software and applications running in a Kubernetes cluster if its principles are not understood.

Cloud Controller Manager

The Cloud Controller Manager (CCM) is a component of the Kubernetes control plane that implements cloud-specific logic.

When using a managed Kubernetes engine from a Cloud provider, creating a Kubernetes Service of type LoadBalancer will typically lead to the creation of a Load Balancer on the customer's Cloud provider account. This logic is in fact implemented by the CCM, and every Cloud provider configures it to connect to its own service APIs.

What happens then in a Multi-Cloud Kubernetes cluster?

Since the Cloud Controller Manager is a component of the control plane, its behavior is the same regardless of which provider hosts each cluster node. It is therefore the responsibility of the CCM maintainer to implement the expected behavior.

When using a Scaleway Kubernetes Kosmos cluster, the CCM implements the creation of a Scaleway Multi-Cloud Load Balancer and allows the exposure of HTTP services from a Multi-Cloud cluster.

Container Storage Interface

The Container Storage Interface (CSI) component manages the interface between a Kubernetes node and storage solutions such as block storage or network file systems (NFS).

Just like the CCM, the CSI is an interface between Kubernetes and a Cloud provider's APIs. However, instead of having a single, cluster-wide behavior, customers can install as many CSI drivers as they want.

Storage management within a Multi-Cloud Kubernetes cluster therefore implies managing multiple CSI drivers: at least one per Cloud provider used in the cluster, if persistent storage is needed.
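Concretely, each installed driver exposes its own provisioner through a dedicated StorageClass. As an illustration, assuming the open source Scaleway CSI driver (whose provisioner name is csi.scaleway.com) and an arbitrary class name:

```yaml
# One StorageClass per Cloud provider whose CSI driver is installed.
# "csi.scaleway.com" is the provisioner of the open source Scaleway CSI driver;
# other providers' drivers expose their own provisioner names.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scw-block-storage   # arbitrary name
provisioner: csi.scaleway.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Here, volumeBindingMode: WaitForFirstConsumer delays volume provisioning until a pod is scheduled, which helps ensure the volume is created on the same provider as the node that will consume it.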

It also implies that persistent volumes hosted on provider A cannot be mounted on a node from provider B, thus limiting the possibilities for redundancy, data access, and data recovery.

Fortunately for Kubernetes users, almost all Cloud providers have made their CSI drivers open source, making it easy to install the needed drivers on your Kubernetes clusters using Kubernetes node selectors.

A list of open source CSI drivers can be found in the Kubernetes CSI documentation.

Autoscaler

The node auto-scaling feature offered by Kubernetes allows the automatic addition or removal of nodes from your Kubernetes cluster depending on its workload.

When working in a Multi-Cloud Kubernetes cluster, two options can be considered. The first one is to completely deactivate the node auto-scaling feature, as managing a Cross-Cloud node auto-scaling strategy can quickly become too complex. It also implies managing all providers’ accounts and credentials. The second option would be to authorize auto-scaling on nodes from only one Cloud provider.

With Scaleway Kubernetes Kosmos, the latter option was chosen.

In fact, a Kubernetes Kosmos cluster can contain multiple Scaleway node pools in multiple regions, each of them implementing the node auto-scaling feature.

Node auto-healing

The node auto-healing feature (managed by yet another Kubernetes component) has the same constraints as the node auto-scaling feature: it requires permissions to perform sensitive actions on a Cloud provider user account. For this reason, managing auto-healing in a Multi-Cloud cluster is complicated and implies the same choices as before.

For the sake of simplicity and consistency, Scaleway Kubernetes Kosmos only manages node auto-healing for Scaleway Instances, just as it would for any standard Kubernetes Kapsule cluster.

While running a Multi-Cloud Kubernetes cluster, and by extension a Kubernetes Kosmos cluster, it is advised to have a fallback strategy in case another Cloud provider's infrastructure is lost. As such, we recommend keeping at least one Scaleway node pool with the auto-healing and auto-scaling features enabled, to absorb the workload should part of the infrastructure fail.

Network

A Multi-Cloud architecture comes with complexity, but also constraints, as Instances from different providers must communicate with each other within the same network.

First of all, it raises the question of low latency versus high availability. While one of the main purposes of a Cross-Cloud cluster is obviously a highly available infrastructure, it implies latency that can be hard to measure and anticipate. This is not necessarily a major concern, but it is an important point to consider.

Secondly, as nodes communicate with each other through a dedicated network based on a VPN solution, using a Cloud provider's private network is out of the question. By design, a provider's private network requires that all resources within it are hosted and managed by that same provider.

Conclusion

Implementing a Multi-Cloud strategy has always been a real challenge for every company as it requires in-depth knowledge of the infrastructure and the underlying technologies to successfully achieve its objective. Keeping simple logic rules and being mindful of the overall behavior of your architecture are essential to running a smooth Multi-Cloud Kubernetes environment while making the most of its many functionalities.
