K8s security - Episode 4: Managing third parties
Along with user access, you also need to control what is authorized by the services you did not create yourself but depend on: third parties.
In a previous episode, we detailed some of the main security issues found in software, and it is no surprise that information leakage is one of the most frequent security flaws.
Data regulation exists around the world, with various laws, restrictions and agreements. Each country or group of countries inevitably starts making its own rules when it comes to personal information.
Case in point, we have the GDPR (General Data Protection Regulation) in Europe, the LGPD (Lei Geral de Proteção de Dados) in Brazil, the PDP (Protección de Datos Personales) in Argentina, and the PPA (The Privacy Protection Authority, formerly ILITA) in Israel. Of course, there are more, each with different levels of personal data protection, regulation, and authorization.
When it comes to businesses working across borders, data regulation since the creation of the GDPR in 2016 has become complicated to say the least. A lot of questions have been raised, especially when it comes to data pipelines, anonymization, data mining, and machine learning.
There are many data protection rules, but if we want to do our best to respect them, it really comes down to two very simple concepts:
Of course, these rules might be "simple" to understand, but implementing them can very quickly become a nightmare and raise a lot of questions depending on the data processing you need to carry out.
With the GDPR, any customer can ask for their personal information to be removed from any database at any time. Now, let's imagine that you have a customer's personal information in one database, and additional, anonymized data in another.
If you don't have an association table, you cannot remove the anonymized data, but then, do you actually need to remove it if you cannot trace it to its original owner?
Let's go further, and imagine that this data is used in the training datasets of a machine learning algorithm which takes days to run. Are you supposed to delete this specific data entry from the training datasets, meaning that you will need to re-train your entire machine learning model?
Fortunately, we are given some latitude here since complying with everything at once is impossible for some businesses, and it is often considered that if data cannot be traced back to its original owner, this is "good enough" in terms of compliance.
But will it stay that way? What other regulations lie ahead of us, and what solutions will we need to address them?
If you want to have a look at data regulation around the world, the CNIL (Commission Nationale de l'Informatique et des Libertés) provides a world map showing the different degrees of data regulation around the world.
Going back to the technical side of things, Veracode issued a whitepaper covering the biggest data breaches of 2020. The number of customers and companies impacted is staggering, and most of these breaches are barely known to the public.
It also shows that giant companies such as Microsoft or Nintendo are not immune to data breaches and security flaws, and that from small businesses to IT giants, security should be, more than ever, everyone's concern.
The biggest breaches exposed personal information publicly, ranging from personal communications to account credentials, and added up to billions of records over the course of 2020.
"The data reveals that information leakage, CRLF injection, cryptographic issues, and code quality are the most common security vulnerabilities plaguing applications today. Fortunately, we know that through secure coding best practices, educational training, and the right combination of testing tools and integrations, developers are able to write more secure code from the start — which means producing innovative applications that avoid cyberattacks and reduce the risk of costly breaches."
Source: Veracode, The Biggest Data Breaches of 2020
How can data be managed and protected in a Kubernetes environment?
Kubernetes is often described as stateless, meaning that it is not meant to host persistent data directly on its nodes. This is only logical, since nodes managed by Kubernetes can be auto-healed (i.e. replaced automatically) to ensure the health of the cluster, and can even be created or deleted thanks to the node auto-scaling feature.
Scalability, cost control, and constant cluster health checks come at a price, and this price is statelessness.
Statelessness does not mean that data cannot be stored while using a Kubernetes cluster, just that the local filesystem of your cluster nodes is not the place to do it. That is where persistent volumes come into play, allowing you to use remote storage (block storage with ReadWriteOnce / RWO access, or NFS with ReadWriteMany / RWX access) and mount it in your pods.
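To make this more concrete, here is a minimal sketch of what requesting and mounting a persistent volume could look like. All names are illustrative, and the storage class depends on the CSI driver available in your cluster.

```yaml
# Claim 10Gi of block storage (RWO); the storage class name depends on your cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data             # illustrative name
  namespace: production      # illustrative namespace
spec:
  accessModes:
    - ReadWriteOnce          # block storage, mounted by a single node at a time
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard # assumption: replace with a class provided by your CSI driver
---
# Mount the claim into a pod so the application can read and write persistent data.
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: production
spec:
  containers:
    - name: app
      image: nginx:1.25      # placeholder image
      volumeMounts:
        - name: data
          mountPath: /var/lib/app
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data
```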
Persistent volumes can be used to store any kind of data, and can even serve as the storage backend of a database managed within a Kubernetes cluster. And as with any Kubernetes object, access to persistent volumes can (and should) be restricted.
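For instance, a minimal RBAC sketch could look like the following, where only one (illustrative) service account is allowed to read the persistent volume claims of its namespace:

```yaml
# Grant read-only access to PVCs in a single namespace, and nothing more.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pvc-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list"]
---
# Bind the role to the (illustrative) service account used by the application.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pvc-reader-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: app
    namespace: production
roleRef:
  kind: Role
  name: pvc-reader
  apiGroup: rbac.authorization.k8s.io
```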
Additionally, it is always good practice to encrypt the data you store. Most cloud providers' CSI (Container Storage Interface) drivers support encryption of persistent volumes.
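As an illustration, here is a sketch of a storage class that provisions encrypted volumes with the AWS EBS CSI driver; other providers expose similar parameters, and the names below are assumptions:

```yaml
# Every volume provisioned through this class is encrypted at rest.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver; other CSI drivers have equivalents
parameters:
  type: gp3
  encrypted: "true"
  # kmsKeyId: "<your-kms-key>" # optional: use a customer-managed encryption key
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```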
In software development, we can identify at least three types of data, each with its own specificities and requirements, and each of which should be treated according to its nature and purpose. Using the same data storage for all of them does not make sense: some data is read, written, or updated frequently, some needs specific indexing, and some is only needed occasionally or for very specific use cases.
Software data
99% of software needs a database to store the information it displays, processes, or digests. In most cases, this data is stored in a relational database, separate from the production environment. Using a managed database-as-a-service allows organizations to rely on their cloud provider for database security, architecture redundancy, and backup policies.
Managing such databases can also be done within a Kubernetes cluster, relying on persistent volumes for data storage. However, redundancy and backup policies are not necessarily defined by default, and remain the responsibility of the customer. In the end, this solution has some drawbacks in common with a self-managed solution.
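To illustrate, here is a minimal sketch of a database running in-cluster on persistent volumes, assuming a PostgreSQL image and illustrative names; replication, backup jobs, and restore procedures would still be up to you:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: databases
spec:
  serviceName: postgres
  replicas: 1                  # a single replica: no redundancy by default
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials  # illustrative Secret
                  key: password
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: encrypted-gp3       # an encrypted class, as sketched earlier
        resources:
          requests:
            storage: 20Gi
```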
The data needed for a piece of software to run is not to be considered critical. Of course, losing it would be painful, as you would need to recreate all your datasets, but if you have a backup, even a week-old one, it will probably be enough to save face with your customers.
Also, as long as no personal or confidential information has been leaked, your production environment and overall business will be as good as new in no time.
User data
Personal information about customers is essential. How could you even prepare an invoice without it? From your customers' names and locations to their credentials and credit card numbers, you need personal data. This data is critical and valuable, and it is the first type of information concerned by data regulations such as the GDPR.
A leak of your users' personal data can have disastrous consequences: you stand to lose not only information, but also customers, trust, and your good reputation. What's more, depending on the country, data regulation officers might come knocking on your door.
Often, user data is not stored any differently than software data, and the two most likely share the same database and the same access rights. They should not.
User data and software data have very different degrees of criticality, which is why each should benefit from access restrictions and security policies that match it. Also, if there is one data type that needs encryption, it is user data.
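As an example of such a restriction, here is a sketch of a NetworkPolicy that only lets one (illustrative) application reach the database holding user data, assuming your cluster's network plugin enforces NetworkPolicies:

```yaml
# Deny all ingress traffic to the user-data database, except from the billing API.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-user-db
  namespace: databases
spec:
  podSelector:
    matchLabels:
      app: user-db             # illustrative label on the database pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
          podSelector:
            matchLabels:
              app: billing-api # illustrative label on the only allowed client
      ports:
        - protocol: TCP
          port: 5432
```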
Analytics data
Metrics, statistics, aggregated data, sensor readings, data specific to each business... all of this is valuable data that can be processed in multiple ways.
From simple statistics to data transformation pipelines, and even machine learning algorithms, we transform our data through external services, one pipeline at a time, one third party after the other.
This data can be the added value you have over your competitors, and that is what makes it extremely important. It often requires dedicated storage because of its volume, format, and specific querying requirements (e.g. a search engine needs to read data efficiently, whereas write performance matters far less).
Analytics data can be recomputed from a single source of truth: a database where raw anonymized data is stored. If the analytics data is lost, it can always be recomputed, with downtime of course, but downtime should never be critical.
Using persistent volumes to store such data in a Kubernetes cluster can be an interesting solution, as it allows the use of a pay-as-you-go (pay-as-you-grow?) model.
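For instance, a storage class that allows volume expansion lets you start small and grow your claims along with the dataset; the provisioner below is an assumption, and any CSI driver supporting expansion would do:

```yaml
# Claims created from this class can be resized later by increasing
# spec.resources.requests.storage on the PVC.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: analytics-expandable
provisioner: ebs.csi.aws.com   # assumption: replace with your provider's CSI driver
parameters:
  type: gp3
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```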
To end on a real-life example, here is a little story about a breached database.
Once upon a time, a company was using Redis for caching data, installed on a dedicated server. The server was not well protected, nor was the Redis service. An attacker breached the server, deleted all the data stored in Redis, and inserted a single entry in its place: "Not well protected, be careful next time". Two days later, the server had been turned into a fortress.
Fortunately, no critical data was stored in Redis...
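The moral of the story translates easily into Kubernetes terms: never leave a datastore unauthenticated, even a cache. As a minimal, illustrative sketch (names are assumptions, and you would still want NetworkPolicies and no public exposure on top of it), a Redis deployment can at least require a password taken from a Secret:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  namespace: cache
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7
          # $(REDIS_PASSWORD) is expanded by Kubernetes from the env var below.
          args: ["--requirepass", "$(REDIS_PASSWORD)"]
          env:
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-credentials  # illustrative Secret
                  key: password
          ports:
            - containerPort: 6379
```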
You will find here a security cheat sheet with the simple purpose of listing best practices and advice to protect your production environment when running it on a Kubernetes cluster.