Doing AI without breaking the bank: yours, or the planet’s
A 2018 study by OpenAI showed that the amount of compute needed to train state-of-the-art AI models was doubling every 3.4 months. This exponential growth translated into an astounding 300,000-fold increase over the course of only six years, starting from 2012, the year widely recognized as the onset of the deep learning era of AI. The phenomenon is directly linked to the growing complexity of the underlying deep learning models: so-called artificial neural networks (ANNs). Loosely inspired by the way our own brains work, ANNs mathematically amount to matrices of numerical values, termed ANN weights or parameters. Suitable parameters are calculated during the computationally intensive development stage called model training, and are then used to multiply whatever inputs are fed into the ANN, in order to produce (hopefully sensible) outputs.
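To make the "multiply inputs by parameters" idea concrete, here is a minimal sketch of a tiny two-layer network in plain NumPy. The layer sizes and random weights are made up for illustration; a real model would learn its parameters during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: 784 inputs -> 128 hidden units -> 10 outputs.
# The weight matrices W1, W2 and bias vectors b1, b2 are the "parameters";
# here they are just random numbers instead of trained values.
W1, b1 = rng.standard_normal((784, 128)), np.zeros(128)
W2, b2 = rng.standard_normal((128, 10)), np.zeros(10)

def forward(x):
    """Multiply the input by the weights, layer by layer."""
    h = np.maximum(x @ W1 + b1, 0.0)  # matrix multiply + ReLU non-linearity
    return h @ W2 + b2                # final matrix multiply -> 10 class scores

x = rng.standard_normal(784)          # a fake flattened 28x28 "image"
print(forward(x).shape)               # (10,)

n_params = W1.size + b1.size + W2.size + b2.size
print(f"{n_params:,} parameters")     # ~100,000 for this toy model
```

Even this toy model already has roughly 100,000 parameters; scaling the layers up is how networks reach millions, and then billions, of them.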
More parameters, more power
A somewhat oversimplified rule of thumb here is: the greater the number of parameters, the more powerful the model. AlexNet, the neural network that kicked off the deep learning revolution in 2012, used 61 million parameters to classify images into one of 1,000 classes. If 61 million sounds like a lot, wait until you hear how many there are in the largest neural network trained to date: 175 billion! That is the number of values parametrizing GPT-3 (Generative Pre-trained Transformer 3), developed by OpenAI in 2020. Given a prompt, GPT-3 can generate functioning code, write poetry, and produce text that can be nearly impossible to tell apart from text written by a human being. This impressive feat of machine learning engineering would have been considered pure science fiction mere years ago, so there are high expectations for what the next biggest AI may bring to the table.
Wise training choices can reduce footprint by up to 1000x
However, such ground-breaking achievements come at a high price. In fact, the costs associated with the compute resources required to train these models are two-fold. Start with the obvious: the monetary expenses paid by the research groups and companies behind the models' creation. But it does not end there. The second cost is the toll that model training takes on the environment, and that one is carried by us all. Fortunately, we do not need to let these considerations postpone the rise of the machines, as long as we choose wisely where the training takes place. Recent research coming out of Google indicates that certain choices about how and where we train neural networks can reduce the associated carbon footprint by up to 1000x!
GPT-3's impact in figures
Let us take the GPT-3 example and see how much energy can be saved by performing the training at an energy-efficient data center vs. a traditional one. GPT-3 was trained on a cluster of 10,000 Nvidia V100 GPUs: hardware accelerators designed specifically to speed up the calculations involved in training artificial neural networks. The power consumption of a single V100 GPU is 300W. How long did it take to train GPT-3? According to the original paper, the final model training required 3.14×10²³ FLOPs (floating point operations). The total cost of training is likely to have been an order of magnitude higher, since a typical deep learning project involves training many different variations of a model before choosing the best performing one. For a low-end estimate, let us take 3.14×10²³ FLOPs and only consider the power consumed by the GPUs (whereas in reality, we would have to add CPUs, networking, and memory into the mix). Performance-wise, a V100 can handle 14 TFLOPS (trillion floating point operations per second) using single-precision numbers, and in theory that figure doubles to 28 TFLOPS for the half-precision format that OpenAI used. With 10,000 GPUs, this translates to about 13 days of training, or 10,000 × 300W × 13 days ≈ 936 MWh of energy consumed by the GPUs alone.
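For readers who like to check the arithmetic, here is a quick back-of-the-envelope sketch in Python of the estimate above. The figures are the ones quoted in this section (paper-reported FLOPs, theoretical peak throughput, GPU power only), not measured values.

```python
# Back-of-the-envelope estimate of GPT-3's training energy (GPU power only).
TOTAL_FLOPS = 3.14e23      # floating point operations, from the GPT-3 paper
N_GPUS = 10_000            # V100 GPUs in the training cluster
FLOPS_PER_GPU = 28e12      # 28 TFLOPS, theoretical half-precision peak per V100
GPU_POWER_W = 300          # power draw of a single V100, in watts

seconds = TOTAL_FLOPS / (N_GPUS * FLOPS_PER_GPU)
days = seconds / 86_400
energy_mwh = N_GPUS * GPU_POWER_W * seconds / 3_600 / 1e6  # watt-seconds -> MWh

print(f"{days:.1f} days of training")        # ~13 days
print(f"{energy_mwh:.0f} MWh for the GPUs")  # ~935 MWh
```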
Feel the heat
As anyone who has been careless enough to put a laptop on their lap and run one too many applications at once knows, machines heat up while working. Ten thousand GPUs pushed to their limit for two weeks straight heat up a lot. In fact, much of the energy consumed at a typical data center goes not just to powering the servers, but also to cooling down the data center itself. This, and other associated costs, are captured by a measure called PUE, Power Usage Effectiveness. PUE is the ratio A/B of the total energy A entering the data center to the energy B that actually reaches the servers. PUE is always greater than 1, as a value of 1 would imply perfect efficiency, where no energy was spent on running the data center itself at all. In a traditional data center, PUE might be somewhere around 1.5 or more. This means that the actual energy cost of training a model like GPT-3 would amount to at least 936 MWh × 1.5 = 1404 MWh. Is there a way to reduce this number?
Simplest answer: lowest possible PUE
The most straightforward way of cutting down the amount of energy involved in a cloud computation is to choose a data center with a lower PUE. Scaleway's DC5, located in the suburbs of Paris, has a PUE of 1.15, meaning that the overhead is reduced from 50% to 15%. These savings benefit both the model creators and the environment that we all share. To get back to the GPT-3 example, doing the training at a data center like DC5 would result in an energy usage of 936 MWh × 1.15 = 1076 MWh, saving 328 MWh, or 23% of the energy consumed by a traditional data center. To put this in perspective, it takes about 1 MW to power 2,000 French homes. Moreover, while this calculation only took into account the training of the final model, in reality the experiments necessary to arrive at that point would drive the total energy cost up by 10x-100x. In other words, training a model like GPT-3 in an energy-efficient data center like DC5 saves enough energy to run a medium-sized city for multiple days, possibly weeks.
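As a minimal sketch of that comparison (reusing the ~936 MWh GPU-only estimate from above, and a hypothetical 10x multiplier for the preliminary experiments), the PUE arithmetic looks like this:

```python
# Compare total training energy under different PUE values.
GPU_ENERGY_MWH = 936  # GPU-only estimate for the final GPT-3 training run

def total_energy(pue: float, experiments_factor: float = 1.0) -> float:
    """Total facility energy = IT energy * PUE, optionally scaled by
    the extra experiments run before the final training."""
    return GPU_ENERGY_MWH * pue * experiments_factor

traditional = total_energy(1.5)   # ~1404 MWh
efficient = total_energy(1.15)    # ~1076 MWh (DC5-like PUE)
saved = traditional - efficient
print(f"saved: {saved:.0f} MWh ({saved / traditional:.0%})")   # ~328 MWh, 23%

# With a (hypothetical) 10x overhead for all the preliminary experiments:
print(f"saved at 10x: {total_energy(1.5, 10) - total_energy(1.15, 10):.0f} MWh")
```

The savings scale linearly with the amount of compute, which is exactly why the choice of data center matters more and more as models keep growing.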
At the rate that the size of state-of-the-art artificial neural networks is increasing, the difference between training the next biggest AI in an energy-efficient vs. traditional data center may very well power a small, but picturesque, European capital for a month. Food for (human and synthetic) thought!