- To fill in the cost estimator, you can assume that one metric sent every minute without additional cardinality (i.e. without labels or duplicated values for the same metric) generates around 50 000 samples per month (60 minutes x 730 hours per month = 43 800 samples). By default, the DCGM exporter and node exporter send multiple metrics and add labels to them, which leads to a higher number of samples.
- We recommend that you complete this tutorial first to visualize your data, and then review your configuration to optimize the number of metrics or labels sent.
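The estimate above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `monthly_samples` and its parameters are illustrative and not part of any Scaleway tool, and the 730 hours per month figure is the one used in the estimate above. The series count in the second call is an arbitrary value chosen for illustration.

```python
HOURS_PER_MONTH = 730  # average month length used in the estimate above


def monthly_samples(series_count: int, scrape_interval_s: int = 60) -> int:
    """Estimate the samples pushed per month for a number of time series.

    A time series is one metric name with one unique set of label values.
    """
    if series_count < 0 or scrape_interval_s <= 0:
        raise ValueError("series_count must be >= 0 and scrape_interval_s > 0")
    samples_per_hour = 3600 / scrape_interval_s
    return round(series_count * samples_per_hour * HOURS_PER_MONTH)


if __name__ == "__main__":
    # One series scraped every 60 seconds: 60 x 730 samples per month.
    print(monthly_samples(1))
    # Hypothetical workload: 500 series at the same interval.
    print(monthly_samples(500))
```

Because every unique label combination counts as its own series, the number to feed in is the series count, not the number of metric names.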
Monitor GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter
- cockpit
- monitor
- grafana-alloy
- monitoring
- nvidia
- gpu-instance
This tutorial guides you through the process of monitoring your GPU Instances using Cockpit and the NVIDIA Data Center GPU Manager (DCGM) Exporter. Visualize your GPU Instances’ metrics and ensure optimal performance and usage of your resources.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- Owner status or IAM permissions allowing you to perform actions in the intended Organization
- Created a GPU Instance
- Connected to your Instance via SSH
- Installed Docker Engine and Docker Compose on your GPU Instance
Create a Cockpit data source and credentials
Create a Cockpit data source
Your GPU Instance’s metrics will be stored in a Cockpit data source, and the exporter agent needs that data source’s configuration details to export them. This is why we start by creating one.
- Create a metrics custom data source in Cockpit. For the sake of this tutorial, we will name it gpu-instance-metrics.
- Click your metrics data source to view information such as its URL and push path.
Create a token
- Create a Cockpit token from the Scaleway console.
- Select a region for the data source.
- Tick the Push Metrics box and click Create token to confirm.
Important: Copy and store your token securely. We will use it to allow the Grafana Alloy agent to push your metrics to the metrics data source you created earlier.
Collect metrics from your GPU Instance
Install the NVIDIA DCGM Exporter, node exporter and Grafana Alloy agent on your GPU Instance
- Copy and paste the following command to create a configuration file named config.alloy in your Instance:

    touch config.alloy

- Copy and paste the following template inside config.alloy:

    prometheus.remote_write "cockpit" {
      endpoint {
        url = "https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push"
        headers = {
          "X-TOKEN" = "example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q",
        }
      }
    }

    prometheus.scrape "dcgm_exporter" {
      scrape_interval = "60s"
      targets = [{__address__ = "dcgm_exporter:9400"}]
      forward_to = [prometheus.remote_write.cockpit.receiver]
    }

    prometheus.exporter.unix "node_exporter" {
      set_collectors = [
        "uname",
        "cpu",
        "cpufreq",
        "loadavg",
        "meminfo",
        "filesystem",
        "netdev",
      ]
    }

    prometheus.scrape "node_exporter" {
      scrape_interval = "60s"
      targets = prometheus.exporter.unix.node_exporter.targets
      forward_to = [prometheus.remote_write.cockpit.receiver]
    }

- Replace the values of cockpit.endpoint.url (https://example-afc6-4d02-a2fd-bc020bbaa7d0.metrics.cockpit.fr-par.scw.cloud/api/v1/push) and cockpit.endpoint.headers.X-TOKEN (example_bKNpXZZP6BSKiYzV8fiQL1yR_kP_VLB-h0tpYAkaNoVTHVm8q) with the ones of your gpu-instance-metrics Cockpit data source.
  This configuration allows you to:
  - collect performance data (using dcgm_exporter) from your GPU Instance. This includes information like GPU load (how much of the GPU’s processing power is being used), temperature, and other relevant metrics.
  - collect standard Instance metrics with node_exporter (CPU load, disk size, etc.)
  - push the collected data to your Cockpit data source (using cockpit).
  Note:
  - The current configuration is set to send only a limited number of metrics from node_exporter (the tool collecting CPU, disk, memory, etc. data). Because of this, some data might not show up on your Cockpit dashboards in Grafana when you import them.
  - If you want to send all available data from node_exporter, you need to edit its configuration. Specifically, you need to remove the set_collectors list from the configuration. This list defines which metrics are being collected, and removing it will allow all metrics to be sent.
  - While removing the set_collectors list will provide more detailed metrics, it may come with higher resource usage and associated costs, especially if you are using a paid service for data monitoring or storage.
- Copy and paste the following command to create a docker-compose.yaml file in your Instance:

    touch docker-compose.yaml

- Copy and paste the following configuration inside docker-compose.yaml, save it and exit the file.

    services:
      dcgm_exporter:
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [ gpu ]
        cap_add:
          - SYS_ADMIN
        ports:
          - "9400:9400"
      agent:
        image: grafana/alloy:latest
        ports:
          - "12345:12345"
        volumes:
          - "./config.alloy:/etc/alloy/config.alloy"
        command: [
          "run",
          "--server.http.listen-addr=0.0.0.0:12345",
          "/etc/alloy/config.alloy",
        ]

  This configuration will:
  - deploy the DCGM exporter
  - deploy the Grafana Alloy agent
- Run the Docker services using the following command:

    docker compose up
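Once the containers are running, it can help to check what the DCGM exporter actually exposes before looking at Cockpit. The sketch below assumes the exporter serves the Prometheus text format at http://localhost:9400/metrics: port 9400 is the one published in docker-compose.yaml, while the /metrics path follows the usual Prometheus convention and is an assumption here. The helper names `count_series` and `fetch_metrics` are hypothetical. Counting time series per metric name gives a rough input for the cost estimator.

```python
from collections import Counter
from urllib.request import urlopen

# Assumption: the exporter is reachable on the port published in docker-compose.yaml.
METRICS_URL = "http://localhost:9400/metrics"


def count_series(exposition_text: str) -> Counter:
    """Count time series per metric name in Prometheus text-format output."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


def fetch_metrics(url: str = METRICS_URL) -> str:
    """Download the raw metrics page from the exporter."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def print_report(series: Counter) -> None:
    """Print one line per metric name, largest series count first."""
    for name, n in series.most_common():
        print(f"{n:5d}  {name}")
    print(f"total series: {sum(series.values())}")
```

To use it on the Instance, call `print_report(count_series(fetch_metrics()))` from a Python shell. Each series is scraped once per minute with the configuration above, so the total is also the number of samples sent per minute for this exporter.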
Create Cockpit dashboards in Grafana
Create a GPU metrics dashboard
- Access the Overview tab of your Cockpit and click Open dashboards to open your Cockpit dashboards in Grafana.
- Click the + icon in the top-right-hand corner, then click Import dashboard.
- Copy the ID (12219) of the Grafana NVIDIA DCGM Exporter dashboard and paste it in the Import via grafana.com field.
- Click Load.
- Select your Prometheus data source named gpu-instance-metrics, then click Import.
You should see your dashboard with data such as GPU Temperature or GPU Power Usage.
If you see only an empty dashboard with the “Dashboard not Found” and “Access denied to this dashboard” errors, wait a few seconds and refresh the page. Your dashboard should then display. Alternatively, you can click the Menu icon on the left, then Dashboards, and search through your dashboards. You should see your newly created dashboard.
Create a CPU and disk metrics Cockpit dashboard in Grafana
- Access the Overview tab of your Cockpit and click Open dashboards to open your Cockpit dashboards in Grafana.
- Click the + icon in the top-right-hand corner, then click Import dashboard.
- Copy the ID (1860) of the Node Exporter Full dashboard and paste it in the Import via grafana.com field.
- Click Load.
- Select your Prometheus data source named gpu-instance-metrics, then click Import.
You should now see your dashboard with data such as CPU usage and Memory Usage.
If you see only an empty dashboard with the “Dashboard not Found” and “Access denied to this dashboard” errors, wait a few seconds and refresh the page. Your dashboard should then display.
If you still do not see any data, make sure that gpu-instance-metrics is selected in the Datasource dropdown list located in the top-left-hand corner.
The current configuration of the Node Exporter agent does not include certain metrics, such as:
- Swap used: How much swap space (virtual memory) is currently being used by the system.
- Root FS used: How much of the root file system (main storage partition) is being used.
You can now find your newly created dashboards in your list of Cockpit dashboards in Grafana. This allows you to access your GPU Instances data to monitor and optimize your resources.
Going further
- Add more metrics to your dashboards
  - Connect to your GPU Instance via SSH
  - Edit the config.alloy file and restart the agents using the docker compose up command
  - Update your Cockpit dashboards in Grafana
- Create custom dashboards
  - In Grafana, explore the metrics you have sent by clicking the Menu icon on the left, then Explore.
  - Select your custom data source named gpu-instance-metrics in the Datasource dropdown list located in the top-left-hand corner.
  - Click Metrics browser. You should see a list of metrics appear (for example, DCGM_FI_DEV_GPU_TEMP or node_cpu_seconds_total).
  - Write the desired query, click Run query to visualize data, and then Add to dashboard to add it to a new or existing dashboard.
Troubleshooting
If you encounter any issues, make sure that you meet all the requirements listed at the beginning of this tutorial.
You can run docker -v in your terminal to check your Docker version. You should see an output similar to the following:
Docker version 24.0.6, build ed223bc820
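If you want to script this check, the version string can be parsed as in the sketch below. The function names `parse_docker_version` and `installed_docker_version` are hypothetical, and the parsing assumes the output format shown above. No minimum version is implied here.

```python
import re
import subprocess


def parse_docker_version(output: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the output of `docker -v`."""
    match = re.search(r"Docker version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unexpected output: {output!r}")
    return tuple(int(part) for part in match.groups())


def installed_docker_version() -> tuple[int, int, int]:
    """Run `docker -v` on this machine and return the parsed version."""
    result = subprocess.run(
        ["docker", "-v"], capture_output=True, text=True, check=True
    )
    return parse_docker_version(result.stdout)
```

If the docker binary is missing, `installed_docker_version()` raises `FileNotFoundError`, which points to Docker Engine not being installed on the Instance.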