
Virtual Machine Metrics

Virtual machine (VM) metrics provide comprehensive insight into the performance and utilization of your environment. These metrics help you monitor VM health, identify bottlenecks, and optimize resource utilization. VM metrics are generally available for both NVIDIA GPU-accelerated instances and non-GPU VMs.

Crusoe Cloud collects GPU, CPU, memory, disk, and network metrics as part of its VM monitoring suite. Data collection requires installing the Crusoe Watch Agent on each VM. VM metrics are collected and published at 60-second intervals and are retained for 30 days. While the full dataset is available via the Prometheus-compatible query API, a subset of critical metrics, listed in the table below, can be accessed via the Crusoe Console. To view them, navigate to the Compute tab in the left navigation bar, select your VM, then select the Metrics tab in the top navigation bar.

Metric | Definition | Suggested Query
TFlops (FP16) | The measured 16-bit floating-point GPU throughput, calculated by scaling the tensor core utilization against the hardware's theoretical maximum. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE / 100 * theoretical max TFLOPS
GPU Utilization (%) | The percentage of time the GPU is actively executing tasks. | DCGM_FI_DEV_GPU_UTIL
CPU Utilization (%) | The aggregated percentage of time the host's CPU cores are busy over the last 60 seconds. | (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id", mode!="idle"}[60s]))) / (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id"}[60s]))) * 100
GPU Memory Utilization (%) | The percentage of total dedicated GPU memory that is actively allocated and consumed by processes on the GPU. | (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) * 100
GPU Memory Bandwidth Utilization (%) | The percentage of the theoretical peak memory interface bandwidth being used for data transfer between the GPU and its memory. | DCGM_FI_PROF_DRAM_ACTIVE
System Memory Utilization (%) | The percentage of total host system RAM that is consumed. | (crusoe_vm_memory_used_bytes / crusoe_vm_memory_total_bytes) * 100
GPU Power Draw (W) | The current power consumption of the GPU, measured in Watts. | DCGM_FI_DEV_POWER_USAGE
GPU Temperature (Celsius) | The current core temperature of the GPU die, measured in Celsius. | DCGM_FI_DEV_GPU_TEMP
Tensor Core Utilization (%) | The percentage of time the Tensor pipeline is actively processing instructions over the sample period. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * 100
VPC Network Bandwidth In (bytes per second) | The rate of data received by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_receive_bytes_total
VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_transmit_bytes_total
PCIe Bandwidth (bytes per second) | The rate of data transfer (Tx + Rx) between the CPU host memory and the GPU over the PCIe bus, measured in bytes per second. | DCGM_FI_PROF_PCIE_TX_BYTES + DCGM_FI_PROF_PCIE_RX_BYTES
PCIe Replay Rate (replays per second) | The rate of error-induced packet retransmissions over the PCIe bus, measured in replays per second. High rates indicate link quality issues. | rate(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[1m])
Uncorrectable ECC Error Rate | The rate of accumulation of uncorrectable double-bit memory errors (DBE) on the GPU, indicating severe hardware instability. | rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1m])
Correctable ECC Error Rate | The rate of accumulation of correctable single-bit memory errors (SBE) on the GPU, indicating marginal hardware stability. | rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL[1m])
SM Occupancy (%) | The average percentage of available resident warps running concurrently on the Streaming Multiprocessors (SMs) over the sample period. | DCGM_FI_PROF_SM_OCCUPANCY
SM Active (%) | The percentage of time the Streaming Multiprocessors (SMs) were executing instructions during the sample period. | DCGM_FI_PROF_SM_ACTIVE * 100
SM Average Clock Speed (MHz) | The current instantaneous clock frequency of the GPU's Streaming Multiprocessors (SMs), in Megahertz (MHz). | DCGM_FI_DEV_SM_CLOCK
GPU XID Error | The most recent unique error code emitted by the GPU driver (a hardware or software fault ID). Non-zero values indicate an error that typically requires a driver reset or GPU restart. | DCGM_FI_DEV_XID_ERRORS
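For example, on an NVIDIA H100 SXM GPU, whose theoretical dense FP16 Tensor Core peak is roughly 989 TFLOPS (an illustrative figure; substitute the value for your GPU model), the TFlops (FP16) suggested query becomes:

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE / 100 * 989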

If your VM uses NVLink-enabled NVIDIA instances, the following NVLink metrics are supported in Crusoe Console.

Metric | Definition | Suggested Query
GPU NVLink Bandwidth In (bytes per second) | The rate of data received by the GPU from other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_RX_BYTES
GPU NVLink Bandwidth Out (bytes per second) | The rate of data transmitted by the GPU to other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_TX_BYTES
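If you want a single combined NVLink throughput figure, the two metrics can be added together, mirroring the PCIe Bandwidth query above (a sketch; it assumes both series carry identical labels, as they do when emitted by the same exporter):

DCGM_FI_DEV_NVLINK_RX_BYTES + DCGM_FI_DEV_NVLINK_TX_BYTES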

Prerequisites for using the Crusoe Watch Agent

Step 1: Install or update Crusoe CLI

Follow the instructions to install and configure the Crusoe CLI. If you already have the CLI installed and configured, make sure you upgrade to the latest version.

Step 2: Generate your monitoring token

To generate the monitoring token for VM metrics, run the following CLI command:

crusoe monitoring tokens create

This command generates an API-Key that you'll use for the agent installation and for authentication when querying the metrics API. Store the token in a secret or key management tool; you will not be able to retrieve it later.
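For example, one way to keep the token out of your shell history and restrict who can read it (a minimal sketch; the file path is an arbitrary choice, and a dedicated secrets manager is preferable where available):

# Store the token in a file readable only by your user; paste the token, then press Enter.
mkdir -p ~/.crusoe
read -rs CRUSOE_MONITORING_TOKEN
printf '%s\n' "$CRUSOE_MONITORING_TOKEN" > ~/.crusoe/monitoring-token
chmod 600 ~/.crusoe/monitoring-token
unset CRUSOE_MONITORING_TOKEN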

Install Crusoe Watch Agent for your VMs

Step 1: Verify your VM image version

The Crusoe Watch Agent only supports VMs with the following base images:

  • ubuntu22.04-nvidia-sxm-docker
  • ubuntu22.04-nvidia-pcie-docker
  • ubuntu22.04 (non-GPU VMs only)
  • ubuntu24.04 (non-GPU VMs only)
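If you are unsure which base image a VM was launched from, a quick sanity check from inside the VM is to confirm the Ubuntu release and, on GPU images, that the NVIDIA driver is present (a sketch; it does not distinguish the sxm and pcie image variants):

# Check the Ubuntu release (should report 22.04 or 24.04).
grep -E '^(NAME|VERSION_ID)=' /etc/os-release

# On GPU images, confirm the NVIDIA driver and GPUs are visible.
nvidia-smi --query-gpu=name,driver_version --format=csv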

Step 2: Install agent

Follow the installation instructions to install the Crusoe Watch Agent. If you prefer Ansible, use the Ansible Deployment Guide.

Querying VM metrics via API

You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Here is an example curl command to retrieve the most recent data point for GPU utilization in a project:

curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
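If you want to pull individual values out of the response on the command line, you can pipe it through jq (must be installed separately). This sketch assumes the response follows the standard Prometheus instant-query JSON shape (status, data.result[].metric, data.result[].value); adjust the paths if the shape differs:

# Print one line per GPU: "<gpu index>: <utilization>%".
curl -sG 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>' \
  | jq -r '.data.result[] | "\(.metric.gpu // "gpu"): \(.value[1])%"'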

Importing data into Grafana

To import data into your own Grafana instance, add a Prometheus data source with the following options:

Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Authentication → HTTP Headers:

Header: Authorization
Value: Bearer <API-Key>

Use the API-Key generated in the 'Generate your monitoring token' section.
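If you manage Grafana with file-based provisioning rather than the UI, the same data source can be declared in a provisioning file. This is a sketch assuming a standard Grafana installation that reads /etc/grafana/provisioning/datasources/; the data source name and file name are arbitrary examples:

sudo tee /etc/grafana/provisioning/datasources/crusoe-vm-metrics.yaml > /dev/null <<'EOF'
apiVersion: 1
datasources:
  - name: Crusoe VM Metrics
    type: prometheus
    access: proxy
    url: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
    jsonData:
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: Bearer <API-Key>
EOF

Replace <project-id> and <API-Key> with your values, then restart Grafana (for example, sudo systemctl restart grafana-server) so the file is picked up.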

Known issues

Correctable ECC errors not emitted correctly for multi-GPU VMs and clusters

Correctable ECC errors (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL) may not be emitted for all time series on multi-GPU VMs and clusters by the DCGM exporter, due to a known NVIDIA NVLink metrics bug (GitHub Issue). We are working on a long-term fix. In the interim, you can manually force DCGM to start monitoring this field by running the following command inside the VM:

dcgmi dmon -e 310
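Note that dcgmi dmon runs in the foreground and typically only keeps the field watched while it is running. One way to leave it running in the background for the current boot (a sketch; it will not survive a reboot, so re-run it or wrap it in a systemd unit if you need persistence):

# Watch field 310 (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL) every 10 seconds in the background.
nohup dcgmi dmon -e 310 -d 10000 > /dev/null 2>&1 &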

Clusters with Slurm images not retrieving metrics correctly due to pre-installed dcgm-exporter

If your VM has a pre-installed dcgm-exporter systemd service, it can conflict with the dcgm-exporter installed as part of the Crusoe Watch Agent, causing metrics collection failures. To prevent this, pass --replace-dcgm-exporter when setting up the Crusoe Watch Agent to replace your existing dcgm-exporter with the Crusoe version for full metrics collection. The SERVICE_NAME argument is optional and defaults to dcgm-exporter.service.

sudo crusoe-watch-agent --replace-dcgm-exporter [SERVICE_NAME]
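To confirm whether a conflicting exporter is present, and to pass an explicit unit name when yours differs from the default, something like the following can be used (the unit name shown is an arbitrary example):

# List any dcgm-exporter services already installed on the node.
systemctl list-units --type=service --all | grep -i dcgm

# Replace a non-default unit explicitly (example name).
sudo crusoe-watch-agent --replace-dcgm-exporter my-dcgm-exporter.service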