Virtual Machines Metrics
Virtual machine (VM) metrics give you insight into the performance and utilization of your environment. The metrics help you monitor VM health, identify bottlenecks, and optimize resource utilization. VM metrics are generally available for both NVIDIA GPU-accelerated VMs and non-GPU VMs.
Crusoe Cloud collects GPU, CPU, memory, disk, and network metrics as part of its VM monitoring suite. Data collection requires installing the Crusoe Watch Agent on each VM. VM metrics are collected and published at 60-second intervals and are retained for 30 days. The full dataset is available via the Prometheus-compatible query API; a subset of critical metrics, listed in the table below, is also available in the Crusoe Console. To view them, go to the Compute tab in the left navigation bar, select your VM, then select the Metrics tab in the top navigation bar.
| Metrics | Definition | Suggested Query |
|---|---|---|
| TFlops (FP16) | The measured 16-bit floating-point GPU throughput calculated by scaling the tensor core utilization against the hardware's theoretical maximum. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE / 100 * theoretical max TFLOPS |
| GPU Utilization (%) | The percentage of time the GPU is actively executing tasks. | DCGM_FI_DEV_GPU_UTIL |
| CPU Utilization (%) | The aggregated percentage of time the host's CPU cores are busy over the last 60 seconds. | (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id", mode!="idle"}[60s]))) / (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id"}[60s]))) * 100 |
| GPU Memory Utilization (%) | The percentage of total dedicated GPU memory that is actively allocated and consumed by processes on the GPU. | (DCGM_FI_DEV_FB_USED / ( DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) * 100 |
| GPU Memory Bandwidth Utilization (%) | The percentage of the theoretical peak memory interface bandwidth being utilized for data transfer between the GPU and memory. | DCGM_FI_PROF_DRAM_ACTIVE |
| System Memory Utilization (%) | The percentage of total host system RAM that is consumed. | (crusoe_vm_memory_used_bytes / crusoe_vm_memory_total_bytes) * 100 |
| GPU Power Draw (W) | The current power consumption of the GPU, measured in Watts. | DCGM_FI_DEV_POWER_USAGE |
| GPU Temperature (Celsius) | The current core temperature of the GPU die, measured in Celsius. | DCGM_FI_DEV_GPU_TEMP |
| Tensor Core Utilization (%) | The percentage of time the Tensor pipeline is actively processing instructions over the sample period. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * 100 |
| VPC Network Bandwidth In (bytes per second) | The rate of data received by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_receive_bytes_total |
| VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_transmit_bytes_total |
| PCIe Bandwidth (bytes per second) | The rate of data transfer (Tx + Rx) between the CPU host memory and the GPU over the PCIe bus, measured in bytes per second. | DCGM_FI_PROF_PCIE_TX_BYTES + DCGM_FI_PROF_PCIE_RX_BYTES |
| PCIe Replay Rate | The rate of error-induced packet retransmissions over the PCIe bus, measured in replays per second. High rates indicate link quality issues. | rate(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[1m]) |
| Uncorrectable ECC Error Rate | The rate of accumulation of uncorrectable double-bit memory errors (DBE) on the GPU, indicating severe hardware instability. | rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1m]) |
| Correctable ECC Error Rate | The rate of accumulation of correctable single-bit memory errors (SBE) on the GPU, indicating marginal hardware stability. | rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL[1m]) |
| SM Occupancy (%) | The average percentage of available resident warps running concurrently on the Streaming Multiprocessors (SMs) over the sample period. | DCGM_FI_PROF_SM_OCCUPANCY |
| SM Active (%) | The percentage of time the Streaming Multiprocessors (SMs) were executing instructions during the sample period. | DCGM_FI_PROF_SM_ACTIVE * 100 |
| SM Average Clock Speed (MHz) | The current instantaneous clock frequency of the GPU's Streaming Multiprocessors (SMs) in Megahertz (MHz). | DCGM_FI_DEV_SM_CLOCK |
| GPU XID error | The most recent unique error code emitted by the GPU driver (a hardware or software fault ID). Non-zero values indicate an error that typically requires a driver reset or GPU restart. | DCGM_FI_DEV_XID_ERRORS |
If your VM uses NVLink-enabled NVIDIA instances, the following NVLink metrics are supported in Crusoe Console.
| Metrics | Definition | Suggested Query |
|---|---|---|
| GPU NVLink Bandwidth In (bytes per second) | The rate of data received by the GPU from other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_RX_BYTES |
| GPU NVLink Bandwidth Out (bytes per second) | The rate of data transmitted by the GPU to other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_TX_BYTES |
Prerequisites for using the Crusoe Watch Agent
Step 1: Install or update Crusoe CLI
Follow the Crusoe CLI instructions to install and configure the CLI. If the CLI is already installed and configured, upgrade it to the latest version.
Step 2: Generate your monitoring token
To generate the monitoring token for VM metrics, run the following CLI command:
crusoe monitoring tokens create
This command generates an API-Key that you use both when installing the agent and when authenticating to the metrics API. Store the token in a secret or key management tool right away, because you will not be able to retrieve it later.
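If no secret manager is available, one fallback is to keep the token in a file that only your user can read. The sketch below is an illustration, not an official Crusoe procedure: the function name `store_monitoring_token` and the default path `~/.config/crusoe/monitoring-token` are hypothetical. It assumes you supply the token value on standard input, since the output format of the CLI command is not described here.

```shell
# Hypothetical helper: read one line (the token) from stdin and save it to a
# file that only the current user can read.
# Usage: store_monitoring_token [path]
store_monitoring_token() {
  _path="${1:-$HOME/.config/crusoe/monitoring-token}"
  mkdir -p "$(dirname "$_path")"
  # Reading from stdin keeps the token out of the argument list.
  IFS= read -r _token
  # The subshell keeps the restrictive umask from leaking into the caller.
  (
    umask 077
    printf '%s\n' "$_token" > "$_path"
  )
  # Tighten permissions even if the file already existed with looser ones.
  chmod 600 "$_path"
  unset _token
}
```

Run `store_monitoring_token`, paste the token, and press Enter. Later commands can load the value with `cat` on the same path instead of embedding it in scripts.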
Install Crusoe Watch Agent for your VMs
Step 1: Verify your VM image version
The Crusoe Watch Agent only supports VMs with the following base images:
ubuntu22.04-nvidia-sxm-docker
ubuntu22.04-nvidia-pcie-docker
ubuntu22.04 (non-GPU VMs only)
ubuntu24.04 (non-GPU VMs only)
Step 2: Install agent
Follow the installation instructions to install the Crusoe Watch Agent. If you prefer to use Ansible, follow the Ansible Deployment Guide instead.
Querying VM metrics via API
You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:
https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Here is an example curl command that retrieves the most recent GPU utilization data point for a project:
curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
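Several suggested queries in the metrics table contain braces, quotes, and spaces, which must be URL-encoded before they are sent. The following sketch shows one way to do that with curl's `-G` and `--data-urlencode` options, using the CPU utilization query as the example. The function names and the `CRUSOE_PROJECT_ID` and `CRUSOE_API_KEY` environment variables are hypothetical, and the response is printed as returned because its format is not described here.

```shell
# Build the CPU utilization query from the metrics table for a single VM.
# Usage: cpu_utilization_query <vm-id>
cpu_utilization_query() {
  _vm_id="$1"
  printf '(sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="%s", mode!="idle"}[60s]))) / (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="%s"}[60s]))) * 100' \
    "$_vm_id" "$_vm_id"
}

# Send any query expression to the metrics endpoint.
# CRUSOE_PROJECT_ID and CRUSOE_API_KEY are assumed environment variables.
# Usage: query_metrics <expression>
query_metrics() {
  curl -sS -G \
    "https://api.crusoecloud.com/v1alpha5/projects/${CRUSOE_PROJECT_ID}/metrics/timeseries" \
    --data-urlencode "query=$1" \
    -H "Authorization: Bearer ${CRUSOE_API_KEY}"
}
```

With both variables exported, `query_metrics "$(cpu_utilization_query <vm-id>)"` sends the encoded expression. Letting curl do the encoding avoids hand-escaping characters such as `{`, `"`, and `!`.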
Importing data into Grafana
To import data into your own Grafana instance, add a Prometheus data source with the following options:
Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Authentication → HTTP Headers:
Header: Authorization
Value: Bearer <API-Key>
Use the API-Key generated in the 'Generate your monitoring token' section.
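If you configure Grafana through provisioning files rather than the UI, the same settings can be written as a data source file. The sketch below generates one. The function name, the data source name "Crusoe VM Metrics", and the output location are hypothetical choices; `httpHeaderName1` and `httpHeaderValue1` are the keys Grafana's data source provisioning uses for a custom HTTP header.

```shell
# Hypothetical helper: write a Grafana data source provisioning file that
# mirrors the options listed above.
# Usage: write_grafana_datasource <project-id> <api-key> <output-file>
write_grafana_datasource() {
  _project_id="$1"
  _api_key="$2"
  _out="$3"
  cat > "$_out" <<EOF
apiVersion: 1
datasources:
  - name: Crusoe VM Metrics
    type: prometheus
    access: proxy
    url: https://api.crusoecloud.com/v1alpha5/projects/${_project_id}/metrics/timeseries
    jsonData:
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: Bearer ${_api_key}
EOF
}
```

Place the generated file in Grafana's `provisioning/datasources` directory and restart Grafana to load it. The file contains the API-Key in plain text, so restrict its permissions accordingly.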
Known issues
Correctable ECC errors not emitted correctly for multi-GPU VMs and clusters
Correctable ECC errors (DCGM_FI_DEV_ECC_SBE_VOL_TOTAL) may be missing from some time series on multi-GPU VMs and clusters. The DCGM exporter does not always emit them because of a known NVIDIA NVLink metrics bug (GitHub Issue). We are working on a long-term fix. In the interim, you can manually force DCGM to start monitoring this field by running the following command within the VM:
dcgmi dmon -e 310
Clusters with Slurm images not retrieving metrics correctly due to pre-installed dcgm-exporter
If a dcgm-exporter systemd service is already installed, it can conflict with the dcgm-exporter that is installed as part of the Crusoe Watch Agent and cause metrics collection to fail. To prevent this, pass --replace-dcgm-exporter when setting up the Crusoe Watch Agent; this replaces your existing dcgm-exporter with the Crusoe version for full metrics collection. SERVICE_NAME is optional and defaults to dcgm-exporter.service.
sudo crusoe-watch-agent --replace-dcgm-exporter [SERVICE_NAME]
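To find out whether a pre-installed service exists and what it is called, you can list the systemd unit files on the VM before running the command above. This is a minimal sketch: the two function names are hypothetical, and it assumes the existing service has "dcgm-exporter" somewhere in its unit name.

```shell
# Keep only unit names that contain "dcgm-exporter".
# Reads `systemctl list-unit-files` output on stdin.
filter_dcgm_units() {
  awk '$1 ~ /dcgm-exporter/ { print $1 }'
}

# List installed service unit files that look like a dcgm-exporter service.
find_dcgm_exporter_units() {
  systemctl list-unit-files --type=service --no-legend --no-pager \
    | filter_dcgm_units
}
```

If `find_dcgm_exporter_units` prints a name other than dcgm-exporter.service, pass that name as SERVICE_NAME. If it prints nothing, no matching service was found.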