
Virtual Machines Metrics

Info: Virtual Machine metrics are currently in preview and are available for VMs on NVIDIA GPU instances and for non-GPU VMs. To request access, please contact our sales team.

Virtual machine (VM) metrics are designed to provide you with comprehensive insights into the performance and utilization of your virtual machines. The metrics help you monitor VM performance, identify bottlenecks, and optimize resource utilization as needed.

Crusoe Cloud collects GPU, CPU, memory, disk, and network metrics for each VM. Collection requires the Crusoe Telemetry Agent to be installed on every VM, and metrics are collected and published every 60 seconds.

A subset of critical metrics, listed in the table below, is available in the Crusoe Console. To view them, open the Compute tab in the left sidebar, select your VM, then select the Metrics sub-tab in the top navigation row. These metrics, along with additional DCGM exporter metrics, are also accessible via a Prometheus-compatible query API.

| Metric | Definition | Suggested Query |
| --- | --- | --- |
| TFlops (FP16) | The measured 16-bit floating-point GPU throughput, calculated by scaling the tensor core utilization against the hardware's theoretical maximum. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE / 100 * theoretical max TFLOPS |
| GPU Utilization (%) | The percentage of time the GPU is actively executing tasks. | DCGM_FI_DEV_GPU_UTIL |
| CPU Utilization (%) | The aggregated percentage of time the host's CPU cores are busy over the last 60 seconds. | (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id", mode!="idle"}[60s]))) / (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id"}[60s]))) * 100 |
| GPU Memory Utilization (%) | The percentage of total dedicated GPU memory that is actively allocated and consumed by processes on the GPU. | (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) * 100 |
| GPU Memory Bandwidth Utilization (%) | The percentage of the theoretical peak memory interface bandwidth being utilized for data transfer between the GPU and memory. | DCGM_FI_PROF_DRAM_ACTIVE |
| System Memory Utilization (%) | The percentage of total host system RAM that is consumed. | (crusoe_vm_memory_used_bytes / crusoe_vm_memory_total_bytes) * 100 |
| GPU Power Draw (W) | The current power consumption of the GPU, measured in watts. | DCGM_FI_DEV_POWER_USAGE |
| GPU Temperature (Celsius) | The current core temperature of the GPU die, measured in degrees Celsius. | DCGM_FI_DEV_GPU_TEMP |
| Tensor Core Utilization (%) | The percentage of time the Tensor pipeline is actively processing instructions over the sample period. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * 100 |
| VPC Network Bandwidth In (bytes per second) | The rate of data received by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_receive_bytes_total |
| VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_transmit_bytes_total |
| PCIe Bandwidth (bytes per second) | The rate of data transfer (Tx + Rx) between the CPU host memory and the GPU over the PCIe bus, measured in bytes per second. | DCGM_FI_PROF_PCIE_TX_BYTES + DCGM_FI_PROF_PCIE_RX_BYTES |
| PCIe Replay Rate | The rate of error-induced packet retransmissions over the PCIe bus, measured in replays per second. High rates indicate link quality issues. | rate(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[1m]) |
| Uncorrectable ECC Errors | The rate of accumulation of uncorrectable double-bit memory errors (DBE) on the GPU, indicating severe hardware instability. | rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1m]) |
| Correctable ECC Errors | The rate of accumulation of correctable single-bit memory errors (SBE) on the GPU, indicating marginal hardware stability. | rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL[1m]) |
| SM Occupancy (%) | The average percentage of available resident warps running concurrently on the Streaming Multiprocessors (SMs) over the sample period. | DCGM_FI_PROF_SM_OCCUPANCY |
| SM Active (%) | The percentage of time the Streaming Multiprocessors (SMs) were executing instructions during the sample period. | DCGM_FI_PROF_SM_ACTIVE * 100 |
| SM Average Clock Speed (MHz) | The current instantaneous clock frequency of the GPU's Streaming Multiprocessors (SMs), in megahertz (MHz). | DCGM_FI_DEV_SM_CLOCK |
| XID Error | The most recent unique error code emitted by the GPU driver (a hardware or software fault ID). Non-zero values indicate an error that often necessitates a driver reset or GPU restart. | DCGM_FI_DEV_XID_ERRORS |
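The suggested CPU utilization query contains a `vm-id` placeholder that has to be replaced with the ID of a real VM. As a minimal sketch, the Python helper below fills in that placeholder. The function name `cpu_utilization_query` and its `window` parameter are hypothetical and used only for illustration; the PromQL text itself is copied from the table.

```python
def cpu_utilization_query(vm_id: str, window: str = "60s") -> str:
    """Return the suggested CPU utilization PromQL for a single VM.

    Hypothetical helper: only the PromQL text comes from the metrics table.
    """
    # Reject characters that would break out of the quoted label value.
    if '"' in vm_id or "\\" in vm_id:
        raise ValueError("vm_id must not contain quotes or backslashes")

    # Per-second CPU time spent in any mode other than idle.
    busy = (
        f'rate(crusoe_vm_cpu_seconds_total'
        f'{{vm_id="{vm_id}", mode!="idle"}}[{window}])'
    )
    # Per-second CPU time across all modes.
    total = f'rate(crusoe_vm_cpu_seconds_total{{vm_id="{vm_id}"}}[{window}])'

    return (
        f"(sum without(cpu, mode) ({busy})) / "
        f"(sum without(cpu, mode) ({total})) * 100"
    )
```

Calling `cpu_utilization_query("vm-id")` reproduces the query string from the table exactly, so the helper can be checked against the documented form before a real VM ID is substituted.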

If your VM runs on an NVLink-enabled NVIDIA instance, the following NVLink metrics are also supported.

| Metric | Definition | Suggested Query |
| --- | --- | --- |
| GPU NVLink Bandwidth In (bytes per second) | The rate of data received by the GPU from other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_RX_BYTES |
| GPU NVLink Bandwidth Out (bytes per second) | The rate of data transmitted by the GPU to other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_TX_BYTES |

Prerequisites for using the Crusoe Telemetry Agent

Step 1: Install or update Crusoe CLI

Follow the instructions to install and configure the Crusoe CLI. If the CLI is already installed and configured, upgrade it to the latest version.

Step 2: Generate your authentication token

To generate the authentication token for VM metrics, run the following CLI command:

crusoe monitoring tokens create

This command generates an API key (shown as <API-Key> in the examples on this page) that you use both to install the telemetry agent and to authenticate when querying the metrics API. Store the token in a secret or key management tool immediately, because you will not be able to retrieve it later.
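Because the token cannot be retrieved again, scripts that query the metrics API should read it from a secret store rather than hard-coding it. The sketch below shows one simple approach, assuming the token has been exported into an environment variable by your secret manager. The variable name `CRUSOE_MONITORING_TOKEN` and the function `load_api_key` are hypothetical; they are not defined by Crusoe.

```python
import os


def load_api_key(env_var: str = "CRUSOE_MONITORING_TOKEN") -> str:
    """Read the monitoring token from the environment.

    The variable name is an assumption made for this example only.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(
            f"{env_var} is not set; export the token created with "
            "'crusoe monitoring tokens create'"
        )
    return token
```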

Install Crusoe Telemetry Agent for your VMs

Step 1: Verify your VM image version

The Crusoe Telemetry Agent only supports VMs with the following base images:

  • ubuntu22.04-nvidia-sxm-docker
  • ubuntu22.04-nvidia-pcie-docker
  • ubuntu22.04 (only if it is for a non-GPU VM)

Step 2: Install agent

Follow the installation instructions to install the Crusoe Telemetry Agent on each VM.

Querying VM metrics via API

You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

The following example curl command retrieves the most recent GPU utilization data point for a project:

curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
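The same request can be issued from a script. The following is a minimal Python sketch using only the standard library, not an official client. The function names `build_metrics_request` and `fetch_metrics` are hypothetical. Only the `query` parameter shown in the curl example is used, and the response is returned as decoded JSON without further parsing, since its exact shape is not described on this page.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the endpoint documented above.
API_BASE = "https://api.crusoecloud.com/v1alpha5"


def build_metrics_request(
    project_id: str, api_key: str, query: str
) -> urllib.request.Request:
    """Build a GET request for the metrics endpoint (hypothetical helper)."""
    project = urllib.parse.quote(project_id, safe="")
    # urlencode escapes braces, quotes, and spaces found in longer PromQL.
    params = urllib.parse.urlencode({"query": query})
    url = f"{API_BASE}/projects/{project}/metrics/timeseries?{params}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def fetch_metrics(
    project_id: str, api_key: str, query: str, timeout: float = 10.0
):
    """Send the request and return the decoded JSON body."""
    request = build_metrics_request(project_id, api_key, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_metrics("<project-id>", "<API-Key>", "DCGM_FI_DEV_GPU_UTIL")` sends the equivalent of the curl command once the placeholders are replaced with real values.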

Importing data into Grafana

To import data into your own Grafana instance, add a Prometheus data source with the following options:

Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Authentication → HTTP Headers:

Header: Authorization
Value: Bearer <API-Key>

Use the API key generated in "Step 2: Generate your authentication token" above.
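If you manage Grafana data sources programmatically rather than through the UI, the same settings can be expressed as a data source definition. The sketch below builds a payload in the shape Grafana's data source HTTP API accepts (`POST /api/datasources`), using the `httpHeaderName1` and `httpHeaderValue1` fields for the custom header. The function name and the data source name "Crusoe VM Metrics" are hypothetical, and the field layout should be verified against the documentation for your Grafana version.

```python
import json


def grafana_datasource_payload(project_id: str, api_key: str) -> dict:
    """Build a Prometheus data source definition for Grafana.

    Hypothetical helper: the name below is illustrative only.
    """
    url = (
        "https://api.crusoecloud.com/v1alpha5/projects/"
        f"{project_id}/metrics/timeseries"
    )
    return {
        "name": "Crusoe VM Metrics",  # assumption: any unique name works
        "type": "prometheus",
        "access": "proxy",  # Grafana's backend sends the requests
        "url": url,
        # Custom header name goes in jsonData, its value in secureJsonData
        # so that Grafana stores the token encrypted.
        "jsonData": {"httpHeaderName1": "Authorization"},
        "secureJsonData": {"httpHeaderValue1": f"Bearer {api_key}"},
    }


if __name__ == "__main__":
    # Placeholders are printed as-is; replace them with real values.
    print(json.dumps(grafana_datasource_payload("<project-id>", "<API-Key>"), indent=2))
```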