Virtual Machines Metrics
Virtual Machine metrics are currently in preview and are available for VMs with NVIDIA instances and for non-GPU VMs. To request access, please contact our sales team.
Virtual machine (VM) metrics are designed to provide you with comprehensive insights into the performance and utilization of your virtual machines. The metrics help you monitor VM performance, identify bottlenecks, and optimize resource utilization as needed.
Crusoe Cloud collects GPU, CPU, memory, disk, and network metrics as part of the VM metrics. Collection requires the Crusoe Telemetry Agent to be installed on each VM. VM metrics are collected and published every 60 seconds. A subset of critical metrics, listed in the table below, can be viewed in the Crusoe Console: navigate to the Compute tab in the left sidebar, select your VM, then select the Metrics sub-tab in the top navigation row. These metrics, along with additional DCGM exporter metrics, are also accessible via a Prometheus-compatible query API.
Metrics | Definition | Suggested Query |
---|---|---|
TFlops (FP16) | The measured 16-bit floating-point GPU throughput, calculated by scaling the tensor core utilization against the hardware's theoretical maximum. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE / 100 * theoretical max TFLOPS |
GPU Utilization (%) | The percentage of time the GPU is actively executing tasks. | DCGM_FI_DEV_GPU_UTIL |
CPU Utilization (%) | The aggregated percentage of time the host's CPU cores are busy over the last 60 seconds. | (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id", mode!="idle"}[60s]))) / (sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total{vm_id="vm-id"}[60s]))) * 100 |
GPU Memory Utilization (%) | The percentage of total dedicated GPU memory that is actively allocated and consumed by processes on the GPU. | (DCGM_FI_DEV_FB_USED / ( DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) * 100 |
GPU Memory Bandwidth Utilization (%) | The percentage of the theoretical peak memory interface bandwidth being utilized for data transfer between the GPU and memory. | DCGM_FI_PROF_DRAM_ACTIVE |
System Memory Utilization (%) | The percentage of total host system RAM that is consumed. | (crusoe_vm_memory_used_bytes / crusoe_vm_memory_total_bytes) * 100 |
GPU Power Draw (W) | The current power consumption of the GPU, measured in Watts. | DCGM_FI_DEV_POWER_USAGE |
GPU Temperature (Celsius) | The current core temperature of the GPU die, measured in Celsius. | DCGM_FI_DEV_GPU_TEMP |
Tensor Core Utilization (%) | The percentage of time the Tensor pipeline is actively processing instructions over the sample period. | DCGM_FI_PROF_PIPE_TENSOR_ACTIVE * 100 |
VPC Network Bandwidth In (bytes per second) | The rate of data received by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_receive_bytes_total |
VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted by the host machine via the VPC network interface, measured in bytes per second. | crusoe_vm_network_transmit_bytes_total |
PCIe Bandwidth (bytes per second) | The rate of data transfer (Tx + Rx) between the CPU host memory and the GPU over the PCIe bus, measured in bytes per second. | DCGM_FI_PROF_PCIE_TX_BYTES + DCGM_FI_PROF_PCIE_RX_BYTES |
PCIe Replay Rate | The rate of error-induced packet retransmissions over the PCIe bus, measured in replays per second. High rates indicate link quality issues. | rate(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[1m]) |
Uncorrectable ECC Errors | The rate of accumulation of uncorrectable double-bit memory errors (DBE) on the GPU, indicating severe hardware instability. | rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1m]) |
Correctable ECC Errors | The rate of accumulation of correctable single-bit memory errors (SBE) on the GPU, indicating marginal hardware stability. | rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL[1m]) |
SM Occupancy (%) | The average percentage of available resident warps running concurrently on the Streaming Multiprocessors (SMs) over the sample period. | DCGM_FI_PROF_SM_OCCUPANCY |
SM Active (%) | The percentage of time the Streaming Multiprocessors (SMs) were executing instructions during the sample period. | DCGM_FI_PROF_SM_ACTIVE * 100 |
SM Average Clock Speed (MHz) | The current instantaneous clock frequency of the GPU's Streaming Multiprocessors (SMs) in Megahertz (MHz). | DCGM_FI_DEV_SM_CLOCK |
XID error | The most recent unique error code emitted by the GPU driver (a hardware or software fault ID). Non-zero values indicate an error that often necessitates a driver reset or GPU restart. | DCGM_FI_DEV_XID_ERRORS |
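The CPU Utilization query in the table above contains a `vm-id` placeholder that has to be replaced before the query can be run. A minimal Python sketch of filling it in is shown here; the helper name `cpu_utilization_query` and the example VM ID are hypothetical, and the PromQL text itself is copied from the table.

```python
# PromQL text copied from the CPU Utilization (%) row, with the VM ID and
# the rate window turned into placeholders.
CPU_UTILIZATION_TEMPLATE = (
    '(sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total'
    '{vm_id="VM_ID", mode!="idle"}[WINDOW]))) / '
    '(sum without(cpu, mode) (rate(crusoe_vm_cpu_seconds_total'
    '{vm_id="VM_ID"}[WINDOW]))) * 100'
)


def cpu_utilization_query(vm_id: str, window: str = "60s") -> str:
    """Return the CPU utilization query for a single VM (hypothetical helper)."""
    if '"' in vm_id:
        # A double quote would terminate the label matcher early.
        raise ValueError("vm_id must not contain double quotes")
    return CPU_UTILIZATION_TEMPLATE.replace("VM_ID", vm_id).replace("WINDOW", window)


if __name__ == "__main__":
    # "abc123" is a made-up VM ID used only for illustration.
    print(cpu_utilization_query("abc123"))
```

The default window of `60s` matches the 60-second collection interval described above; a longer window smooths the result at the cost of responsiveness.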
If your VM uses NVLink-enabled NVIDIA instances, the following NVLink metrics are also supported.
Metrics | Definition | Suggested Query |
---|---|---|
GPU NVLink Bandwidth In (bytes per second) | The rate of data received by the GPU from other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_RX_BYTES |
GPU NVLink Bandwidth Out (bytes per second) | The rate of data transmitted by the GPU to other GPUs over all active NVLink connections, measured in bytes per second. | DCGM_FI_DEV_NVLINK_TX_BYTES |
Prerequisites for using the Crusoe Telemetry Agent
Step 1: Install or update Crusoe CLI
Please follow the instructions to install and configure Crusoe CLI. If you already have the CLI installed and configured, please ensure you upgrade to the latest version.
Step 2: Generate your authentication token
To generate the authentication token for VM metrics, run the following CLI command:
crusoe monitoring tokens create
This command generates an API-Key that you'll use for telemetry agent installation and for authentication when querying the metrics API. Store the token in a secret or key management tool; you will not be able to retrieve it later.
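One common way to keep the token out of scripts is to have your secret manager expose it as an environment variable and read it at run time. The Python sketch below assumes that setup; the variable name `CRUSOE_MONITORING_TOKEN` and the helper `auth_header` are hypothetical and not defined by Crusoe. Only the `Authorization: Bearer <API-Key>` header format comes from this page.

```python
import os
from typing import Mapping

# Hypothetical variable name chosen for this example; use whatever your
# secret manager exports.
TOKEN_ENV_VAR = "CRUSOE_MONITORING_TOKEN"


def auth_header(env: Mapping[str, str] = os.environ) -> dict:
    """Build the Authorization header from a token held in the environment."""
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"{TOKEN_ENV_VAR} is not set")
    return {"Authorization": f"Bearer {token}"}
```

Passing the environment in as a parameter keeps the function easy to exercise with a plain dictionary instead of the real process environment.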
Install Crusoe Telemetry Agent for your VMs
Step 1: Verify your VM image version
The Crusoe Telemetry Agent only supports VMs with the following base images:
ubuntu22.04-nvidia-sxm-docker
ubuntu22.04-nvidia-pcie-docker
ubuntu22.04 (only for non-GPU VMs)
Step 2: Install agent
Follow the installation instructions to install the Crusoe Telemetry Agent.
Querying VM metrics via API
You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:
https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Here is an example curl command that retrieves the most recent data point for GPU utilization in a project:
curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
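The same request can be sketched in Python using only the standard library. The endpoint, the `query` parameter, and the bearer header are taken from the curl example above. The response parsing is an assumption: it expects the standard Prometheus instant-vector JSON shape (`status`, `data.result[].metric`, `data.result[].value`), which this page does not confirm. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.crusoecloud.com/v1alpha5/projects/{project_id}/metrics/timeseries"


def build_request(project_id: str, api_key: str, promql: str) -> urllib.request.Request:
    """Build a GET request with the PromQL expression URL-encoded."""
    params = urllib.parse.urlencode({"query": promql})
    url = BASE_URL.format(project_id=project_id) + "?" + params
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def latest_values(payload: dict) -> list:
    """Extract (labels, value) pairs.

    Assumes a Prometheus-style instant-vector payload; adjust if the
    actual response differs.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query(project_id: str, api_key: str, promql: str) -> list:
    """Send the request and parse the response (performs network I/O)."""
    with urllib.request.urlopen(build_request(project_id, api_key, promql), timeout=30) as resp:
        return latest_values(json.load(resp))
```

URL-encoding matters most for the longer queries in the metrics table, which contain braces, quotes, and spaces that are not valid in a raw URL.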
Importing data into Grafana
To import data into your own Grafana instance, add a Prometheus data source with the following options:
Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Authentication → HTTP Headers:
Header: Authorization
Value: Bearer <API-Key>
Use the API-Key generated in the 'Step 2: Generate your authentication token' section.