CMK Metrics

info

Crusoe Managed Kubernetes (CMK) nodepool metrics are currently in preview and available for NVIDIA GPU and non-GPU nodepools. To request access, please contact our sales team.

CMK metrics are designed to provide you with comprehensive insights into the performance and utilization of your clusters and nodes. The metrics help you monitor cluster performance, identify bottlenecks, and optimize resource utilization as needed.

Crusoe Cloud collects DCGM (GPU), CPU, memory, disk, and network metrics for your cluster nodes. Collection requires installing the Crusoe Telemetry Agent, a vector.dev-based DaemonSet that deploys one pod per node in your cluster. Metrics are collected and published every 60 seconds.

A subset of critical metrics is available in the Crusoe Console at the cluster level and at the node/VM level. To view aggregated cluster-level metrics, open the Orchestration tab in the left sidebar, select your cluster, then select the Metrics sub-tab in the top navigation row. To view node-specific metrics, select the nodepool on the cluster's Details page, select the node/VM from the list to open its page, then select the Metrics sub-tab in the top navigation row. These metrics, along with additional DCGM exporter metrics, are also accessible via a Prometheus-compatible query API.

Below is the list of aggregated metrics available in Crusoe Console > Cluster Metrics View. Node-level metrics are available in the VM metrics view. Visit the VM metrics page for the full list of node-level metrics and their query parameters.

| Metric | Definition | Suggested Query |
| --- | --- | --- |
| Cluster TFLOPS (FP16) | Average 16-bit GPU throughput (TFLOPS) across a nodepool, based on scaled Tensor Core utilization. | avg by (nodepool) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{cluster_id="${clusterId}"}[60s])) * theoretical max TFLOPS |
| Average GPU Utilization (%) | The GPU utilization, averaged across all GPUs within a nodepool. | avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s])) |
| Cumulative Power Usage (kW) | Power usage summed across all nodes in a nodepool. | sum by (nodepool) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{cluster_id="${clusterId}"}[60s])) / 1000 |
| Average GPU Memory Utilization (%) | The GPU memory utilization, averaged across all GPUs within a nodepool. | avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s])) |
| Average CPU Utilization (%) | The CPU utilization, averaged across all CPU cores within a nodepool. | (sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}", mode!="idle"}[60s]))) / (sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}"}[60s]))) * 100 |
| Aggregated VPC Network Bandwidth In (bytes per second) | The rate of data received via the VPC network interface, aggregated across all nodes in a nodepool. | sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s])) |
| Aggregated VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted via the VPC network interface, aggregated across all nodes in a nodepool. | sum by (nodepool) (rate(crusoe_vm_network_transmit_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s])) |
| Uncorrectable ECC Errors | The rate of uncorrectable double-bit memory errors (DBE) accumulated across all GPUs in a nodepool. | sum by (nodepool) (rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{cluster_id="${clusterId}"}[RANGE])) |
| Correctable ECC Errors | The rate of correctable single-bit memory errors (SBE) accumulated across all GPUs in a nodepool. | sum by (nodepool) (rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{cluster_id="${clusterId}"}[60s])) |
| InfiniBand Throughput Rx (GiB/s) | The rate of data received via InfiniBand port, aggregated across all nodes in a cluster. | sum by (cluster_id) (rate(crusoe_ib_port_throughput_rx{cluster_id="${clusterId}"}[5m]) / 1000) |
| InfiniBand Throughput Tx (GiB/s) | The rate of data transmitted via InfiniBand port, aggregated across all nodes in a cluster. | sum by (cluster_id) (rate(crusoe_ib_port_throughput_tx{cluster_id="${clusterId}"}[5m]) / 1000) |
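
The suggested queries use a ${clusterId} placeholder, and the Cluster TFLOPS metric scales Tensor Core activity by a theoretical maximum. The following is a minimal Python sketch of both ideas, not an official client. The function names render_query and scaled_tflops are hypothetical, the peak value of 1000.0 TFLOPS is only an illustrative number, and the sketch assumes DCGM_FI_PROF_PIPE_TENSOR_ACTIVE is reported as a fraction between 0 and 1.

```python
from string import Template

# Query text copied from the table above; ${clusterId} is the placeholder to fill in.
NETWORK_IN_QUERY = Template(
    'sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total'
    '{cluster_id="${clusterId}", device!~"lo"}[60s]))'
)


def render_query(template: Template, cluster_id: str) -> str:
    """Substitute a concrete cluster ID for the ${clusterId} placeholder."""
    return template.substitute(clusterId=cluster_id)


def scaled_tflops(tensor_active: float, peak_tflops: float) -> float:
    """Scale Tensor Core activity by a theoretical peak throughput.

    Assumption: tensor_active is a fraction in [0, 1]. peak_tflops depends
    on your GPU model and must be supplied by you.
    """
    if not 0.0 <= tensor_active <= 1.0:
        raise ValueError("tensor_active must be between 0 and 1")
    return tensor_active * peak_tflops


if __name__ == "__main__":
    # "my-cluster-id" is a placeholder, not a real cluster ID.
    print(render_query(NETWORK_IN_QUERY, "my-cluster-id"))
    # Illustrative numbers only: 25% Tensor Core activity at a 1000 TFLOPS peak.
    print(scaled_tflops(0.25, 1000.0))
```

Using string.Template here is a convenience, since its ${name} syntax happens to match the placeholder style used in the table.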

XID error logs with error details are also available in Crusoe Console > Cluster Metrics View.

Installing Crusoe Telemetry Agent for CMK Cluster

Step 1: Create a CMK Cluster with a supported Kubernetes version and the NVIDIA GPU Operator

When creating a CMK cluster using the CLI, API, or UI, make sure that you install the NVIDIA GPU Operator as an add-on. Additionally, to access the full list of metrics, use one of the following Kubernetes versions:

1.30.8-cmk.36 stable
1.31.7-cmk.13 stable
1.32.7-cmk.8 stable
1.33.4-cmk.4 stable

Step 2: Install or update Crusoe CLI

Follow the instructions to install and configure the Crusoe CLI. If you already have the CLI installed and configured, upgrade to the latest version.

Step 3: Install or verify Helm installation

Follow the instructions to install Helm.

Step 4: Deploy Crusoe Telemetry Agent

Make sure you have switched your Kubernetes context to the correct cluster:

crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>
kubectl config current-context

Then deploy the agent:

helm repo add crusoe-telemetry-agent https://crusoecloud.github.io/crusoe-telemetry-agent/helm-charts
helm repo update
helm install crusoe-telemetry-agent crusoe-telemetry-agent/crusoe-telemetry-agent --namespace crusoe-system

Querying metrics via API

Prerequisite: Generate your monitoring token

To generate the monitoring token, run the following CLI command:

crusoe monitoring tokens create

This command generates an API-Key that you'll use for authentication when querying the metrics API. Store the token in a secret or key management tool; you will not be able to retrieve it later.

API endpoint

You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Here is an example curl command to retrieve the most recent data point for GPU utilization in a project:

curl -G "https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries" \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
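
The same request can be sketched in Python using only the standard library. This is an illustrative sketch rather than an official client: build_request and latest_values are hypothetical helper names, and the response parsing assumes the API returns the standard Prometheus instant-query JSON shape (data.result entries with metric labels and a [timestamp, value] pair), which this page does not confirm.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.crusoecloud.com/v1alpha5"


def build_request(project_id: str, api_key: str, query: str) -> urllib.request.Request:
    """Build a GET request with a URL-encoded query and a Bearer token header."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{API_BASE}/projects/{project_id}/metrics/timeseries?{params}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def latest_values(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from a response payload.

    Assumption: the payload follows the Prometheus instant-query format.
    """
    results = payload.get("data", {}).get("result", [])
    return [(item["metric"], float(item["value"][1])) for item in results]


if __name__ == "__main__":
    # Replace the placeholders with your own project ID and monitoring token.
    request = build_request("<project-id>", "<API-Key>", "DCGM_FI_DEV_GPU_UTIL")
    print(request.full_url)
    # To send the request for real, uncomment the lines below:
    # with urllib.request.urlopen(request) as response:
    #     print(latest_values(json.load(response)))
```

URL-encoding the query matters once it contains label matchers such as {cluster_id="..."}, since braces and quotes are not safe in a raw URL.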

Importing data into Grafana

To import data into your own Grafana instance, add a Prometheus data source with the following options:

Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Authentication → HTTP Headers:

Header: Authorization
Value: Bearer <API-Key>

Use the API-Key generated in the 'Generate your monitoring token' section.
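
If you manage Grafana data sources programmatically, the same settings can be expressed as a JSON payload. The sketch below assumes Grafana's data source HTTP API and its custom-header fields (httpHeaderName1 under jsonData and httpHeaderValue1 under secureJsonData). The function name grafana_datasource and the data source name "Crusoe Metrics" are hypothetical, so check the field names against the Grafana version you run.

```python
import json


def grafana_datasource(project_id: str, api_key: str, name: str = "Crusoe Metrics") -> dict:
    """Build a Prometheus data source definition with an Authorization header."""
    return {
        "name": name,  # hypothetical display name
        "type": "prometheus",
        "access": "proxy",
        "url": f"https://api.crusoecloud.com/v1alpha5/projects/{project_id}/metrics/timeseries",
        # Grafana pairs httpHeaderName<N> with httpHeaderValue<N>.
        "jsonData": {"httpHeaderName1": "Authorization"},
        "secureJsonData": {"httpHeaderValue1": f"Bearer {api_key}"},
    }


if __name__ == "__main__":
    # Replace the placeholders with your own project ID and monitoring token.
    print(json.dumps(grafana_datasource("<project-id>", "<API-Key>"), indent=2))
```

Keeping the token under secureJsonData means Grafana stores it encrypted rather than returning it in plain text when the data source is read back.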