CMK Metrics

Crusoe Managed Kubernetes (CMK) node pool metrics give you insight into the performance and utilization of your clusters and nodes. Use them to monitor cluster performance, identify bottlenecks, and optimize resource utilization. Metrics are generally available for both NVIDIA GPU node pools and non-GPU node pools.

Crusoe Cloud collects DCGM (GPU), CPU, memory, disk, and network metrics from your cluster nodes. Data collection requires installing the Crusoe Watch Agent in each cluster. The agent is a vector.dev-based DaemonSet that deploys one pod per node in your cluster. Metrics are collected and published at 60-second intervals and are retained for 30 days.

A subset of critical metrics is available in the Crusoe Console at the cluster level and at the node/VM level. To view aggregated cluster-level metrics, open the Orchestration tab in the left navigation menu, select your cluster, then select the Metrics tab in the top navigation menu. To view node-specific metrics, select the node pool on the cluster's Details page, select the node/VM from the list to open its page, then select the Metrics tab in the top navigation menu. These metrics, along with additional DCGM exporter metrics, are also accessible via a Prometheus-compatible query API.

The table below lists the aggregated metrics available in Crusoe Console > Cluster Metrics View. Node-level metrics are available in the VM metrics view; see the VM metrics page for the full list of node-level metrics and their query parameters.

Metrics	Definition	Suggested Query
Cluster TFLOPS (FP16)	Average 16-bit GPU throughput (TFLOPS) across a node pool, based on scaled Tensor Core utilization.	`avg by (nodepool) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{cluster_id="${clusterId}"}[60s]))` * theoretical max TFLOPS
Average GPU Utilization (%)	The GPU utilization, averaged across all GPUs within a node pool.	`avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s]))`
Cumulative Power Usage (kW)	Power usage summed across all nodes in a node pool.	`sum by (nodepool) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{cluster_id="${clusterId}"}[60s])) / 1000`
Average GPU Memory Utilization (%)	The GPU memory utilization, averaged across all GPUs within a node pool.	`avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s]))`
Average CPU Utilization (%)	The CPU utilization, averaged across all CPU cores within a node pool.	`(sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}", mode!="idle"}[60s]))) / (sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}"}[60s]))) * 100`
Aggregated VPC Network Bandwidth In (bytes per second)	The rate of data received via the VPC network interface, aggregated across all nodes in a node pool.	`sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s]))`
Aggregated VPC Network Bandwidth Out (bytes per second)	The rate of data transmitted via the VPC network interface, aggregated across all nodes in a node pool.	`sum by (nodepool) (rate(crusoe_vm_network_transmit_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s]))`
Uncorrectable ECC Error Rate	The rate of uncorrectable double-bit memory errors (DBE) accumulated across all GPUs in a node pool.	`sum by (nodepool) (rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{cluster_id="${clusterId}"}[RANGE]))`
Correctable ECC Error Rate	The rate of correctable single-bit memory errors (SBE) accumulated across all GPUs in a node pool.	`sum by (nodepool) (rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{cluster_id="${clusterId}"}[60s]))`
InfiniBand Throughput Rx (bytes per second)	The rate of data received via InfiniBand port, aggregated across all nodes in a cluster.	`sum by(cluster_id) (rate(crusoe_ib_port_throughput_rx{cluster_id="${clusterId}"}[5m]) / 1000)`
InfiniBand Throughput Tx (bytes per second)	The rate of data transmitted via InfiniBand port, aggregated across all nodes in a cluster.	`sum by(cluster_id) (rate(crusoe_ib_port_throughput_tx{cluster_id="${clusterId}"}[5m]) / 1000)`
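
The Cluster TFLOPS row multiplies an averaged Tensor Core utilization by a theoretical peak. The sketch below works through that arithmetic for one data point. Both input numbers are hypothetical: the utilization value stands in for a query result (DCGM reports DCGM_FI_PROF_PIPE_TENSOR_ACTIVE as a ratio between 0 and 1), and the peak value is a placeholder for the FP16 figure published for your GPU model.

```shell
# Hypothetical inputs; neither number comes from the Crusoe documentation.
tensor_active=0.42   # averaged DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (ratio, 0 to 1)
peak_tflops=1000     # placeholder for your GPU's theoretical FP16 peak TFLOPS

estimate_tflops() {
  # Multiply the utilization ratio ($1) by the theoretical peak ($2).
  awk -v u="$1" -v p="$2" 'BEGIN { printf "%.1f\n", u * p }'
}

estimate_tflops "$tensor_active" "$peak_tflops"
```

With these placeholder inputs the estimate is 420.0 TFLOPS; substitute the value returned by the query and the peak for your hardware.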

XID error logs with error details are also available in Crusoe Console > Cluster Metrics View.

Installing Crusoe Watch Agent for CMK Cluster

Step 1: Create a CMK Cluster with a supported Kubernetes version and the NVIDIA GPU Operator

When creating a CMK cluster using the CLI, API, or UI, make sure that you install the NVIDIA GPU Operator as an add-on for the cluster. Additionally, to access the full list of metrics, use one of the following Kubernetes versions:

30.8-cmk.36    stable
31.7-cmk.13    stable
32.7-cmk.16    stable
33.4-cmk.13    stable

Step 2: Install or update Crusoe CLI

Follow the instructions to install and configure the Crusoe CLI. If the CLI is already installed and configured, upgrade it to the latest version.

Step 3: Install or verify Helm installation

Follow the instructions to install Helm.

Step 4: Deploy Crusoe Watch Agent

Make sure you have switched your Kubernetes context to the correct cluster:

crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>
kubectl config current-context

Then deploy the agent:

helm repo add crusoe-watch-agent https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts
helm repo update
helm install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system

To upgrade an existing installation to the latest version:

helm repo update
helm search repo crusoe-watch-agent/crusoe-watch-agent --versions | head -n 2 # check the latest agent version
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system

To verify the installation:

kubectl get pods -n crusoe-system
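
Because the agent runs as a DaemonSet with one pod per node, a quick sanity check is to compare the node count with the number of running pods in the namespace. The sketch below is a minimal illustration and not an official verification procedure. The function names are hypothetical, and it assumes that crusoe-system contains only agent pods and that no node is excluded by taints or selectors; adjust it if your cluster differs.

```shell
# Compare a node count ($1) with a running-pod count ($2).
coverage_status() {
  nodes="$1"
  pods="$2"
  if [ "$nodes" -eq "$pods" ]; then
    echo "ok: $pods of $nodes nodes have an agent pod"
  else
    echo "mismatch: $pods of $nodes nodes have an agent pod"
    return 1
  fi
}

# Gather both counts from the current kubectl context and compare them.
# Assumption: every Running pod in crusoe-system is an agent pod.
check_agent_coverage() {
  node_count=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  pod_count=$(kubectl get pods -n crusoe-system --no-headers \
    --field-selector=status.phase=Running | wc -l | tr -d ' ')
  coverage_status "$node_count" "$pod_count"
}
```

Run check_agent_coverage after the install or upgrade has had time to roll out; a mismatch is a prompt to inspect the pods, not proof of a failure.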

Querying metrics via API

Pre-requisite: Generate your monitoring token

To generate the monitoring token, run the following CLI command:

crusoe monitoring tokens create

This command generates an API-Key that you use to authenticate when querying the metrics API. Store the token in a secret or key management tool, because you cannot retrieve it later.

API endpoint

You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Here is an example curl command that retrieves the most recent data point for GPU utilization in a project:

curl -G "https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries" \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
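
The suggested queries in the metrics table contain spaces, braces, and quotes, so they need URL encoding before they are sent. The sketch below shows one way to do that with curl's --data-urlencode option, using the VPC network receive query from the table. The project ID, cluster ID, the run_query function name, and the CRUSOE_MONITORING_TOKEN environment variable are placeholders chosen for illustration.

```shell
# Placeholders: replace these with your own project and cluster IDs.
PROJECT_ID="your-project-id"
CLUSTER_ID="your-cluster-id"
ENDPOINT="https://api.crusoecloud.com/v1alpha5/projects/${PROJECT_ID}/metrics/timeseries"

# A suggested query from the table, with the cluster ID filled in.
QUERY="sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total{cluster_id=\"${CLUSTER_ID}\", device!~\"lo\"}[60s]))"

run_query() {
  # Assumes CRUSOE_MONITORING_TOKEN holds the API-Key generated earlier.
  # -G sends the URL-encoded query as a GET query string.
  curl -sS -G "$ENDPOINT" \
    --data-urlencode "query=${QUERY}" \
    -H "Authorization: Bearer ${CRUSOE_MONITORING_TOKEN}"
}

# Print the final query so it can be checked before it is sent.
printf '%s\n' "$QUERY"
```

Call run_query once the placeholders are set and the token is exported in your shell.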

Importing data into Grafana

To import data into your own Grafana instance, add a Prometheus data source with the following options:

Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Authentication → HTTP Headers:

Header: Authorization
Value: Bearer <API-Key>

Use the API-Key generated in the 'Generate your monitoring token' section.
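
If you manage Grafana through provisioning files rather than the UI, the same options can be written as a data source definition. The sketch below generates such a file. It relies on Grafana's provisioning format for custom HTTP headers (httpHeaderName1 and httpHeaderValue1) and on Grafana's environment variable interpolation; the data source name, output path, project ID, and the CRUSOE_MONITORING_TOKEN variable name are assumptions made for illustration.

```shell
PROJECT_ID="your-project-id"   # placeholder: replace with your project ID
OUT_FILE="${TMPDIR:-/tmp}/crusoe-datasource.yaml"

# Write a Grafana data source provisioning file.
# The escaped dollar sign keeps the variable reference literal in the file.
cat > "$OUT_FILE" <<EOF
apiVersion: 1
datasources:
  - name: Crusoe Metrics
    type: prometheus
    access: proxy
    url: https://api.crusoecloud.com/v1alpha5/projects/${PROJECT_ID}/metrics/timeseries
    jsonData:
      httpHeaderName1: Authorization
    secureJsonData:
      # Grafana substitutes this variable from its own environment at startup.
      httpHeaderValue1: Bearer \$CRUSOE_MONITORING_TOKEN
EOF

echo "wrote $OUT_FILE"
```

Copy the generated file into Grafana's provisioning/datasources directory and set CRUSOE_MONITORING_TOKEN in the Grafana process environment, so the token itself never appears in the file.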

Known issues

Parsing errors caused by special characters

To prevent parsing errors caused by special characters like the dollar sign ($) in the CMK monitoring token during Helm chart deployment, reference the token from a Kubernetes Secret using the secretKeyRef mechanism. First, store the token securely in a Kubernetes Secret object. Then, configure your deployment's environment variables to reference the Secret; this ensures the raw, unescaped token is injected directly into your application container at runtime. Below is an example configuration snippet:

# In your application's Deployment or Pod manifest
env:
  - name: CRUSOE_MONITORING_TOKEN
    valueFrom:
      secretKeyRef:
        name: crusoe-monitoring-token # This must match the name of your Kubernetes Secret
        key: CRUSOE_MONITORING_TOKEN # This must match the key used inside the Secret object
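
The manifest above expects a Secret named crusoe-monitoring-token to exist already. One way to create it is sketched below with kubectl create secret generic. The token value is a made-up example and the create_token_secret function name is hypothetical. The single quotes matter: they stop the shell from treating the dollar sign as the start of a variable.

```shell
# Hypothetical token containing a dollar sign; replace with your real token.
TOKEN='example$token'

create_token_secret() {
  # $1: namespace of the workload that reads the Secret.
  kubectl create secret generic crusoe-monitoring-token \
    --namespace "$1" \
    --from-literal=CRUSOE_MONITORING_TOKEN="$TOKEN"
}

# Confirm the dollar sign survived shell quoting before creating the Secret.
printf '%s\n' "$TOKEN"
```

Call create_token_secret with the namespace of your Deployment; the Secret name and key then match the secretKeyRef fields in the manifest.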
