CMK Metrics
Crusoe Managed Kubernetes (CMK) node pool metrics give you insight into the performance and utilization of your clusters and nodes. Use them to monitor cluster performance, identify bottlenecks, and optimize resource utilization. Data is generally available for both NVIDIA GPU and non-GPU node pools.
Crusoe Cloud collects DCGM (GPU), CPU, memory, disk, and network metrics for your cluster nodes. Data collection requires installing the Crusoe Watch Agent in each cluster. The agent is a vector.dev-based DaemonSet that deploys one pod per node in your cluster. Metrics are collected and published at 60-second intervals and are retained for 30 days.
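As a rough sizing aid, the collection interval and retention period stated above imply an upper bound on how many samples a single series can hold. The sketch below works that out; the function name is illustrative and not part of any Crusoe tooling, and it assumes a series is reported without gaps.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_samples_per_series(interval_seconds: int = 60, retention_days: int = 30) -> int:
    """Upper bound on retained samples for one series, assuming no gaps."""
    if interval_seconds <= 0 or retention_days < 0:
        raise ValueError("interval must be positive and retention non-negative")
    return (retention_days * SECONDS_PER_DAY) // interval_seconds
```

With the defaults (60-second interval, 30-day retention), this comes to 43,200 samples per series.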
A subset of critical metrics is available in the Crusoe Console at the cluster level, the node/VM level, or both. To view aggregated cluster-level metrics, open the Orchestration tab in the left navigation menu, select your cluster, then select the Metrics tab in the top navigation menu. To view node-specific metrics, select the node pool on the cluster's Details page, select the node/VM from the list to open its page, then select the Metrics tab in the top navigation menu. These metrics, along with additional DCGM exporter metrics, are also accessible via a Prometheus-compatible query API.
The table below lists the aggregated metrics available in Crusoe Console > Cluster Metrics View. Node-level metrics are available in the VM metrics view; visit the VM metrics page for the full list of node-level metrics and their query parameters.
| Metrics | Definition | Suggested Query |
|---|---|---|
| Cluster TFLOPS (FP16) | Average 16-bit GPU throughput (TFLOPS) across a node pool, based on scaled Tensor Core utilization. | avg by (nodepool) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{cluster_id="${clusterId}"}[60s])) * theoretical max TFLOPS |
| Average GPU Utilization (%) | The GPU utilization, averaged across all GPUs within a node pool. | avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s])) |
| Cumulative Power Usage (kW) | Cumulative power usage, summed across all nodes in a node pool. | sum by (nodepool) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{cluster_id="${clusterId}"}[60s])) / 1000 |
| Average GPU Memory Utilization (%) | The GPU memory utilization, averaged across all GPUs within a node pool. | avg by (nodepool) (avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL{cluster_id="${clusterId}"}[60s])) |
| Average CPU Utilization (%) | The CPU utilization, averaged across all CPU cores within a node pool. | (sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}", mode!="idle"}[60s]))) / (sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}"}[60s]))) * 100 |
| Aggregated VPC Network Bandwidth In (bytes per second) | The rate of data received via the VPC network interface, aggregated across all nodes in a node pool. | sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s])) |
| Aggregated VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted via the VPC network interface, aggregated across all nodes in a node pool. | sum by (nodepool) (rate(crusoe_vm_network_transmit_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s])) |
| Uncorrectable ECC Error Rate | The rate of uncorrectable double-bit memory errors (DBE) accumulated across all GPUs in a node pool. | sum by (nodepool) (rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{cluster_id="${clusterId}"}[RANGE])) |
| Correctable ECC Error Rate | The rate of correctable single-bit memory errors (SBE) accumulated across all GPUs in a node pool. | sum by (nodepool) (rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{cluster_id="${clusterId}"}[60s])) |
| InfiniBand Throughput Rx (bytes per second) | The rate of data received via InfiniBand port, aggregated across all nodes in a cluster. | sum by(cluster_id) (rate(crusoe_ib_port_throughput_rx{cluster_id="${clusterId}"}[5m]) / 1000) |
| InfiniBand Throughput Tx (bytes per second) | The rate of data transmitted via InfiniBand port, aggregated across all nodes in a cluster. | sum by(cluster_id) (rate(crusoe_ib_port_throughput_tx{cluster_id="${clusterId}"}[5m]) / 1000) |
XID error logs with error details are also available in Crusoe Console > Cluster Metrics View.
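The suggested queries in the table use a `${clusterId}` placeholder that has to be replaced with a real cluster ID before a query is sent to the API. The following is a minimal sketch of that substitution using Python's standard `string.Template`, which happens to use the same `${name}` syntax. The dictionary keys and the `render_query` helper are hypothetical names chosen for illustration; the two query strings are copied from the table.

```python
from string import Template

# Query templates copied from the table above. The short keys are
# illustrative labels, not official metric names.
QUERY_TEMPLATES = {
    "vpc_rx_bytes_per_second": Template(
        "sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total"
        '{cluster_id="${clusterId}", device!~"lo"}[60s]))'
    ),
    "correctable_ecc_error_rate": Template(
        "sum by (nodepool) (rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL"
        '{cluster_id="${clusterId}"}[60s]))'
    ),
}


def render_query(name: str, cluster_id: str) -> str:
    """Fill the ${clusterId} placeholder of a known query template."""
    # Reject characters that would break out of the quoted label value.
    if '"' in cluster_id or "\\" in cluster_id:
        raise ValueError("cluster_id must not contain quotes or backslashes")
    return QUERY_TEMPLATES[name].substitute(clusterId=cluster_id)
```

Calling `render_query("correctable_ecc_error_rate", "<your-cluster-id>")` returns a query string ready to pass as the `query` parameter of the metrics API.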
Installing the Crusoe Watch Agent for a CMK Cluster
Step 1: Create a CMK cluster with a supported Kubernetes version and the NVIDIA GPU Operator
When creating a CMK cluster using the CLI, API, or UI, make sure you install the NVIDIA GPU Operator as an add-on for the cluster. Additionally, to access the full list of metrics, use one of the following Kubernetes versions:
1.30.8-cmk.36 stable
1.31.7-cmk.13 stable
1.32.7-cmk.16 stable
1.33.4-cmk.13 stable
Step 2: Install or update Crusoe CLI
Follow the instructions to install and configure the Crusoe CLI. If the CLI is already installed and configured, upgrade it to the latest version.
Step 3: Install or verify Helm installation
Follow the instructions to install Helm.
Step 4: Deploy Crusoe Watch Agent
Make sure your Kubernetes context points to the correct cluster:
crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>
kubectl config current-context
Then deploy the agent:
helm repo add crusoe-watch-agent https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts
helm repo update
helm install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
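Because the agent runs as a DaemonSet with one pod per node, a simple post-install check is to compare the number of pods each DaemonSet wants to schedule with the number that are ready. The sketch below reads the standard `status.desiredNumberScheduled` and `status.numberReady` fields from `kubectl get daemonsets --output json`. The helper names are hypothetical, and it assumes `kubectl` is on your PATH and that the agent lives in the `crusoe-system` namespace used in the install command above.

```python
import json
import subprocess


def daemonset_rollout_complete(daemonset: dict) -> bool:
    """True when every scheduled pod of the DaemonSet reports ready."""
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    return desired > 0 and ready == desired


def list_daemonsets(namespace: str = "crusoe-system") -> list:
    """Return the DaemonSet objects in a namespace via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "daemonsets", "--namespace", namespace, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)["items"]


def report(namespace: str = "crusoe-system") -> None:
    """Print a ready / not ready line per DaemonSet in the namespace."""
    for daemonset in list_daemonsets(namespace):
        name = daemonset["metadata"]["name"]
        state = "ready" if daemonset_rollout_complete(daemonset) else "not ready"
        print(f"{name}: {state}")
```

Run `report()` after the Helm install finishes; a DaemonSet that stays "not ready" usually means at least one node has no running agent pod yet.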
Querying metrics via API
Prerequisite: Generate your monitoring token
To generate the monitoring token, run the following CLI command:
crusoe monitoring tokens create
This command generates an API-Key that you'll use for authentication when querying the metrics API. Store the token in a secret or key management tool, because you will not be able to retrieve it later.
API endpoint
You can directly query the metrics API endpoint to retrieve data for a single instant or a specific time range. The API endpoint is:
https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Here is an example curl command that retrieves the most recent data point for GPU utilization in a project:
curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <API-Key>'
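The same instant query can be issued from a script. The sketch below uses only the Python standard library and mirrors the curl example: it URL-encodes the PromQL expression into the `query` parameter and sends the token as a Bearer header. The function names are hypothetical, and the response is assumed to be JSON because the API is described as Prometheus-compatible; the exact response shape is not specified here.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.crusoecloud.com/v1alpha5"


def build_instant_query_url(project_id: str, query: str) -> str:
    """Build the timeseries URL with the PromQL expression URL-encoded."""
    project = urllib.parse.quote(project_id, safe="")
    params = urllib.parse.urlencode({"query": query})
    return f"{API_BASE}/projects/{project}/metrics/timeseries?{params}"


def fetch_instant(project_id: str, query: str, api_key: str, timeout: float = 10.0):
    """Send the query with Bearer authentication and decode the JSON body."""
    request = urllib.request.Request(
        build_instant_query_url(project_id, query),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Encoding the query matters once label matchers are involved: braces, quotes, and equals signs in an expression such as `DCGM_FI_DEV_GPU_UTIL{cluster_id="..."}` are not safe to paste into a URL unescaped.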
Importing data into Grafana
To import data into your own Grafana instance, add a Prometheus data source with the following options:
Prometheus Server URL: https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Authentication → HTTP Headers:
Header: Authorization
Value: Bearer <API-Key>
Use the API-Key generated in the 'Generate your monitoring token' section.
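If you manage Grafana data sources programmatically rather than through the UI, the options above map onto the body accepted by Grafana's data source HTTP API (`POST /api/datasources`). The sketch below builds that body; `httpHeaderName1` and `httpHeaderValue1` are Grafana's fields for custom headers, while the function name and the default data source name are assumptions made for illustration.

```python
def grafana_datasource_payload(
    project_id: str, api_key: str, name: str = "Crusoe Metrics"
) -> dict:
    """Body for Grafana's POST /api/datasources with a Bearer auth header."""
    return {
        "name": name,  # hypothetical display name
        "type": "prometheus",
        "access": "proxy",
        "url": (
            "https://api.crusoecloud.com/v1alpha5/projects/"
            f"{project_id}/metrics/timeseries"
        ),
        # The header name is stored in plain settings, the value in secure settings.
        "jsonData": {"httpHeaderName1": "Authorization"},
        "secureJsonData": {"httpHeaderValue1": f"Bearer {api_key}"},
    }
```

Serialize the result with `json.dumps` and send it to your Grafana instance using credentials that are allowed to create data sources.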
Known issues
Parsing errors caused by special characters
To prevent parsing errors caused by special characters like the dollar sign ($) in the CMK monitoring token during Helm chart deployment, reference the token from a Kubernetes Secret using the secretKeyRef mechanism. First, store the token securely in a Kubernetes Secret object. Then, configure your deployment's environment variables to reference the Secret; this ensures the raw, unescaped token is injected directly into your application container at runtime. Below is an example configuration snippet:
# In your application's Deployment or Pod manifest
env:
  - name: CRUSOE_MONITORING_TOKEN
    valueFrom:
      secretKeyRef:
        name: crusoe-monitoring-token  # This must match the name of your Kubernetes Secret
        key: CRUSOE_MONITORING_TOKEN   # This must match the key used inside the Secret object
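The Secret referenced by that snippet has to exist first. One way to create it without the token passing through shell or template interpolation is to generate the manifest in code: values under `data` in a Kubernetes Secret are base64-encoded, so a `$` in the token is carried through untouched. This is a minimal sketch; the helper name is hypothetical, and the `crusoe-system` namespace is an assumption based on the Helm install command (the Secret must live in the same namespace as the pod that references it).

```python
import base64


def build_token_secret(token: str, namespace: str = "crusoe-system") -> dict:
    """Build a Secret manifest matching the secretKeyRef example above."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "crusoe-monitoring-token", "namespace": namespace},
        "type": "Opaque",
        # Secret data values are base64-encoded, so "$" needs no escaping.
        "data": {"CRUSOE_MONITORING_TOKEN": encoded},
    }
```

The resulting dictionary can be written out with `json.dumps` and applied with `kubectl apply -f`, which accepts JSON manifests as well as YAML.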