Metrics
Metrics enable you to monitor GPU, CPU, memory, disk, network, and interconnect performance across your CMK clusters and VMs. Metrics are collected automatically every 60 seconds and retained for 30 days. You can view them in the Console or query them through the Prometheus-compatible API.
You can also ingest custom application metrics for end-to-end visibility from hardware to application performance.
Infrastructure Metrics
Infrastructure metrics are collected automatically and are available for Crusoe Managed Kubernetes (CMK) and Crusoe Virtual Machines (VMs). The 30-day retention period applies to all infrastructure metric types, including:
- Cluster-level metrics — Aggregated views of cluster utilization and performance. See CMK Telemetry for the complete list.
- Node-level metrics — Granular GPU, CPU, memory, disk, and network metrics per node. See VM Telemetry for the complete list.
- InfiniBand metrics — Network throughput, latency, and error rates for IB-connected nodes. See InfiniBand Metrics.
- Storage metrics — IOPS, bandwidth, and capacity for shared disks. See Shared Disk Metrics.
- Load balancer metrics — Traffic throughput and connection metrics. See Load Balancer Metrics.
- Slurm metrics — Job queue, node state, and scheduler metrics for Managed Slurm clusters. See Slurm Metrics.
Custom Metrics
You can ingest custom application metrics alongside infrastructure metrics for end-to-end visibility from hardware to application performance. Custom metrics are available for Crusoe Managed Kubernetes (CMK) only.
What Are Custom Metrics
Custom metrics are application-defined metrics exposed from your workloads. Examples include training loss, learning rate, inference latency, throughput, batch processing times, and checkpoint frequency.
Exposing Custom Metrics
To expose custom metrics, format them in Prometheus format on an HTTP endpoint and annotate your pods to enable scraping:
apiVersion: v1
kind: Pod
metadata:
  name: my-training-job  # placeholder; Kubernetes requires a Pod name
  annotations:
    crusoe.ai/scrape: "true"  # enable custom metrics collection
    crusoe.ai/port: "8080"  # port on which metrics are exposed
    crusoe.ai/path: "/my-app/metrics"  # path on which metrics are exposed
spec:
  containers:
    - name: my-training-job
      image: my-training-image:latest
      ports:
        - containerPort: 8080
Custom metrics are available through the same API endpoint as infrastructure metrics and can be queried using PromQL.
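As a rough illustration of what "Prometheus format on an HTTP endpoint" can look like, the sketch below serves two gauges using only the Python standard library. The port and path mirror the annotations in the manifest above. The metric names, the `METRICS` dictionary, and the `render_metrics` and `serve` helpers are hypothetical and not part of any Crusoe API. A production workload would more likely use an established client library such as `prometheus_client`.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values; a real job would update these as it trains.
METRICS = {"training_loss": 0.4213, "learning_rate": 0.0003}

METRICS_PATH = "/my-app/metrics"  # matches the crusoe.ai/path annotation
METRICS_PORT = 8080               # matches the crusoe.ai/port annotation


def render_metrics(metrics):
    """Render a dict of gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the annotated path serves metrics; everything else is a 404.
        if self.path != METRICS_PATH:
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve():
    """Block forever, serving metrics on all interfaces."""
    HTTPServer(("0.0.0.0", METRICS_PORT), MetricsHandler).serve_forever()
```

Calling `serve()` from a background thread in the training process keeps the endpoint available while the job updates `METRICS`.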
Prerequisites
To use Metrics, you need a CMK cluster or VM with the Crusoe Watch Agent installed (see Installing the Crusoe Watch Agent). CMK clusters with GPU nodes also need the NVIDIA GPU Operator add-on.
Viewing Metrics
- Console
- API
- Grafana
Viewing Metrics in the Console
You can view a curated subset of the most critical infrastructure metrics in the Console.
Cluster-level metrics:
Navigate to Orchestration > select your cluster > Metrics sub-tab to view aggregated GPU utilization, CPU utilization, and memory usage across all nodes. Available time windows range from 1 hour to 30 days.
Node-level metrics:
- From the cluster Metrics view, click a node or navigate from Topology.
- View time-series graphs for GPU, CPU, memory, and network metrics.
Custom metrics are not available in the Console. Use the API or Grafana to query custom metrics.
Querying Metrics via API
You can query both infrastructure and custom metrics via the Prometheus-compatible API.
Generate Monitoring Token
crusoe monitoring tokens create
Store the generated monitoring token securely. You cannot retrieve it later.
API Endpoint
Query the metrics API endpoint:
https://api.cloud.crusoe.ai/v1/projects/<project-id>/metrics/timeseries
Example — Retrieve the most recent GPU utilization:
curl -G "https://api.cloud.crusoe.ai/v1/projects/<project-id>/metrics/timeseries" \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H 'Authorization: Bearer <monitoring-token>'
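For scripts that prefer Python over curl, the same request could be built roughly as follows. This is a minimal sketch using only the standard library. The helper names `build_query_request` and `run_query` are hypothetical, and the assumption that the endpoint returns a JSON body is based on the API being described as Prometheus-compatible, not on anything this page states explicitly.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.cloud.crusoe.ai/v1/projects/{project_id}/metrics/timeseries"


def build_query_request(project_id, token, promql):
    """Build a GET request equivalent to the curl example."""
    # urlencode escapes characters such as parentheses and spaces in PromQL.
    query = urllib.parse.urlencode({"query": promql})
    url = BASE_URL.format(project_id=project_id) + "?" + query
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_query(project_id, token, promql, timeout=30):
    """Send the request and decode the body (assumes a JSON response)."""
    request = build_query_request(project_id, token, promql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from the network call makes the URL encoding easy to inspect before any token is sent over the wire.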
Query with PromQL
Use any valid PromQL expression to fetch time series data. The following example retrieves the average GPU utilization per instance over a 10-day window.
curl -s -G \
"https://api.cloud.crusoe.ai/v1/projects/<project-id>/metrics/timeseries/api/v1/query_range" \
--data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL) by (instance)' \
--data-urlencode 'start=2026-02-01T00:00:00Z' \
--data-urlencode 'end=2026-02-11T00:00:00Z' \
--data-urlencode 'step=60s' \
-H "Authorization: Bearer <monitoring-token>"
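The sketch below shows one way to reason about and post-process a range query like the one above. It assumes the response follows the standard Prometheus `matrix` shape (`data.result[].values` as `[timestamp, "value"]` pairs), which this page does not confirm. The helper names `point_count` and `average_by_instance` are hypothetical.

```python
from datetime import datetime, timezone
from statistics import mean


def point_count(start, end, step_seconds):
    """Samples per series a range query returns (both endpoints inclusive)."""
    return int((end - start).total_seconds() // step_seconds) + 1


def average_by_instance(payload):
    """Average each series in a Prometheus 'matrix' result, keyed by instance."""
    averages = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Each sample is [unix_timestamp, "value-as-string"].
        averages[instance] = mean(float(value) for _, value in series["values"])
    return averages


if __name__ == "__main__":
    start = datetime(2026, 2, 1, tzinfo=timezone.utc)
    end = datetime(2026, 2, 11, tzinfo=timezone.utc)
    print(point_count(start, end, 60))   # step=60s
    print(point_count(start, end, 300))  # step=300s
```

With `step=60s`, a 10-day window yields 14,401 samples per series. A stock Prometheus server rejects range queries that would return more than 11,000 points per series. Whether this API enforces a similar cap is not stated here, so a coarser step such as `300s` may be worth trying if a long-window query is rejected.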
Importing Data into Grafana
To import metrics into Grafana, add a Prometheus data source with the following configuration:
Prometheus Server URL:
https://api.cloud.crusoe.ai/v1/projects/<project-id>/metrics/timeseries
Authentication → HTTP Headers:
Header: Authorization
Value: Bearer <monitoring-token>
Use the monitoring token you generated earlier (see the API tab for instructions on generating a token).
Considerations
Parsing Errors Caused by Special Characters
Special characters like $ in monitoring tokens can cause parsing errors during Helm deployments. To avoid this, store the token in a Kubernetes Secret and reference it using secretKeyRef:
# In your application's Deployment or Pod manifest
env:
  - name: CRUSOE_MONITORING_TOKEN
    valueFrom:
      secretKeyRef:
        name: crusoe-monitoring-token  # must match the name of your Kubernetes Secret
        key: CRUSOE_MONITORING_TOKEN  # must match the key used inside the Secret object
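Inside the container, the application can then read the token from the environment at runtime, so the raw value never passes through Helm templating. The following is a minimal sketch. The `auth_headers` helper is hypothetical, and stripping whitespace is a defensive assumption for Secrets created from files that end with a newline.

```python
import os


def auth_headers(env=None):
    """Build the Authorization header from the injected environment variable."""
    env = os.environ if env is None else env
    # Strip stray whitespace; characters such as "$" are kept verbatim.
    token = env.get("CRUSOE_MONITORING_TOKEN", "").strip()
    if not token:
        raise RuntimeError("CRUSOE_MONITORING_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```

Failing fast when the variable is missing surfaces a misnamed Secret or key at startup, not at the first rejected API call.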
What's Next
- Topology — Visualize utilization across your cluster
- Logs — Correlate metrics with system and application logs
- Telemetry Conduit — Export metrics to external platforms