Metrics

Metrics lets you monitor GPU, CPU, memory, disk, network, and interconnect performance across your Crusoe Managed Kubernetes (CMK) clusters and Crusoe Virtual Machines (VMs). Metrics are collected automatically every 60 seconds. You can view them in the Console or query them through the Prometheus-compatible API.

You can also ingest custom application metrics for end-to-end visibility from hardware to application performance.

Metrics data is retained for 30 days and is available for Crusoe Managed Kubernetes (CMK) and Crusoe Virtual Machines (VMs).

Infrastructure Metrics

Infrastructure metrics are collected automatically per node, including GPU (DCGM), CPU, memory, disk, network, and interconnect metrics.

Cluster-level metrics — Aggregated views of cluster utilization and performance. See CMK Metrics for the complete list.
Node-level metrics — Granular views of node utilization and performance. See VM Metrics for the complete list.

Custom Metrics

You can ingest custom application metrics alongside infrastructure metrics for end-to-end visibility from hardware to application performance. Custom metrics are available for Crusoe Managed Kubernetes (CMK) only.

What Are Custom Metrics

Custom metrics are application-defined metrics exposed from your workloads. Examples include training loss, learning rate, inference latency, throughput, batch processing times, and checkpoint frequency.

Exposing Custom Metrics

To expose custom metrics, serve them in the Prometheus text format on an HTTP endpoint, then annotate your pods to enable scraping:

apiVersion: v1
kind: Pod
metadata:
  name: my-training-job # example name; replace with your own
  annotations:
    crusoe.ai/scrape: "true" # enable custom metrics collection
    crusoe.ai/port: "8080" # port on which metrics are exposed
    crusoe.ai/path: "/my-app/metrics" # path on which metrics are exposed
spec:
  containers:
    - name: my-training-job
      image: my-training-image:latest
      ports:
        - containerPort: 8080 # port the application listens on, as in crusoe.ai/port
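The manifest assumes that something inside the container already answers on port 8080 at /my-app/metrics. As a minimal sketch of that side, the following uses only the Python standard library to serve two gauges in the Prometheus text format. The metric names, their values, and the helper names (render_metrics, serve) are hypothetical and chosen for illustration; a real job would more likely use a Prometheus client library and update the values from its training loop.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS_PATH = "/my-app/metrics"  # same value as the crusoe.ai/path annotation
METRICS_PORT = 8080               # same value as the crusoe.ai/port annotation

# Hypothetical values; a real job would update these as training progresses.
METRICS = {
    "training_loss": 0.42,
    "learning_rate": 0.0003,
}


def render_metrics(metrics):
    """Render a dict of gauge values in the Prometheus text format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the annotated path serves metrics; everything else is a 404.
        if self.path != METRICS_PATH:
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=METRICS_PORT):
    """Block forever, serving the metrics endpoint on all interfaces."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() from a background thread in the job's entry point keeps the endpoint available while the main thread does the actual work.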

Custom metrics are available through the same API endpoint as infrastructure metrics and can be queried using PromQL.

Prerequisites

To use Metrics, you need a CMK cluster or VM with the Crusoe Watch Agent installed (see Installing the Crusoe Watch Agent). CMK clusters with GPU nodes also need the NVIDIA GPU Operator add-on.

Viewing Metrics

Console
API
Grafana

Viewing Metrics in the Console

You can view a curated subset of the most critical infrastructure metrics in the Console.

Cluster-level metrics:

Navigate to Orchestration, select your cluster, and open the Metrics sub-tab to view aggregated GPU utilization, CPU utilization, and memory usage across all nodes. Available time windows range from 1 hour to 30 days.

Node-level metrics:

From the cluster Metrics view, click a node or navigate from Topology.
View time-series graphs for GPU, CPU, memory, and network metrics.

Note: Custom metrics are not available in the Console. Use the API or Grafana to query custom metrics.

Querying Metrics via API

You can query both infrastructure and custom metrics via the Prometheus-compatible API.

Generate Monitoring Token

crusoe monitoring tokens create

Store the token securely. It replaces <monitoring-token> in the examples that follow, and you cannot retrieve it later.

API Endpoint

Query the metrics API endpoint:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Example: retrieve the most recent GPU utilization.

curl -G \
  "https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries" \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL' \
  -H "Authorization: Bearer <monitoring-token>"
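For scripted access, the same request can be built from Python. This is a sketch rather than an official client: it uses only the standard library, the function names are hypothetical, and it assumes the endpoint returns a JSON body, as Prometheus-compatible APIs usually do.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://api.crusoecloud.com/v1alpha5"


def build_instant_query(project_id, token, query):
    """Build the GET request that the curl example above sends."""
    url = (
        f"{API_ROOT}/projects/{project_id}/metrics/timeseries?"
        + urlencode({"query": query})  # percent-encodes the PromQL expression
    )
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_query(request, timeout=30):
    """Send the request and decode the body (assumed to be JSON). Uses the network."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes the URL and header easy to inspect before anything is sent, for example with build_instant_query("<project-id>", "<monitoring-token>", "DCGM_FI_DEV_GPU_UTIL").full_url.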

Query with PromQL

Use any valid PromQL expression to fetch time series data. The example below retrieves the average GPU utilization per instance over a 10-day window at a 60-second step:

curl -s -G \
  "https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries/api/v1/query_range" \
  --data-urlencode 'query=avg(DCGM_FI_DEV_GPU_UTIL) by (instance)' \
  --data-urlencode 'start=2026-02-01T00:00:00Z' \
  --data-urlencode 'end=2026-02-11T00:00:00Z' \
  --data-urlencode 'step=60s' \
  -H "Authorization: Bearer <monitoring-token>"
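The start, end, and step values above can also be computed instead of typed by hand. The sketch below derives them for a window ending at a given time, counts how many samples each series will contain, and averages a range-query result per instance. The helper names are hypothetical, and the response layout (data.result entries carrying metric labels and [timestamp, value] pairs) is assumed to follow the standard Prometheus HTTP API.

```python
from datetime import datetime, timedelta, timezone


def range_params(query, end, days, step_seconds=60):
    """Return query_range parameters for the `days` ending at `end` (UTC)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = end - timedelta(days=days)
    return {
        "query": query,
        "start": start.strftime(fmt),
        "end": end.strftime(fmt),
        "step": f"{step_seconds}s",
    }


def points_per_series(days, step_seconds):
    """Number of samples one series holds for the window, endpoints included."""
    return days * 86400 // step_seconds + 1


def mean_by_instance(response):
    """Average each series of a matrix result, keyed by its `instance` label."""
    means = {}
    for series in response["data"]["result"]:
        samples = [float(value) for _timestamp, value in series["values"]]
        if samples:
            means[series["metric"].get("instance", "")] = sum(samples) / len(samples)
    return means


# Same window as the curl example: 10 days ending 2026-02-11T00:00:00Z.
params = range_params(
    "avg(DCGM_FI_DEV_GPU_UTIL) by (instance)",
    end=datetime(2026, 2, 11, tzinfo=timezone.utc),
    days=10,
)
```

A 10-day window at a 60-second step is 14,401 points per series. Upstream Prometheus rejects range queries above 11,000 points per series; whether this API applies the same cap is not stated here, so if the request is rejected, a larger step such as 300s (2,881 points) is worth trying.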

Importing Data into Grafana

To import metrics into Grafana, add a Prometheus data source with the following configuration:

Prometheus Server URL:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Authentication → HTTP Headers:

Header: Authorization
Value: Bearer <monitoring-token>

Use the monitoring token you generated earlier (see the API tab for instructions on generating a token).

Troubleshooting

Parsing Errors Caused by Special Characters

Special characters like $ in monitoring tokens can cause parsing errors during Helm deployments. To avoid this, store the token in a Kubernetes Secret and reference it using secretKeyRef:

# In your application's Deployment or Pod manifest
env:
  - name: CRUSOE_MONITORING_TOKEN
    valueFrom:
      secretKeyRef:
        name: crusoe-monitoring-token # This must match the name of your Kubernetes Secret
        key: CRUSOE_MONITORING_TOKEN # This must match the key used inside the Secret object
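Once the Secret is mapped to an environment variable this way, the application reads the token at runtime and no templating step ever has to interpret it. A minimal sketch of the consuming side, assuming the variable name from the manifest above (the function name is hypothetical):

```python
import os


def auth_headers(environ=os.environ):
    """Build the Authorization header from CRUSOE_MONITORING_TOKEN."""
    # strip() drops a trailing newline, which is common when a Secret
    # is created from a file.
    token = environ.get("CRUSOE_MONITORING_TOKEN", "").strip()
    if not token:
        raise RuntimeError("CRUSOE_MONITORING_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}
```

Because the value is read as a plain string, characters such as $ pass through unchanged.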

What's Next

Topology — Visualize utilization across your cluster
Logs — Correlate metrics with system and application logs
Telemetry Relay — Export metrics to external platforms
