
Metrics

Metrics lets you monitor GPU, CPU, memory, disk, network, and interconnect performance across your CMK clusters and VMs. Metrics are collected automatically every 60 seconds. You can view them in the Console or query them through a Prometheus-compatible API.

You can also ingest custom application metrics for end-to-end visibility from hardware to application performance.

Metrics data is retained for 30 days and is available for Crusoe Managed Kubernetes (CMK) and Crusoe Virtual Machines (VMs).

Infrastructure Metrics

Infrastructure metrics are collected automatically per node, including GPU (DCGM), CPU, memory, disk, network, and interconnect metrics.

  • Cluster-level metrics — Aggregated views of cluster utilization and performance. See CMK Metrics for the complete list.
  • Node-level metrics — Granular views of node utilization and performance. See VM Metrics for the complete list.

Custom Metrics

You can ingest custom application metrics alongside infrastructure metrics for end-to-end visibility from hardware to application performance. Custom metrics are available for Crusoe Managed Kubernetes (CMK) only.

What Are Custom Metrics

Custom metrics are application-defined metrics exposed from your workloads. Examples include training loss, learning rate, inference latency, throughput, batch processing times, and checkpoint frequency.

Exposing Custom Metrics

To expose custom metrics, serve them in Prometheus format on an HTTP endpoint and annotate your pods to enable scraping:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    crusoe.ai/scrape: "true"          # enable custom metrics collection
    crusoe.ai/port: "8080"            # port on which metrics are exposed
    crusoe.ai/path: "/my-app/metrics" # path on which metrics are exposed
spec:
  containers:
    - name: my-training-job
      image: my-training-image:latest
      ports:
        - containerPort: 8080
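
The application side of this setup can be sketched in Python using only the standard library. This is a minimal illustration, not the documented Crusoe method: the metric names (`training_loss`, `training_steps_total`), their values, and the `CURRENT_METRICS` structure are hypothetical, and the port and path are simply copied from the annotations above. A production workload would more commonly use a Prometheus client library to do the same job.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed values, chosen to match the pod annotations in the manifest above.
METRICS_PORT = 8080
METRICS_PATH = "/my-app/metrics"

# Hypothetical application state: name -> (type, help text, current value).
# A real training job would update these values as it runs.
CURRENT_METRICS = {
    "training_loss": ("gauge", "Most recent training loss.", 0.42),
    "training_steps_total": ("counter", "Training steps completed.", 1200),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on METRICS_PATH and return 404 elsewhere."""

    def do_GET(self):
        if self.path != METRICS_PATH:
            self.send_error(404)
            return
        body = render_metrics(CURRENT_METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve_metrics():
    """Block and serve metrics; call this from the application's entry point."""
    HTTPServer(("0.0.0.0", METRICS_PORT), MetricsHandler).serve_forever()
```

Calling `serve_metrics()` in a background thread keeps the endpoint available while the main workload runs. The port and path served here must match the `crusoe.ai/port` and `crusoe.ai/path` annotations.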

Custom metrics are available through the same API endpoint as infrastructure metrics and can be queried using PromQL.
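
As a rough sketch of what a PromQL query against such an endpoint might look like from Python, the example below builds and sends an instant query. Several details are assumptions and are not confirmed by this page: the `/api/v1/query` path follows the standard Prometheus HTTP API, which a Prometheus-compatible endpoint may or may not mirror exactly; the bearer-token authentication scheme is a guess; and the base URL, token, and `training_loss` metric name are placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, promql, token):
    """Build an instant-query request.

    Assumes the standard Prometheus HTTP API path and bearer-token auth;
    both are assumptions about the endpoint, not documented behavior.
    """
    params = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_query(base_url, promql, token):
    """Send the query and return the result list from a Prometheus-style response."""
    request = build_query_request(base_url, promql, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, `run_query(base_url, "avg(training_loss)", token)` would return the averaged value of the hypothetical `training_loss` metric, provided the endpoint behaves as assumed.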

Prerequisites

To use Metrics, you need a CMK cluster or VM with the Crusoe Watch Agent installed (see Installing the Crusoe Watch Agent). CMK clusters with GPU nodes also need the NVIDIA GPU Operator add-on.

Viewing Metrics

Viewing Metrics in the Console

You can view a curated subset of the most critical infrastructure metrics in the Console.

Cluster-level metrics:

Navigate to Orchestration > select your cluster > Metrics sub-tab to view aggregated GPU utilization, CPU utilization, and memory usage across all nodes. Available time windows range from 1 hour to 30 days.

Node-level metrics:

  1. From the cluster Metrics view, click a node or navigate from Topology.
  2. View time-series graphs for GPU, CPU, memory, and network metrics.

Note: Custom metrics are not available in the Console. Use the API or Grafana to query custom metrics.

Troubleshooting

Parsing Errors Caused by Special Characters

Special characters like $ in monitoring tokens can cause parsing errors during Helm deployments. To avoid this, store the token in a Kubernetes Secret and reference it using secretKeyRef:

# In your application's Deployment or Pod manifest
env:
  - name: CRUSOE_MONITORING_TOKEN
    valueFrom:
      secretKeyRef:
        name: crusoe-monitoring-token  # This must match the name of your Kubernetes Secret
        key: CRUSOE_MONITORING_TOKEN   # This must match the key used inside the Secret object
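
One way to create the Secret that this snippet references, without the token passing through shell or Helm templating, is to generate the manifest programmatically. The sketch below is illustrative only: `build_secret_manifest` is a hypothetical helper, and the Secret name and key are taken from the snippet above. It relies on standard Kubernetes behavior, where values under a Secret's `data` field are base64-encoded, so a `$` in the token is never interpreted.

```python
import base64
import json


def build_secret_manifest(token, name="crusoe-monitoring-token",
                          key="CRUSOE_MONITORING_TOKEN"):
    """Return a Kubernetes Secret manifest holding the token (hypothetical helper).

    Values under `data` must be base64-encoded, which also keeps special
    characters such as `$` out of any templating or shell expansion.
    """
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {key: encoded},
    }


def manifest_as_json(token):
    """Serialize the manifest as JSON, a format kubectl accepts alongside YAML."""
    return json.dumps(build_secret_manifest(token), indent=2)
```

Writing the output of `manifest_as_json(token)` to a file and applying it with `kubectl apply -f` creates the Secret. Read the token from a secure source, not a hard-coded string.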

What's Next

  • Topology — Visualize utilization across your cluster
  • Logs — Correlate metrics with system and application logs
  • Telemetry Relay — Export metrics to external platforms