CMK Telemetry

Command Center gives you visibility into the health, performance, and behavior of your Crusoe Managed Kubernetes (CMK) clusters. Three categories of telemetry are available:

Metrics: Infrastructure metrics covering GPU (via DCGM), CPU, memory, disk, network, and InfiniBand performance are collected at 60-second intervals and retained for 30 days. Custom application metrics are also supported: expose them in Prometheus format from your pods and annotate the pods so that they are scraped alongside infrastructure metrics. A subset of metrics is viewable in the Console. The full dataset is available via the Prometheus-compatible API, Grafana, and Telemetry Conduit.

From the Console, navigate to Orchestration, select your cluster, and then select the Metrics tab to see aggregated cluster-level views. Select a node to drill into node-level detail.

Logs: journald system logs, kubelet logs, container runtime logs, and container logs are collected from each node and can be searched, filtered, and queried. Logs are retained for 7 days. In the Console, navigate to Orchestration, select your cluster, then select the Logs tab. Logs are also accessible via the LogsQL API.

NVIDIA Bug Reports: Generate an NVIDIA bug report for any node directly from the Console. Bug reports include nvidia-smi output and kernel XID logs. Helm chart v0.3.12 or higher is required. In the Console, navigate to Orchestration, select your cluster, select a node, then use the action menu to generate a report.

Note: AMD bug report generation isn't currently supported for CMK clusters.

For installation, token generation, and access method details, see Get started.

Pre-built Grafana dashboard templates for CMK clusters are available in the Crusoe solutions library. Templates cover GPU utilization, InfiniBand fabric health, power draw, XID error tracking, storage, and network.

Available Metrics

The following metrics are available at the cluster level in the Crusoe Console and via API. Node-level metrics are available in the VM telemetry view. Visit VM Telemetry for the full list of node-level metrics and their query parameters.

| Metric | Definition | Suggested Query |
| --- | --- | --- |
| Cluster TFLOPS (FP16) | Average 16-bit GPU throughput (TFLOPS) across a node pool, based on scaled Tensor Core utilization. | `avg by (nodepool) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{cluster_id="${clusterId}"}[60s]))` * theoretical max TFLOPS |
| Average GPU Utilization (%) | GPU utilization, averaged across all GPUs within a node pool. | `avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s]))` |
| Cumulative Power Usage (kW) | Power usage summed across all nodes in a node pool. | `sum by (nodepool) (avg_over_time(DCGM_FI_DEV_POWER_USAGE{cluster_id="${clusterId}"}[60s])) / 1000` |
| Average GPU Memory Utilization (%) | GPU memory utilization, averaged across all GPUs within a node pool. | `avg by (nodepool) (avg_over_time(DCGM_FI_DEV_GPU_UTIL{cluster_id="${clusterId}"}[60s]))` |
| Average CPU Utilization (%) | CPU utilization, averaged across all CPU cores within a node pool. | `(sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}", mode!="idle"}[60s]))) / (sum by (nodepool) (rate(crusoe_vm_cpu_seconds_total{cluster_id="${clusterId}"}[60s]))) * 100` |
| Aggregated VPC Network Bandwidth In (bytes per second) | The rate of data received via the VPC network interface, aggregated across all nodes in a node pool. | `sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s]))` |
| Aggregated VPC Network Bandwidth Out (bytes per second) | The rate of data transmitted via the VPC network interface, aggregated across all nodes in a node pool. | `sum by (nodepool) (rate(crusoe_vm_network_transmit_bytes_total{cluster_id="${clusterId}", device!~"lo"}[60s]))` |
| Uncorrectable ECC Error Rate | The rate of uncorrectable double-bit memory errors (DBE), summed across all GPUs in a node pool. | `sum by (nodepool) (rate(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{cluster_id="${clusterId}"}[RANGE]))` |
| Correctable ECC Error Rate | The rate of correctable single-bit memory errors (SBE), summed across all GPUs in a node pool. | `sum by (nodepool) (rate(DCGM_FI_DEV_ECC_SBE_VOL_TOTAL{cluster_id="${clusterId}"}[60s]))` |
| InfiniBand Throughput Rx (bytes per second) | The rate of data received via InfiniBand ports, aggregated across all nodes in a cluster. | `sum by(cluster_id) (rate(crusoe_ib_port_throughput_rx{cluster_id="${clusterId}"}[5m]) / 1000)` |
| InfiniBand Throughput Tx (bytes per second) | The rate of data transmitted via InfiniBand ports, aggregated across all nodes in a cluster. | `sum by(cluster_id) (rate(crusoe_ib_port_throughput_tx{cluster_id="${clusterId}"}[5m]) / 1000)` |

XID error logs with error details are also available in Orchestration > cluster > Metrics tab.
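
The suggested queries use a ${clusterId} placeholder that has to be filled in before a query is sent. The following is a minimal sketch of that substitution, using the VPC Network Bandwidth In query from the table. It assumes the Prometheus-compatible API serves the standard Prometheus instant-query path (/api/v1/query); the base URL and cluster ID shown in the usage note are hypothetical, and the real endpoint and token handling are described in Get started.

```python
from string import Template
from urllib.parse import urlencode

# Query text taken from the table above. string.Template happens to use the
# same ${name} placeholder syntax as the documented queries.
VPC_RX_QUERY = Template(
    'sum by (nodepool) (rate(crusoe_vm_network_receive_bytes_total'
    '{cluster_id="${clusterId}", device!~"lo"}[60s]))'
)


def build_query_url(base_url: str, cluster_id: str) -> str:
    """Fill in the cluster ID and return a URL-encoded instant-query URL.

    Assumption: the endpoint follows the Prometheus HTTP API layout
    (/api/v1/query with a `query` parameter).
    """
    promql = VPC_RX_QUERY.substitute(clusterId=cluster_id)
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params
```

For example, build_query_url("https://metrics.example.invalid", "my-cluster-id") returns a URL whose query parameter contains the PromQL expression with cluster_id="my-cluster-id" filled in.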

Custom Metrics

You can ingest custom application metrics alongside infrastructure metrics for end-to-end visibility from hardware to application performance. Custom metrics are available for CMK clusters only.

To expose custom metrics, serve them in Prometheus format on an HTTP endpoint and annotate your pods to enable scraping:

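apiVersion: v1
kind: Pod
metadata:
  name: my-training-job
  annotations:
    crusoe.ai/scrape: "true"            # opt this pod in to scraping
    crusoe.ai/port: "8080"              # port that serves the metrics
    crusoe.ai/path: "/my-app/metrics"   # HTTP path that serves the metrics
spec:
  containers:
    - name: my-training-job
      image: my-training-image:latest
      ports:
        - containerPort: 8080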
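
On the application side, the container needs to answer HTTP requests on the annotated port and path. The following is a minimal sketch using only the Python standard library. The metric names and values are hypothetical, and many applications would use a Prometheus client library instead of rendering the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS_PATH = "/my-app/metrics"  # must match the crusoe.ai/path annotation
METRICS_PORT = 8080               # must match the crusoe.ai/port annotation


def render_metrics(samples: dict[str, float]) -> str:
    """Render gauge samples in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical metrics, for illustration only.
    samples = {"training_step": 1200.0, "training_loss": 0.42}

    def do_GET(self):
        if self.path != METRICS_PATH:
            self.send_error(404)
            return
        body = render_metrics(self.samples).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Serve metrics until the process is stopped."""
    HTTPServer(("0.0.0.0", METRICS_PORT), MetricsHandler).serve_forever()
```

Calling serve() from the application's entry point (or from a background thread in a training job) makes the metrics available at port 8080 under /my-app/metrics.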

Custom metrics are available through the same API endpoint as infrastructure metrics and can be queried using PromQL.
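
A PromQL instant query against a Prometheus-compatible endpoint usually returns JSON in the Prometheus HTTP API shape. The sketch below shows one way to turn such a response into a simple mapping. It assumes that response shape applies here; the metric name, the pod label, and the sample payload are hypothetical.

```python
import json

# Hypothetical response body in the standard Prometheus instant-query shape.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "training_loss", "pod": "my-training-job"},
                "value": [1700000000, "0.42"],
            }
        ],
    },
})


def parse_instant_vector(payload: str, label: str) -> dict[str, float]:
    """Map the given label's value to the sample value for each series."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    data = doc["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    values = {}
    for series in data["result"]:
        _timestamp, raw_value = series["value"]  # sample values arrive as strings
        values[series["metric"].get(label, "")] = float(raw_value)
    return values
```

With the sample payload, parse_instant_vector(SAMPLE_RESPONSE, "pod") yields one entry keyed by the pod name.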

Note: Custom metrics aren't available in the Console. Use the API or Grafana to query custom metrics.

For token generation and querying metrics via API or Grafana, see Get started and Metrics.

Considerations

Parsing errors caused by special characters

To prevent parsing errors caused by special characters like $ in the monitoring token during Helm chart deployment, reference the token from a Kubernetes Secret using the secretKeyRef mechanism:

# In your application's Deployment or Pod manifest
env:
  - name: CRUSOE_MONITORING_TOKEN
    valueFrom:
      secretKeyRef:
        name: crusoe-monitoring-token   # must match the name of your Kubernetes Secret
        key: CRUSOE_MONITORING_TOKEN    # must match the key used inside the Secret object
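
The Secret that secretKeyRef points to has to exist with a matching name and key. The following is a minimal sketch that builds such a Secret manifest in Python. The token is base64-encoded in the manifest's data field, so a $ in the token never appears literally in the manifest. The namespace default and the example token are hypothetical.

```python
import base64
import json

SECRET_NAME = "crusoe-monitoring-token"  # name referenced by secretKeyRef
SECRET_KEY = "CRUSOE_MONITORING_TOKEN"   # key referenced by secretKeyRef


def build_secret_manifest(token: str, namespace: str = "default") -> dict:
    """Return a Kubernetes Secret manifest holding the monitoring token.

    Values under `data` are base64-encoded, and the base64 alphabet contains
    no `$`, so the token passes through without shell or template expansion.
    """
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": SECRET_NAME, "namespace": namespace},
        "type": "Opaque",
        "data": {SECRET_KEY: encoded},
    }


def manifest_json(token: str, namespace: str = "default") -> str:
    """Serialize the manifest as JSON, which kubectl accepts alongside YAML."""
    return json.dumps(build_secret_manifest(token, namespace), indent=2)
```

The output of manifest_json can be written to a file and applied with kubectl apply -f. Avoid committing the generated file, since base64 is an encoding and not encryption.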