Slurm Metrics

Crusoe Managed Slurm metrics provide comprehensive insights into the performance and utilization of your Slurm clusters. These metrics help you monitor cluster health, job performance, and resource utilization, and identify bottlenecks in your HPC workloads.

Crusoe Cloud collects Slurm-specific metrics alongside infrastructure metrics (GPU, CPU, memory, disk, and network) for your Slurm clusters. Metrics are collected at 60-second intervals and retained for 30 days.

The Crusoe Watch Agent automatically detects and scrapes Slurm metrics when a Slurm controller pod is present in the cluster. You can customize or disable this behavior through agent configuration.

Prerequisites

note

Clusters created through the Slurm API (the Slurm tab in the Crusoe Cloud console or the crusoe slurm CLI) have all of the following prerequisites installed out of the box, so you can disregard them.

To use Slurm Metrics, you need:

Configuring Slurm Metrics Collection

The Crusoe Watch Agent automatically scrapes Slurm metrics when it detects a Slurm controller pod in your cluster. You can customize this behavior by configuring the agent.

Default Behavior

By default, the agent scrapes the following Slurm metrics endpoints:

  • /metrics/jobs — Job-level metrics
  • /metrics/jobs-users-accts — User and account job metrics
  • /metrics/nodes — Node state and allocation metrics
  • /metrics/partitions — Partition metrics
  • /metrics/scheduler — Scheduler performance metrics
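Each of these endpoints serves metrics in the Prometheus text exposition format. As a minimal sketch of what a scrape sees, here is a simplified parser over an illustrative payload (`slurm_queue_length` appears elsewhere on this page; the `slurm_nodes` metric and its labels are assumptions, and real exposition lines may also carry optional timestamps, which this sketch ignores):

```python
# Simplified parser for Prometheus text exposition output, as served by
# the /metrics/* endpoints above. The sample payload is illustrative.

def parse_exposition(text):
    """Return {(metric_name, labels_str): float} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = "{" + labels
        else:
            name, labels = name_part, ""
        samples[(name, labels)] = float(value)
    return samples

payload = """\
# HELP slurm_queue_length Jobs waiting in queue
# TYPE slurm_queue_length gauge
slurm_queue_length 12
slurm_nodes{state="idle"} 4
"""
print(parse_exposition(payload))
```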

Customizing Metrics Collection

To customize which metrics are collected, create a values.yaml file with your preferred Slurm metrics configuration:

slurmMetrics:
  enabled: true
  paths:
    - /metrics/jobs
    - /metrics/jobs-users-accts
    - /metrics/nodes
    - /metrics/partitions
    - /metrics/scheduler

Disabling Slurm Metrics

To disable Slurm metrics collection entirely:

slurmMetrics:
  enabled: false

Applying Configuration

Apply your custom configuration using Helm:

helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml

Available Metrics

Slurm metrics are accessible through the Prometheus-compatible query API.

Slurm Job Metrics

  • Job queue length — Number of jobs waiting in the queue
  • Running jobs — Number of currently executing jobs
  • Job wait time — Time jobs spend in queue before execution
  • Job completion rate — Rate of job completions over time
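The completion-rate metric above is a rate over time. As a hedged sketch of how such a rate is derived from a cumulative counter (the counter name and values here are illustrative, not a documented metric):

```python
# Sketch: job completion rate derived from two samples of a cumulative
# completed-jobs counter, i.e. delta over elapsed time.

def completion_rate(count_t0, count_t1, seconds_elapsed):
    """Jobs completed per second between two counter samples."""
    return (count_t1 - count_t0) / seconds_elapsed

# Illustrative: 180 jobs completed over a 60-second window.
print(completion_rate(1000, 1180, 60))  # → 3.0 jobs/second
```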

Slurm Node Metrics

  • Node state — Current state of Slurm nodes (idle, allocated, down, drain)
  • Node allocation — Percentage of nodes allocated vs. idle
  • Node availability — Number of available nodes for job scheduling
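The allocation percentage above can be computed from per-state node counts. A minimal sketch, using the Slurm states listed here (the input dict is illustrative):

```python
# Sketch: node-allocation percentage from per-state node counts
# (idle, allocated, down, drain — the states listed above).

def allocation_pct(state_counts):
    """Percentage of all nodes currently in the allocated state."""
    total = sum(state_counts.values())
    return 100.0 * state_counts.get("allocated", 0) / total if total else 0.0

print(allocation_pct({"idle": 2, "allocated": 6, "down": 1, "drain": 1}))  # → 60.0
```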

Resource Utilization Metrics

In addition to Slurm-specific metrics, you can access all standard infrastructure metrics for your Slurm cluster nodes:

  • GPU utilization, memory, temperature, and power draw
  • CPU utilization and system memory
  • Network bandwidth (VPC and InfiniBand)
  • Storage I/O metrics

For a complete list of infrastructure metrics, see CMK Metrics and VM Metrics.

Accessing Slurm Metrics

Via API

Query Slurm metrics using the Prometheus-compatible API endpoint:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Use the monitoring token generated for your project (see CMK Metrics - Generate Monitoring Token).

Example query for Slurm job queue length:

curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=slurm_queue_length{cluster_id="<cluster-id>"}' \
  -H 'Authorization: Bearer <API-Key>'
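The same request can be assembled in any HTTP client; the main pitfall is URL-encoding the PromQL braces and quotes. A sketch in Python (endpoint and metric name come from this page; the placeholder IDs and token handling are yours to substitute):

```python
# Sketch: building the timeseries query URL with proper encoding.
from urllib.parse import urlencode

project_id = "<project-id>"
base = f"https://api.crusoecloud.com/v1alpha5/projects/{project_id}/metrics/timeseries"
params = {"query": 'slurm_queue_length{cluster_id="<cluster-id>"}'}
url = f"{base}?{urlencode(params)}"  # braces/quotes become %7B, %22, etc.
headers = {"Authorization": "Bearer <API-Key>"}
# Send with the HTTP client of your choice, e.g. requests.get(url, headers=headers)
print(url)
```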

Monitoring Best Practices

Tracking Job Performance

Monitor job wait times and queue lengths to identify scheduling bottlenecks. High wait times may indicate:

  • Insufficient compute resources — consider adding more node sets
  • Inefficient job packing
  • Need for additional node sets with different GPU types
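One way to act on this guidance is to alert when tail wait times cross a threshold. A hedged sketch (the wait-time samples and the 10-minute threshold are illustrative choices, not product defaults):

```python
# Sketch: flag a scheduling bottleneck when the 95th-percentile job
# wait time (seconds) exceeds a threshold.
import statistics

def p95(values):
    return statistics.quantiles(values, n=20)[-1]  # 19th of 20 cut points

def bottleneck(wait_times, threshold_s=600):
    return p95(wait_times) > threshold_s

waits = [30, 45, 60, 90, 120, 900, 1200]
print(bottleneck(waits))
```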

Resource Optimization

Use GPU and CPU utilization metrics alongside Slurm job metrics to:

  • Identify underutilized nodes
  • Optimize job resource requests
  • Right-size node sets for your workload
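Cross-referencing the two metric families above can be sketched as follows: take the set of nodes Slurm reports as allocated and flag those whose GPU utilization is low (node names, utilization values, and the 10% threshold are all illustrative assumptions):

```python
# Sketch: nodes that hold a Slurm allocation but show low GPU
# utilization — candidates for resource-request tuning.

def underutilized(allocated_nodes, gpu_util_pct, max_util=10.0):
    """Allocated nodes whose GPU utilization is below max_util percent."""
    return sorted(
        node for node in allocated_nodes
        if gpu_util_pct.get(node, 0.0) < max_util
    )

alloc = {"node-1", "node-2", "node-3"}
util = {"node-1": 92.5, "node-2": 3.0, "node-3": 0.0}
print(underutilized(alloc, util))  # → ['node-2', 'node-3']
```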

Cluster Health

Monitor node state metrics to detect:

  • Nodes in drain state requiring attention
  • Hardware failures affecting job scheduling
  • Capacity constraints

Next Steps