Slurm Metrics

Crusoe Managed Slurm metrics provide comprehensive insights into the performance and utilization of your Slurm clusters. These metrics help you monitor cluster health, job performance, and resource utilization, and identify bottlenecks in your HPC workloads.

Crusoe Cloud collects Slurm-specific metrics alongside infrastructure metrics (GPU, CPU, memory, disk, and network) for your Slurm clusters. Metrics are collected at 60-second intervals and retained for 30 days.

The Crusoe Watch Agent automatically detects and scrapes Slurm metrics when a Slurm controller pod is present in the cluster. You can customize or disable this behavior through agent configuration.

Prerequisites

note

Clusters created through the Slurm API (the Slurm tab in the Crusoe Cloud console or the crusoe slurm CLI) have all of the following prerequisites installed out of the box, so you can disregard them.

To use Slurm Metrics, you need:

Configuring Slurm Metrics Collection

The Crusoe Watch Agent automatically scrapes Slurm metrics when it detects a Slurm controller pod in your cluster. You can customize this behavior by configuring the agent.

Default Behavior

By default, the agent scrapes the following Slurm metrics endpoints:

  • /metrics/jobs — Job-level metrics
  • /metrics/jobs-users-accts — User and account job metrics
  • /metrics/nodes — Node state and allocation metrics
  • /metrics/partitions — Partition metrics
  • /metrics/scheduler — Scheduler performance metrics
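Each of these endpoints serves metrics in the Prometheus text exposition format. As a minimal sketch of what a scrape sees, here is a simplified parser over an illustrative payload (`slurm_queue_length` appears elsewhere on this page; the `slurm_nodes` metric and its labels are assumptions, and real exposition lines may also carry optional timestamps, which this sketch ignores):

```python
# Simplified parser for Prometheus text exposition output, as served by
# the /metrics/* endpoints above. The sample payload is illustrative.

def parse_exposition(text):
    """Return {(metric_name, labels_str): float} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = "{" + labels
        else:
            name, labels = name_part, ""
        samples[(name, labels)] = float(value)
    return samples

payload = """\
# HELP slurm_queue_length Jobs waiting in queue
# TYPE slurm_queue_length gauge
slurm_queue_length 12
slurm_nodes{state="idle"} 4
"""
print(parse_exposition(payload))
```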

Customizing Metrics Collection

To customize which metrics are collected, create a values.yaml file with your preferred Slurm metrics configuration:

slurmMetrics:
  enabled: true
  paths:
    - /metrics/jobs
    - /metrics/jobs-users-accts
    - /metrics/nodes
    - /metrics/partitions
    - /metrics/scheduler

Disabling Slurm Metrics

To disable Slurm metrics collection entirely:

slurmMetrics:
  enabled: false

Applying Configuration

Apply your custom configuration using Helm:

helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml

Available Metrics

Slurm metrics are accessible through the Prometheus-compatible query API.

Slurm Job Metrics

  • Job queue length — Number of jobs waiting in the queue
  • Running jobs — Number of currently executing jobs
  • Job wait time — Time jobs spend in queue before execution
  • Job completion rate — Rate of job completions over time
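The completion-rate metric above is a rate over time. As a hedged sketch of how such a rate is derived from a cumulative counter (the counter name and values here are illustrative, not a documented metric):

```python
# Sketch: job completion rate derived from two samples of a cumulative
# completed-jobs counter, i.e. delta over elapsed time.

def completion_rate(count_t0, count_t1, seconds_elapsed):
    """Jobs completed per second between two counter samples."""
    return (count_t1 - count_t0) / seconds_elapsed

# Illustrative: 180 jobs completed over a 60-second window.
print(completion_rate(1000, 1180, 60))  # → 3.0 jobs/second
```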

Slurm Node Metrics

  • Node state — Current state of Slurm nodes (idle, allocated, down, drain)
  • Node allocation — Percentage of nodes allocated vs. idle
  • Node availability — Number of available nodes for job scheduling
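The allocation percentage above can be computed from per-state node counts. A minimal sketch, using the Slurm states listed here (the input dict is illustrative):

```python
# Sketch: node-allocation percentage from per-state node counts
# (idle, allocated, down, drain — the states listed above).

def allocation_pct(state_counts):
    """Percentage of all nodes currently in the allocated state."""
    total = sum(state_counts.values())
    return 100.0 * state_counts.get("allocated", 0) / total if total else 0.0

print(allocation_pct({"idle": 2, "allocated": 6, "down": 1, "drain": 1}))  # → 60.0
```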

Resource Utilization Metrics

In addition to Slurm-specific metrics, you can access all standard infrastructure metrics for your Slurm cluster nodes:

  • GPU utilization, memory, temperature, and power draw
  • CPU utilization and system memory
  • Network bandwidth (VPC and InfiniBand)
  • Storage I/O metrics

For a complete list of infrastructure metrics, see CMK Metrics and VM Metrics.

Accessing Slurm Metrics

Via API

Query Slurm metrics using the Prometheus-compatible API endpoint:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Use the monitoring token generated for your project (see CMK Metrics - Generate Monitoring Token).

Example query for Slurm job queue length:

curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=slurm_queue_length{cluster_id="<cluster-id>"}' \
  -H 'Authorization: Bearer <API-Key>'
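The same request can be assembled in any HTTP client; the main pitfall is URL-encoding the PromQL braces and quotes. A sketch in Python (endpoint and metric name come from this page; the placeholder IDs and token handling are yours to substitute):

```python
# Sketch: building the timeseries query URL with proper encoding.
from urllib.parse import urlencode

project_id = "<project-id>"
base = f"https://api.crusoecloud.com/v1alpha5/projects/{project_id}/metrics/timeseries"
params = {"query": 'slurm_queue_length{cluster_id="<cluster-id>"}'}
url = f"{base}?{urlencode(params)}"  # braces/quotes become %7B, %22, etc.
headers = {"Authorization": "Bearer <API-Key>"}
# Send with the HTTP client of your choice, e.g. requests.get(url, headers=headers)
print(url)
```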

Monitoring Best Practices

Tracking Job Performance

Monitor job wait times and queue lengths to identify scheduling bottlenecks. High wait times may indicate:

  • Insufficient compute resources — consider adding more node sets
  • Inefficient job packing
  • Need for additional node sets with different GPU types
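One way to act on this guidance is to alert when tail wait times cross a threshold. A hedged sketch (the wait-time samples and the 10-minute threshold are illustrative choices, not product defaults):

```python
# Sketch: flag a scheduling bottleneck when the 95th-percentile job
# wait time (seconds) exceeds a threshold.
import statistics

def p95(values):
    return statistics.quantiles(values, n=20)[-1]  # 19th of 20 cut points

def bottleneck(wait_times, threshold_s=600):
    return p95(wait_times) > threshold_s

waits = [30, 45, 60, 90, 120, 900, 1200]
print(bottleneck(waits))
```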

Resource Optimization

Use GPU and CPU utilization metrics alongside Slurm job metrics to:

  • Identify underutilized nodes
  • Optimize job resource requests
  • Right-size node sets for your workload
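Cross-referencing the two metric families above can be sketched as follows: take the set of nodes Slurm reports as allocated and flag those whose GPU utilization is low (node names, utilization values, and the 10% threshold are all illustrative assumptions):

```python
# Sketch: nodes that hold a Slurm allocation but show low GPU
# utilization — candidates for resource-request tuning.

def underutilized(allocated_nodes, gpu_util_pct, max_util=10.0):
    """Allocated nodes whose GPU utilization is below max_util percent."""
    return sorted(
        node for node in allocated_nodes
        if gpu_util_pct.get(node, 0.0) < max_util
    )

alloc = {"node-1", "node-2", "node-3"}
util = {"node-1": 92.5, "node-2": 3.0, "node-3": 0.0}
print(underutilized(alloc, util))  # → ['node-2', 'node-3']
```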

Cluster Health

Monitor node state metrics to detect:

  • Nodes in drain state requiring attention
  • Hardware failures affecting job scheduling
  • Capacity constraints

Next Steps