Slurm Metrics
Crusoe Managed Slurm metrics provide comprehensive insights into the performance and utilization of your Slurm clusters. These metrics help you monitor cluster health, job performance, resource utilization, and identify bottlenecks in your HPC workloads.
Crusoe Cloud collects Slurm-specific metrics alongside infrastructure metrics (GPU, CPU, memory, disk, and network) for your Slurm clusters. Metrics are collected in 60-second intervals and retained for 30 days.
The Crusoe Watch Agent automatically detects and scrapes Slurm metrics when a Slurm controller pod is present in the cluster. You can customize or disable this behavior through agent configuration.
Prerequisites
Clusters created through the Slurm API (via the Slurm tab in the Crusoe Cloud console or the crusoe slurm CLI) have all of the following prerequisites installed out of the box, so you can skip this section.
To use Slurm Metrics, you need:
- A running Managed Slurm cluster (see Quickstart)
- Crusoe Watch Agent installed on the CMK cluster (see CMK Metrics - Installing Crusoe Watch Agent, agent version 0.3.11 or later)
- NVIDIA GPU Operator add-on enabled (included by default with Managed Slurm)
Configuring Slurm Metrics Collection
The Crusoe Watch Agent automatically scrapes Slurm metrics when it detects a Slurm controller pod in your cluster. You can customize this behavior by configuring the agent.
Default Behavior
By default, the agent scrapes the following Slurm metrics endpoints:
- /metrics/jobs — Job-level metrics
- /metrics/jobs-users-accts — User and account job metrics
- /metrics/nodes — Node state and allocation metrics
- /metrics/partitions — Partition metrics
- /metrics/scheduler — Scheduler performance metrics
Customizing Metrics Collection
To customize which metrics are collected, create a values.yaml file with your preferred Slurm metrics configuration:
```yaml
slurmMetrics:
  enabled: true
  paths:
    - /metrics/jobs
    - /metrics/jobs-users-accts
    - /metrics/nodes
    - /metrics/partitions
    - /metrics/scheduler
```
Disabling Slurm Metrics
To disable Slurm metrics collection entirely:
```yaml
slurmMetrics:
  enabled: false
```
Applying Configuration
Apply your custom configuration using Helm:
```shell
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
  --namespace crusoe-system -f values.yaml
```
Available Metrics
Slurm metrics are accessible through the Prometheus-compatible query API.
Slurm Job Metrics
- Job queue length — Number of jobs waiting in the queue
- Running jobs — Number of currently executing jobs
- Job wait time — Time jobs spend in queue before execution
- Job completion rate — Rate of job completions over time
Slurm Node Metrics
- Node state — Current state of Slurm nodes (idle, allocated, down, drain)
- Node allocation — Percentage of nodes allocated vs. idle
- Node availability — Number of available nodes for job scheduling
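The node metrics above can be combined into a utilization view. As a minimal sketch, the snippet below computes the allocated-node percentage from a Prometheus-style instant-query response; the response shape follows the standard Prometheus query API, but the metric labels (`state`) and the sample values are assumptions for illustration.

```python
# Sketch: compute node allocation from a Prometheus-style query response.
# The response shape follows the standard Prometheus instant-query format;
# the "state" label and the sample counts are illustrative assumptions.
sample_response = {
    "status": "success",
    "data": {"result": [
        {"metric": {"state": "allocated"}, "value": [1700000000, "6"]},
        {"metric": {"state": "idle"}, "value": [1700000000, "2"]},
    ]},
}

def allocation_pct(resp):
    # Map each node state to its count, then take allocated / total.
    counts = {r["metric"]["state"]: float(r["value"][1])
              for r in resp["data"]["result"]}
    total = sum(counts.values())
    return 100 * counts.get("allocated", 0.0) / total if total else 0.0

print(allocation_pct(sample_response))  # 75.0
```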
Resource Utilization Metrics
In addition to Slurm-specific metrics, you can access all standard infrastructure metrics for your Slurm cluster nodes:
- GPU utilization, memory, temperature, and power draw
- CPU utilization and system memory
- Network bandwidth (VPC and InfiniBand)
- Storage I/O metrics
For a complete list of infrastructure metrics, see CMK Metrics and VM Metrics.
Accessing Slurm Metrics
Via API
Query Slurm metrics using the Prometheus-compatible API endpoint:
https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Use the monitoring token generated for your project (see CMK Metrics - Generate Monitoring Token).
Example query for Slurm job queue length:
```shell
curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=slurm_queue_length{cluster_id="<cluster-id>"}' \
  -H 'Authorization: Bearer <monitoring-token>'
```
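The same request can be issued from a script. This is a minimal sketch that builds the query URL programmatically; the endpoint and the `slurm_queue_length` metric come from this page, while the helper name and its parameters are illustrative.

```python
import urllib.parse

# Illustrative helper: build the timeseries query URL shown above.
# The endpoint path and metric name come from this page; the function
# itself is a sketch, not part of any Crusoe SDK.
BASE = "https://api.crusoecloud.com/v1alpha5/projects/{project_id}/metrics/timeseries"

def build_metrics_url(project_id: str, metric: str, labels: dict) -> str:
    # Render a PromQL-style selector like metric{key="value"} and URL-encode it.
    selector = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    query = f"{metric}{{{selector}}}"
    return BASE.format(project_id=project_id) + "?" + urllib.parse.urlencode({"query": query})

url = build_metrics_url("my-project", "slurm_queue_length", {"cluster_id": "my-cluster"})
print(url)
```

Send the resulting URL with your monitoring token in the `Authorization: Bearer` header, as in the curl example.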
Monitoring Best Practices
Tracking Job Performance
Monitor job wait times and queue lengths to identify scheduling bottlenecks. High wait times may indicate:
- Insufficient compute resources — consider adding more node sets
- Inefficient job packing
- Need for additional node sets with different GPU types
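As a sketch of how such a bottleneck check might look, the snippet below flags high tail wait times from per-job wait samples; the sample values and the 600-second threshold are illustrative, not Crusoe defaults.

```python
# Sketch: flag scheduling bottlenecks from job wait-time samples.
# The sample waits and the 600-second threshold are illustrative.
wait_seconds = [30, 45, 1200, 900, 60, 1500]  # hypothetical per-job waits

def p95(values):
    # Nearest-rank 95th percentile over the sorted samples.
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(0.95 * (len(ordered) - 1)))
    return ordered[idx]

avg_wait = sum(wait_seconds) / len(wait_seconds)
if p95(wait_seconds) > 600:
    print(f"p95 wait {p95(wait_seconds)}s exceeds threshold; consider adding node sets")
```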
Resource Optimization
Use GPU and CPU utilization metrics alongside Slurm job metrics to:
- Identify underutilized nodes
- Optimize job resource requests
- Right-size node sets for your workload
Cluster Health
Monitor node state metrics to detect:
- Nodes in drain state requiring attention
- Hardware failures affecting job scheduling
- Capacity constraints
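A health check along these lines can be sketched as follows: summarize node states and single out nodes in drain or down. The state names mirror the list above; the node names and sample data are made up.

```python
# Sketch: summarize Slurm node states and flag nodes needing attention.
# The states mirror those listed above; the sample data is hypothetical.
from collections import Counter

node_states = {
    "node-1": "allocated",
    "node-2": "idle",
    "node-3": "drain",
    "node-4": "down",
}

counts = Counter(node_states.values())
needs_attention = sorted(n for n, s in node_states.items() if s in ("drain", "down"))
allocation_pct = 100 * counts["allocated"] / len(node_states)
print(counts, needs_attention, f"{allocation_pct:.0f}% allocated")
```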
Next Steps
- Quickstart — Set up your Slurm cluster
- User Management — Add users and groups to your cluster
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
- For Slurm command reference, see the official Slurm documentation