Slurm Metrics
Slurm Metrics is currently in preview. Please reach out to Crusoe Cloud Support to learn more.
Crusoe Managed Slurm metrics provide you with comprehensive insights into the performance and utilization of your Slurm clusters. These metrics help you monitor cluster health, job performance, resource utilization, and identify bottlenecks in your HPC workloads.
Crusoe Cloud collects Slurm-specific metrics alongside infrastructure metrics (GPU, CPU, memory, disk, and network) for your Slurm clusters. Data collection requires installing the Crusoe Watch Agent on your CMK cluster running Managed Slurm. Metrics are collected and published at 60-second intervals and retained for 30 days.
The Crusoe Watch Agent automatically detects and scrapes Slurm metrics when a Slurm controller pod is present in the cluster. You can customize or disable this behavior through agent configuration.
Prerequisites
To use Slurm Metrics, you need:
- A Crusoe Managed Slurm cluster deployed on CMK (see Crusoe Managed Slurm Overview)
- Crusoe Watch Agent version 0.3.11 or later installed on the CMK cluster (see CMK Metrics - Installing Crusoe Watch Agent)
- NVIDIA GPU Operator add-on enabled (if your Slurm workloads use GPUs)
Configuring Slurm Metrics Collection
The Crusoe Watch Agent scrapes Slurm metrics automatically once it detects a Slurm controller pod in your cluster; no additional setup is required. The sections below describe the default behavior and how to customize or disable collection.
Default Behavior
By default, the agent scrapes the following Slurm metrics endpoints:
- /metrics/jobs — Job-level metrics
- /metrics/jobs-users-accts — User and account job metrics
- /metrics/nodes — Node state and allocation metrics
- /metrics/partitions — Partition metrics
- /metrics/scheduler — Scheduler performance metrics
Customizing Metrics Collection
To customize which metrics are collected, create a values.yaml file with your preferred Slurm metrics configuration:
```yaml
slurmMetrics:
  enabled: true
  paths:
    - /metrics/jobs
    - /metrics/jobs-users-accts
    - /metrics/nodes
    - /metrics/partitions
    - /metrics/scheduler
```
Disabling Slurm Metrics
To disable Slurm metrics collection entirely:
```yaml
slurmMetrics:
  enabled: false
```
Applying Configuration
Apply your custom configuration using Helm:
```shell
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
  --namespace crusoe-system \
  -f values.yaml
```
Available Metrics
Slurm metrics are accessible through the Prometheus-compatible query API. The metrics include:
Slurm Job Metrics
- Job queue length — Number of jobs waiting in the queue
- Running jobs — Number of currently executing jobs
- Job wait time — Time jobs spend in queue before execution
- Job completion rate — Rate of job completions over time
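These job metrics come back from the query API as time series. As a minimal sketch of working with them, the helper below summarizes queue depth over a window; the response shape follows the standard Prometheus `matrix` format, the metric name `slurm_queue_length` matches the query example later on this page, and the function and field names are illustrative, not part of a documented SDK.

```python
# Sketch: summarize Slurm job-queue depth from a Prometheus-style
# "matrix" response. The helper name and sample payload are
# illustrative assumptions, not a documented Crusoe API shape.

def summarize_queue_length(matrix_result):
    """Return (min, max, avg) queue length across all returned samples."""
    values = [
        float(value)
        for series in matrix_result
        for _timestamp, value in series["values"]
    ]
    if not values:
        return None
    return min(values), max(values), sum(values) / len(values)

# Example response fragment: two samples 60 seconds apart, matching
# the 60-second collection interval described above.
sample = [
    {
        "metric": {"__name__": "slurm_queue_length", "cluster_id": "demo"},
        "values": [[1700000000, "4"], [1700000060, "10"]],
    }
]
print(summarize_queue_length(sample))  # -> (4.0, 10.0, 7.0)
```

A sustained rise in the average or maximum here is the kind of signal the best practices below suggest watching for.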
Slurm Node Metrics
- Node state — Current state of Slurm nodes (idle, allocated, down, drain)
- Node allocation — Percentage of nodes allocated vs. idle
- Node availability — Number of available nodes for job scheduling
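As a rough sketch of how the node-state metrics above compose, the snippet below derives an allocation percentage from per-state node counts. The state names mirror the Slurm node states listed above; the input mapping is illustrative, not a documented API response.

```python
# Sketch: derive allocation percentage from hypothetical per-state
# node counts (state names follow the Slurm states listed above).

def allocation_percent(state_counts):
    """Percentage of schedulable nodes that are allocated.

    Nodes in "down" or "drain" are excluded from the denominator,
    since they are not available for job scheduling.
    """
    schedulable = state_counts.get("idle", 0) + state_counts.get("allocated", 0)
    if schedulable == 0:
        return 0.0
    return 100.0 * state_counts.get("allocated", 0) / schedulable

counts = {"idle": 2, "allocated": 6, "down": 1, "drain": 1}
print(allocation_percent(counts))  # -> 75.0
```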
Resource Utilization Metrics
In addition to Slurm-specific metrics, you can access all standard infrastructure metrics for your Slurm cluster nodes:
- GPU utilization, memory, temperature, and power draw
- CPU utilization and system memory
- Network bandwidth (VPC and InfiniBand)
- Storage I/O metrics
For a complete list of infrastructure metrics, see CMK Metrics and VM Metrics.
Accessing Slurm Metrics
Via API
You can query Slurm metrics using the same Prometheus-compatible API endpoint used for CMK metrics:
https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries
Use the monitoring token generated for your project (see CMK Metrics - Generate Monitoring Token).
Example query for Slurm job queue length:
```shell
curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=slurm_queue_length{cluster_id="<cluster-id>"}' \
  -H 'Authorization: Bearer <API-Key>'
```
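The same query can be issued programmatically. The sketch below uses only the Python standard library; the endpoint and query string come from the curl example above, while the project ID, cluster ID, token, and helper name are placeholders you must supply.

```python
# Sketch: build the same metrics query in Python (stdlib only).
# Project ID, cluster ID, and token below are placeholder values.
import json
import urllib.parse
import urllib.request

def build_timeseries_request(project_id, query, token):
    """Build a GET request for the Prometheus-compatible metrics API."""
    base = (
        "https://api.crusoecloud.com/v1alpha5/projects/"
        f"{project_id}/metrics/timeseries"
    )
    url = base + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

req = build_timeseries_request(
    "my-project", 'slurm_queue_length{cluster_id="my-cluster"}', "API-KEY"
)
print(req.full_url)

# To execute the request (requires a valid monitoring token):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Using `urlencode` handles the braces and quotes in the PromQL selector, which is the same job `--data-urlencode` does in the curl example.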
Monitoring Best Practices
Tracking Job Performance
Monitor job wait times and queue lengths to identify scheduling bottlenecks. High wait times may indicate:
- Insufficient compute resources
- Inefficient job packing
- Need for additional node pools
Resource Optimization
Use GPU and CPU utilization metrics alongside Slurm job metrics to:
- Identify underutilized nodes
- Optimize job resource requests
- Right-size node pools for your workload
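A simple way to act on the first point is to flag nodes whose average GPU utilization sits below a threshold. The sketch below assumes you have already aggregated per-node utilization from the infrastructure metrics described above; the node names, input mapping, and 20% threshold are illustrative.

```python
# Sketch: flag underutilized nodes from hypothetical per-node average
# GPU utilization (%). The 20% threshold is an illustrative default.

def underutilized_nodes(avg_gpu_util, threshold=20.0):
    """Return node names whose average GPU utilization is below threshold."""
    return sorted(n for n, u in avg_gpu_util.items() if u < threshold)

util = {"node-0": 92.5, "node-1": 4.2, "node-2": 15.0}
print(underutilized_nodes(util))  # -> ['node-1', 'node-2']
```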
Cluster Health
Monitor node state metrics to detect:
- Nodes in drain state requiring attention
- Hardware failures affecting job scheduling
- Capacity constraints
What's Next
- Crusoe Managed Slurm Overview — Learn how to deploy and manage Slurm clusters
- CMK Metrics — Explore infrastructure metrics for your cluster
- Command Center — Unified observability platform