Slurm Metrics

note

Slurm Metrics is currently in preview. Please reach out to Crusoe Cloud Support to learn more.

Crusoe Managed Slurm metrics provide comprehensive insights into the performance and utilization of your Slurm clusters. These metrics help you monitor cluster health, job performance, and resource utilization, and identify bottlenecks in your HPC workloads.

Crusoe Cloud collects Slurm-specific metrics alongside infrastructure metrics (GPU, CPU, memory, disk, and network) for your Slurm clusters. Data collection requires installing the Crusoe Watch Agent on your CMK cluster running Managed Slurm. Metrics are collected and published in 60-second intervals, and are retained for 30 days.

The Crusoe Watch Agent automatically detects and scrapes Slurm metrics when a Slurm controller pod is present in the cluster. You can customize or disable this behavior through agent configuration.

Prerequisites

To use Slurm Metrics, you need:

  • A CMK cluster running Crusoe Managed Slurm
  • The Crusoe Watch Agent installed on the cluster
  • A monitoring token for your project, to query metrics via the API

Configuring Slurm Metrics Collection

The Crusoe Watch Agent automatically scrapes Slurm metrics when it detects a Slurm controller pod in your cluster. You can customize this behavior by configuring the agent.

Default Behavior

By default, the agent scrapes the following Slurm metrics endpoints:

  • /metrics/jobs — Job-level metrics
  • /metrics/jobs-users-accts — User and account job metrics
  • /metrics/nodes — Node state and allocation metrics
  • /metrics/partitions — Partition metrics
  • /metrics/scheduler — Scheduler performance metrics
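Each of these endpoints serves metrics in the standard Prometheus text exposition format. As a minimal parsing sketch (the sample metric names and labels below are hypothetical illustrations, not the exporter's actual output):

```python
import re

# Matches a Prometheus sample line: name{labels} value
LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples

# Hypothetical scrape output from a nodes endpoint
sample = """\
# HELP slurm_nodes_idle Number of idle nodes
# TYPE slurm_nodes_idle gauge
slurm_nodes_idle{partition="gpu"} 4
slurm_nodes_alloc{partition="gpu"} 12
"""
print(parse_metrics(sample))
```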

Customizing Metrics Collection

To customize which metrics are collected, create a values.yaml file with your preferred Slurm metrics configuration:

slurmMetrics:
  enabled: true
  paths:
    - /metrics/jobs
    - /metrics/jobs-users-accts
    - /metrics/nodes
    - /metrics/partitions
    - /metrics/scheduler

Disabling Slurm Metrics

To disable Slurm metrics collection entirely:

slurmMetrics:
  enabled: false

Applying Configuration

Apply your custom configuration using Helm:

helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml

Available Metrics

Slurm metrics are accessible through the Prometheus-compatible query API. The metrics include:

Slurm Job Metrics

  • Job queue length — Number of jobs waiting in the queue
  • Running jobs — Number of currently executing jobs
  • Job wait time — Time jobs spend in queue before execution
  • Job completion rate — Rate of job completions over time
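To illustrate what the queue-length and wait-time metrics capture, the following sketch computes both from a set of made-up job records (the job data here is illustrative, not actual Slurm output):

```python
from datetime import datetime, timedelta

# Hypothetical job records: (submit_time, start_time or None if still queued)
now = datetime(2025, 1, 1, 12, 0, 0)
jobs = [
    (now - timedelta(minutes=30), now - timedelta(minutes=25)),  # 5 min wait
    (now - timedelta(minutes=20), now - timedelta(minutes=5)),   # 15 min wait
    (now - timedelta(minutes=10), None),                         # still queued
]

# Job queue length: jobs that have not started yet
queue_length = sum(1 for _, start in jobs if start is None)

# Average job wait time (minutes) over jobs that have started
waits = [(start - submit).total_seconds() / 60 for submit, start in jobs if start]
avg_wait = sum(waits) / len(waits)

print(queue_length, avg_wait)  # 1 queued job; 10-minute average wait
```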

Slurm Node Metrics

  • Node state — Current state of Slurm nodes (idle, allocated, down, drain)
  • Node allocation — Percentage of nodes allocated vs. idle
  • Node availability — Number of available nodes for job scheduling
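The allocation percentage and availability figures above derive directly from per-state node counts; a minimal sketch with hypothetical counts:

```python
# Hypothetical node-state counts as reported by the node state metric
states = {"idle": 4, "allocated": 12, "down": 1, "drain": 1}

total = sum(states.values())
# Percentage of nodes currently allocated to jobs
alloc_pct = 100 * states["allocated"] / total
# Nodes available for job scheduling (idle and healthy)
available = states["idle"]

print(f"{alloc_pct:.1f}% allocated, {available} nodes available")
```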

Resource Utilization Metrics

In addition to Slurm-specific metrics, you can access all standard infrastructure metrics for your Slurm cluster nodes:

  • GPU utilization, memory, temperature, and power draw
  • CPU utilization and system memory
  • Network bandwidth (VPC and InfiniBand)
  • Storage I/O metrics

For a complete list of infrastructure metrics, see CMK Metrics and VM Metrics.

Accessing Slurm Metrics

Via API

You can query Slurm metrics using the same Prometheus-compatible API endpoint used for CMK metrics:

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries

Use the monitoring token generated for your project (see CMK Metrics - Generate Monitoring Token).

Example query for Slurm job queue length:

curl -G 'https://api.crusoecloud.com/v1alpha5/projects/<project-id>/metrics/timeseries' \
  --data-urlencode 'query=slurm_queue_length{cluster_id="<cluster-id>"}' \
  -H 'Authorization: Bearer <API-Key>'
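The same query can be issued from Python with only the standard library. This is a sketch: the endpoint and metric name come from the documentation above, while the placeholder IDs and token are values you must substitute yourself.

```python
import urllib.parse
import urllib.request

def build_timeseries_request(project_id, query, api_key):
    """Build a GET request for the Prometheus-compatible timeseries endpoint."""
    base = f"https://api.crusoecloud.com/v1alpha5/projects/{project_id}/metrics/timeseries"
    # urlencode handles escaping of the PromQL braces and quotes
    url = base + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})

req = build_timeseries_request(
    "<project-id>",
    'slurm_queue_length{cluster_id="<cluster-id>"}',
    "<API-Key>",
)
print(req.full_url)
# Send with: urllib.request.urlopen(req)
```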

Monitoring Best Practices

Tracking Job Performance

Monitor job wait times and queue lengths to identify scheduling bottlenecks. High wait times may indicate:

  • Insufficient compute resources
  • Inefficient job packing
  • Need for additional node pools
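One simple way to act on these signals is a threshold check over recent wait-time samples; a sketch with illustrative numbers (the threshold and alert fraction are examples to tune for your workload, not recommendations):

```python
# Hypothetical wait-time samples (minutes) from recently started jobs
wait_times = [2, 3, 45, 60, 50, 4, 55]

WAIT_THRESHOLD_MIN = 30   # illustrative threshold
ALERT_FRACTION = 0.5      # alert if more than half of jobs waited too long

slow = [w for w in wait_times if w > WAIT_THRESHOLD_MIN]
bottleneck = len(slow) / len(wait_times) > ALERT_FRACTION

print("scheduling bottleneck suspected" if bottleneck else "queue looks healthy")
```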

Resource Optimization

Use GPU and CPU utilization metrics alongside Slurm job metrics to:

  • Identify underutilized nodes
  • Optimize job resource requests
  • Right-size node pools for your workload
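Cross-referencing the two metric families can be sketched as follows (the node names and utilization figures are made up for illustration):

```python
# Hypothetical per-node data: Slurm reports the node's state, while GPU
# metrics report average utilization over the same window.
nodes = {
    "gpu-node-1": {"slurm_state": "allocated", "gpu_util_pct": 92},
    "gpu-node-2": {"slurm_state": "allocated", "gpu_util_pct": 11},
    "gpu-node-3": {"slurm_state": "idle", "gpu_util_pct": 0},
}

# Nodes Slurm considers busy but whose GPUs are mostly idle: candidates
# for right-sizing job resource requests.
underutilized = [
    name for name, m in nodes.items()
    if m["slurm_state"] == "allocated" and m["gpu_util_pct"] < 20
]
print(underutilized)
```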

Cluster Health

Monitor node state metrics to detect:

  • Nodes in drain state requiring attention
  • Hardware failures affecting job scheduling
  • Capacity constraints

What's Next