Crusoe Managed Slurm
Crusoe Managed Slurm provides managed HPC cluster orchestration on Crusoe Cloud. With a single UI form or CLI command, you can provision a complete Slurm cluster backed by Crusoe's GPU-optimized infrastructure.
Managed Slurm combines Slurm's industry-standard job scheduling with Crusoe Managed Kubernetes (CMK), giving you a production-ready HPC environment with topology-aware scheduling, shared storage, and multi-user access.
How It Works
When you create a Managed Slurm cluster, Crusoe automatically provisions:
- A CMK cluster with all required add-ons and networking
- Slurm control plane — the Slurm controller and database, running as Kubernetes pods
- Login nodes — SSH-accessible entry points for submitting and managing jobs
- Shared storage — a ReadWriteMany persistent volume mounted at `/home` across all nodes
- Topology discovery — automatic network topology detection for optimal job placement
You then add node sets — groups of GPU worker nodes — to provide compute capacity. Slurm automatically discovers these nodes and makes them available for job scheduling.
The entire stack is managed by the Crusoe Slurm Operator (CSO), which runs inside your cluster's control plane and keeps all Slurm components healthy and in sync.
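Once a node set has been discovered, you can see it from a login node with standard Slurm commands. A minimal sketch follows; node and partition names will depend on your node sets:

```bash
# From a login node: list partitions and the state of the worker nodes
# that Slurm discovered from your node sets.
sinfo

# Inspect a single worker node, including its GPU (gres) configuration.
# <node-name> is a placeholder for a name reported by sinfo.
scontrol show node <node-name>
```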
Key Concepts
| Concept | Description |
|---|---|
| Slurm Cluster | The top-level resource. Includes the Slurm controller, login nodes, and shared storage. Created via `crusoe slurm clusters create`. |
| Node Set | A group of GPU worker nodes attached to a Slurm cluster. Each node set maps to an underlying CMK node pool. Created via `crusoe slurm nodesets create`. |
| Login Node | An SSH-accessible pod where users connect to submit and manage jobs. Multiple replicas can be configured for availability. |
| Shared Storage | A persistent filesystem mounted at `/home` on all login and worker nodes. Backed by Crusoe CSI. |
| Users & Groups | Linux users provisioned across all Slurm components via Kubernetes Custom Resources. See User Management. |
Supported GPU Types
| GPU | Instance Type |
|---|---|
| 8x NVIDIA B200 180GB | b200-180gb-sxm-ib.8x |
| 8x NVIDIA H200 141GB | h200-141gb-sxm-ib.8x |
| 8x NVIDIA H100 80GB | h100-80gb-sxm-ib.8x |
| 8x NVIDIA A100 80GB | a100-80gb-sxm-ib.8x |
Support for additional GPU types is coming soon.
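Each of these instance types exposes eight GPUs per node, so a whole-node job typically requests all eight. A minimal batch script sketch; the partition is left at its default here, and the gres spelling may differ depending on how your node sets are configured:

```bash
#!/bin/bash
# Request one full GPU worker node (8 GPUs) and print the visible devices.
#SBATCH --job-name=gpu-check
#SBATCH --nodes=1
#SBATCH --gres=gpu:8
#SBATCH --time=00:05:00

srun nvidia-smi
```

Submit it from a login node with `sbatch gpu-check.sh`, then check progress with `squeue` and the generated `slurm-<jobid>.out` file.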
Prerequisites
Before you begin, ensure you have:
- Crusoe CLI installed and authenticated
- Project permissions to create Kubernetes clusters and Slurm resources
- An SSH key pair for accessing login nodes
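If you still need a key pair, a standard one works; the key file name and the login-node address below are placeholders:

```bash
# Generate an SSH key pair for login-node access.
ssh-keygen -t ed25519 -f ~/.ssh/crusoe_slurm -C "crusoe-slurm"

# Once the cluster is up, connect to a login node.
# <user> and <login-node-ip> are placeholders that depend on your configuration.
ssh -i ~/.ssh/crusoe_slurm <user>@<login-node-ip>
```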
What's Included
When you create a Managed Slurm cluster, the following components are automatically installed and managed. You do not need to install or configure these yourself:
| Component | Purpose |
|---|---|
| Crusoe Slurm Operator (CSO) | Manages Slurm lifecycle and configuration |
| Slinky | Runs Slurm daemons (slurmctld, slurmd) as Kubernetes pods |
| Topograph | Discovers network topology for topology-aware job scheduling |
| cert-manager | Certificate management for internal services |
| Crusoe CSI Driver | Provides shared filesystem storage |
| NVIDIA GPU Operator | GPU device plugin and drivers |
| NVIDIA Network Operator | InfiniBand and high-speed networking |
| Crusoe Load Balancer Controller | Exposes login nodes via external IP |
| Slurm Login Nodes | Two c1a.8x login nodes are created by default; you can choose a different instance type or number of login nodes. These are billable resources required for Slurm to run. |
| Slurm Controller Nodes | Three c1a.4x Slurm controller nodes are created to run the Slurm control plane. These are billable resources required for Slurm to run. |
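These components run as pods inside the underlying CMK cluster. If you have kubectl access (see Advanced: Kubernetes Operations below), a quick sketch for confirming they are running, with the caveat that the actual namespaces and pod names in your cluster may differ:

```bash
# List managed component pods across all namespaces.
# The grep pattern is illustrative; namespace and pod names may vary.
kubectl get pods -A | grep -Ei 'slurm|gpu-operator|network-operator|cert-manager|topograph'
```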
Automatic Hardware Remediation
Managed Slurm clusters include AutoClusters, which automatically detects critical hardware failures such as GPUs or HCAs falling off the bus. When an issue is detected, the affected node is taken down and any running jobs are cancelled and requeued to healthy nodes. The bad node is then replaced automatically. For details on how to handle this in your jobs — including the SIGTERM grace period — see Automatic Hardware Remediation.
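In practice this means a batch script should be prepared to receive SIGTERM and clean up within the grace period before it is requeued. A minimal sketch, assuming the grace period is delivered as SIGTERM to your batch script as described above; `train.sh` and `checkpoint.sh` are placeholders for your own workload and checkpoint logic:

```bash
#!/bin/bash
# Allow Slurm to requeue this job if its node is remediated.
#SBATCH --requeue

# On SIGTERM, write a checkpoint before the job is killed and requeued.
trap 'echo "caught SIGTERM, checkpointing"; ./checkpoint.sh; exit 143' TERM

# Run the workload in the background and wait, so the trap can fire promptly.
srun ./train.sh &
wait
```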
Next Steps
- Quickstart — Create your first Slurm cluster and run a GPU job
- User Management — Add users and groups to your cluster
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Slurm Metrics — Monitor cluster health and job performance
- Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
- For Slurm command reference, see the official Slurm documentation