Crusoe Managed Slurm

Crusoe Managed Slurm provides managed HPC cluster orchestration on Crusoe Cloud. With a single UI form or CLI command, you can provision a complete Slurm cluster backed by Crusoe's GPU-optimized infrastructure.

Managed Slurm combines Slurm's industry-standard job scheduling with Crusoe Managed Kubernetes (CMK), giving you a production-ready HPC environment with topology-aware scheduling, shared storage, and multi-user access.

How It Works

When you create a Managed Slurm cluster, Crusoe automatically provisions:

  1. A CMK cluster with all required add-ons and networking
  2. Slurm control plane — the Slurm controller and database, running as Kubernetes pods
  3. Login nodes — SSH-accessible entry points for submitting and managing jobs
  4. Shared storage — a ReadWriteMany persistent volume mounted at /home across all nodes
  5. Topology discovery — automatic network topology detection for optimal job placement

You then add node sets — groups of GPU worker nodes — to provide compute capacity. Slurm automatically discovers these nodes and makes them available for job scheduling.

The entire stack is managed by the Crusoe Slurm Operator (CSO), which runs inside your cluster's control plane and keeps all Slurm components healthy and in sync.
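
As a rough sketch, creating a cluster and attaching a node set look like the following. The subcommands are the ones documented under Key Concepts below; the flag names and values are illustrative assumptions, so consult the CLI help for the exact options.

```bash
# Create the Slurm cluster; this provisions the CMK cluster, Slurm
# control plane, login nodes, and shared storage described above.
# Flag names are illustrative assumptions, not verified options.
crusoe slurm clusters create \
  --name my-slurm-cluster \
  --ssh-public-key "$(cat ~/.ssh/id_ed25519.pub)"

# Attach a node set of GPU workers; Slurm discovers them automatically.
crusoe slurm nodesets create \
  --cluster my-slurm-cluster \
  --name h100-workers \
  --instance-type h100-80gb-sxm-ib.8x \
  --count 4
```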

Key Concepts

| Concept | Description |
| --- | --- |
| Slurm Cluster | The top-level resource. Includes the Slurm controller, login nodes, and shared storage. Created via `crusoe slurm clusters create`. |
| Node Set | A group of GPU worker nodes attached to a Slurm cluster. Each node set maps to an underlying CMK node pool. Created via `crusoe slurm nodesets create`. |
| Login Node | An SSH-accessible pod where users connect to submit and manage jobs. Multiple replicas can be configured for availability. |
| Shared Storage | A persistent filesystem mounted at `/home` on all login and worker nodes. Backed by Crusoe CSI. |
| Users & Groups | Linux users provisioned across all Slurm components via Kubernetes Custom Resources. See User Management. |
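
Once the cluster is running, users interact with it like any Slurm installation: connect to a login node over SSH and use the standard client commands. In this sketch, the address and username are placeholders for your cluster's values.

```bash
# Connect to a login node (the address is an external IP exposed by
# the load balancer; replace the placeholders with your values).
ssh myuser@<login-node-ip>

# Standard Slurm client commands work from the login node:
sinfo     # list partitions and node states
squeue    # show queued and running jobs
```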

Supported GPU Types

| GPU | Instance Type |
| --- | --- |
| 8x NVIDIA B200 180GB | `b200-180gb-sxm-ib.8x` |
| 8x NVIDIA H200 141GB | `h200-141gb-sxm-ib.8x` |
| 8x NVIDIA H100 80GB | `h100-80gb-sxm-ib.8x` |
| 8x NVIDIA A100 80GB | `a100-80gb-sxm-ib.8x` |

Support for additional GPU types is coming soon.
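
All of the instance types above expose eight GPUs per node, so a whole-node job requests all eight through Slurm's standard GRES option. A minimal batch script might look like this (the script contents are illustrative):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --nodes=1
#SBATCH --gres=gpu:8    # request all eight GPUs on the node

srun nvidia-smi -L      # list the GPUs visible to the job
```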

Prerequisites

Before you begin, ensure you have:

  • Crusoe CLI installed and authenticated
  • Project permissions to create Kubernetes clusters and Slurm resources
  • An SSH key pair for accessing login nodes
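
If you do not already have an SSH key pair, you can generate one with ssh-keygen; the public key is what you supply when creating the cluster (the file name and comment here are arbitrary):

```bash
# Generate an Ed25519 key pair for the Slurm login nodes.
ssh-keygen -t ed25519 -f ~/.ssh/crusoe_slurm -C "crusoe-slurm-login"
```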

What's Included

When you create a Managed Slurm cluster, the following components are automatically installed and managed. You do not need to install or configure these yourself:

| Component | Purpose |
| --- | --- |
| Crusoe Slurm Operator (CSO) | Manages Slurm lifecycle and configuration |
| Slinky | Runs Slurm daemons (`slurmctld`, `slurmd`) as Kubernetes pods |
| Topograph | Discovers network topology for topology-aware job scheduling (see the sketch after this table) |
| cert-manager | Certificate management for internal services |
| Crusoe CSI Driver | Provides shared filesystem storage |
| NVIDIA GPU Operator | GPU device plugin and drivers |
| NVIDIA Network Operator | InfiniBand and high-speed networking |
| Crusoe Load Balancer Controller | Exposes login nodes via an external IP |
| Slurm Login Nodes | Two `c1a.8x` login nodes are created by default; you can configure a different login node type or count. These are billable resources required for Slurm to run. |
| Slurm Controller Nodes | Three `c1a.4x` controller nodes are created to run the Slurm control plane. These are billable resources required for Slurm to run. |
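
Because Topograph publishes the network topology to Slurm, multi-node jobs can use Slurm's standard topology options. As a sketch, `--switches` asks the scheduler to place the job under at most a given number of leaf switches, optionally waiting for such a placement (the values and job binary here are illustrative):

```bash
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --gres=gpu:8
#SBATCH --switches=1@60    # prefer a single leaf switch; wait up to 60 minutes

srun ./my_training_job     # placeholder for your distributed workload
```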

Automatic Hardware Remediation

Managed Slurm clusters include AutoClusters, which automatically detects critical hardware failures such as GPUs or HCAs falling off the bus. When an issue is detected, the affected node is taken down and any running jobs are cancelled and requeued to healthy nodes. The bad node is then replaced automatically. For details on how to handle this in your jobs — including the SIGTERM grace period — see Automatic Hardware Remediation.
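
A job can prepare for this by requesting an early SIGTERM and checkpointing to shared storage before the node is reclaimed. A minimal sketch using standard Slurm options follows; the 120-second lead time and the `./train` binary are illustrative, and the actual grace period is described in Automatic Hardware Remediation.

```bash
#!/bin/bash
#SBATCH --requeue              # allow the scheduler to requeue this job
#SBATCH --signal=B:TERM@120    # deliver SIGTERM 120s early (illustrative value)

checkpoint_and_exit() {
    echo "SIGTERM received, checkpointing to shared storage..."
    # /home is the shared volume, so a requeued run can resume from it
    date > "$HOME/checkpoints/last_good"
    exit 0
}
trap checkpoint_and_exit TERM

mkdir -p "$HOME/checkpoints"

# Run the step in the background and wait, so the trap can fire mid-run.
srun ./train --resume "$HOME/checkpoints/last_good" &   # ./train is a placeholder
wait $!
```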

Next Steps