Set Up Slurm on Kubernetes

This guide walks through setting up Managed Slurm as a pure Kubernetes add-on on an existing Crusoe Managed Kubernetes (CMK) cluster. Use this approach if you want full control over the underlying CMK cluster, want to manage Slurm components alongside other Kubernetes workloads, or already have GitOps/Helm-based infrastructure tooling that you want to extend to Slurm.

If you'd rather have Crusoe provision and manage everything end-to-end through a single command, see the Quickstart instead — it uses the crusoe slurm CLI to create the cluster, login nodes, and worker pools as a managed bundle. The two paths converge to the same end state; this guide just gives you direct kubectl/Helm access at every step.

note

In this mode you create the CMK cluster and your GPU compute node pools yourself, and all Slurm configuration lives in Kubernetes Custom Resources. You do not use the crusoe slurm clusters create or crusoe slurm nodesets create commands. CSO provisions the controller and login node pools automatically as part of SlurmCluster reconciliation.

How It Works

The Crusoe Slurm Operator (CSO) runs inside your CMK cluster and reconciles a SlurmCluster Custom Resource. When you apply the CR, CSO automatically provisions and configures:

  • The controller node pool (3 × c1a.4x for HA — managed by CSO, not user-configurable)
  • The login node pool (size and instance type configurable in the CR)
  • The Slinky operator (Slurm-on-Kubernetes from SchedMD)
  • The Slurm controller (slurmctld) and login pods
  • Topograph for topology-aware scheduling
  • The shared /home PersistentVolumeClaim
  • A LoadBalancer service for SSH access to login nodes

For compute workers, you create the GPU node pools yourself — CSO can't infer hardware type or scale automatically. The operator then watches for Kubernetes nodes labeled slurm.crusoe.ai/compute-node-type=true and automatically creates Slinky NodeSets for each underlying node pool.

Users and groups are managed via SlurmUser and SlurmUserGroup CRs, the same as in the CLI/API path. See User Management for the full reference.

Prerequisites

  • The Crusoe CLI (latest version) installed and authenticated, used here only for CMK cluster and nodepool creation
  • kubectl installed
  • helm v3.11+ installed
  • An SSH public key for accessing login nodes
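
A quick sanity check of the local tooling (output will vary by version; the key path below is illustrative):

kubectl version --client
helm version
# Generate a key pair if you don't already have one
ssh-keygen -t ed25519 -f ~/.ssh/slurm-login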

Step 1: Create the CMK Cluster

Create a Managed Kubernetes cluster with the Slurm-supporting add-ons enabled. The required add-ons are:

  • crusoe_managed_slurm — installs the Crusoe Slurm Operator (CSO) on the cluster
  • nvidia_gpu_operator — exposes GPUs as schedulable resources
  • nvidia_network_operator — enables InfiniBand networking
  • crusoe_csi — provides the shared /home filesystem
  • autoclusters — automatic hardware remediation (recommended for production)

crusoe kubernetes clusters create \
  --name my-slurm-cluster \
  --location us-east1-a \
  --add-ons crusoe_managed_slurm,nvidia_gpu_operator,nvidia_network_operator,crusoe_csi,autoclusters

Including crusoe_managed_slurm in --add-ons installs CSO automatically — you do not need to install it separately via Helm. CSO in turn installs cert-manager, the Slinky operator, Topograph, and the Crusoe Load Balancer Controller as it reconciles your SlurmCluster CR in Step 3.

Once the cluster is RUNNING, configure kubectl:

crusoe kubernetes clusters get-credentials my-slurm-cluster
kubectl cluster-info
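
Before moving on, you can confirm the add-on workloads are running (the grep pattern is a loose, illustrative filter; exact namespaces and pod names may differ):

kubectl get pods -A | grep -Ei 'slurm|gpu-operator|network-operator'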

Step 2: Create the slurm Namespace

All Slurm Custom Resources live in a single namespace. Only one SlurmCluster CR is supported per namespace.

kubectl create namespace slurm

Step 3: Apply the SlurmCluster Custom Resource

Applying a SlurmCluster CR triggers CSO to install the full Slurm stack — cert-manager, the Slinky operator, Topograph, the Crusoe Load Balancer Controller, the Slurm controller, login pods, and the shared /home volume. The controller node pool (3 × c1a.4x nodes), the login node pool, and the shared /home volume are all provisioned automatically as part of this step; the login pool and /home volume are configurable in the CR below.

Create a file named slurm-cluster.yaml:

apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmCluster
metadata:
  name: my-slurm-cluster
  namespace: slurm
spec:
  clusterVersion: "25.11.2-cmk.12"
  loginSet:
    replicas: 2
    instanceType: c1a.8x
    # Optional. CIDR ranges allowed to reach the login LoadBalancer.
    # Defaults to 0.0.0.0/0 if omitted.
    firewallRuleSourceRanges:
      - 0.0.0.0/0
  rootSSHPubKeys:
    - "ssh-ed25519 AAAA... user@host"
  # Shared /home volume. Default 10Ti. Can be increased later but never decreased.
  homeVolumeSize: "10Ti"

Apply it:

kubectl apply -f slurm-cluster.yaml

Spec Reference

| Field | Required | Default | Notes |
| --- | --- | --- | --- |
| clusterVersion | Yes | — | Slurm version to install (e.g. 25.05) |
| loginSet.replicas | Yes | — | Number of login pods |
| loginSet.instanceType | No | c1a.8x | Login node instance type. Immutable after creation. |
| loginSet.firewallRuleSourceRanges | No | 0.0.0.0/0 | CIDR ranges allowed to reach the login LoadBalancer service |
| rootSSHPubKeys | Yes | — | SSH public keys authorized for the root user on login and worker nodes |
| homeVolumeSize | No | 10Ti | Shared /home volume size. Format <n>Ti (e.g. 10Ti, 50Ti). Can be increased after creation but never decreased. |
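
For example, to lock SSH down to a corporate range instead of the open default, set (the CIDR below is illustrative):

firewallRuleSourceRanges:
  - 198.51.100.0/24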

Watching the Reconciliation

Cluster bring-up takes ~10 minutes once the node pools are ready. Watch the phases:

kubectl get slurmclusters -n slurm -w

The STATUS column moves through Provisioning → Installing → Ready. For detail on which component is currently being configured:

kubectl describe slurmcluster my-slurm-cluster -n slurm

The conditions list (CertManagerReady, SlinkyReady, TopographReady, ControllerReady, LoginReady, etc.) indicates progress.

Once the cluster shows Ready, the LOGIN column on kubectl get slurmclusters displays 2/2 and ENDPOINT shows the external IP of the login LoadBalancer.
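
Illustrative output for a healthy cluster (exact columns and values will differ in your environment):

kubectl get slurmclusters -n slurm
NAME               STATUS   LOGIN   ENDPOINT       AGE
my-slurm-cluster   Ready    2/2     203.0.113.10   14m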

At this point you have a healthy Slurm control plane with login nodes, but no compute capacity yet. Add compute node pools next.

Step 4: Add Compute Node Pools

Compute pools provide the GPU capacity that runs your jobs. Unlike controller and login pools, compute pools are not auto-created by CSO — you create them yourself so you can choose the GPU type, count, and InfiniBand configuration. You can create multiple pools (for example, one per GPU type) and the operator will create a Slinky NodeSet for each one automatically. CPU node pools can also be created, but do not need the --ib-partition-id or --ephemeral-storage-for-containerd flags set.

Each compute pool needs the slurm.crusoe.ai/compute-node-type=true label and an InfiniBand partition ID:

crusoe kubernetes nodepools create \
  --name slurm-h100-workers \
  --cluster-name my-slurm-cluster \
  --type h100-80gb-sxm-ib.8x \
  --count 4 \
  --ib-partition-id <your-ib-partition-id> \
  --ephemeral-storage-for-containerd true \
  --node-labels "slurm.crusoe.ai/compute-node-type=true"

CMK automatically applies crusoe.ai/nodepool.id and crusoe.ai/nodepool.name labels to every node based on the parent node pool. The Slurm operator uses the nodepool.id label to group compute nodes into NodeSets — one NodeSet per pool.

tip

You can add or remove compute pools at any time. The operator detects new pools via the node labels and creates additional NodeSets without any CR changes on your part.

Verify Compute Nodes Were Discovered

The operator watches for compute nodes and creates a Slinky NodeSet for each unique crusoe.ai/nodepool.id:

kubectl get nodesets -n slurm

You should see one entry per compute node pool, with READY showing <n>/<n> once Slurm has registered all the nodes.

You can also verify the labels on your compute nodes directly:

kubectl get nodes -L slurm.crusoe.ai/compute-node-type,crusoe.ai/nodepool.id

Each compute node should show true for compute-node-type and a non-empty nodepool.id.

Step 5: Add Users (Optional)

To allow non-root users to SSH in and submit jobs, apply SlurmUser and SlurmUserGroup CRs. The full reference, including UID/GID assignment and POSIX naming rules, is in User Management. Brief example:

apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmUser
metadata:
  name: alice
  namespace: slurm
spec:
  clusterReference: my-slurm-cluster
  fullName: "Alice Johnson"
  sshPublicKeys:
    - "ssh-ed25519 AAAA... alice@laptop"

Save this as user-alice.yaml and apply it:

kubectl apply -f user-alice.yaml
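
To confirm the CR was accepted (assuming the CRD exposes the plural resource name slurmusers):

kubectl get slurmusers -n slurm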

The user can SSH in within ~1 minute:

ssh -i <path-to-private-key> alice@<login-node-endpoint>

You can find the login node endpoint with:

kubectl get slurmcluster my-slurm-cluster -n slurm -o jsonpath='{.status.loginEndpoint}'

Step 6: Run a Job

Once SSH'd into a login node, standard Slurm commands work as usual:

sinfo                      # Show available nodes
srun --gpus=8 nvidia-smi   # Quick interactive GPU test
sbatch my-job.batch        # Submit a batch job
squeue                     # Check the queue
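
A minimal my-job.batch to start from — a sketch using standard Slurm directives (node and GPU counts are illustrative; size them to your pools):

#!/bin/bash
#SBATCH --job-name=smoke-test
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --output=%x-%j.out

# Print the hostname of every allocated node, then list its visible GPUs
srun hostname
srun nvidia-smi -L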

For a multi-node NCCL test and other examples, see Advanced: Kubernetes Operations — Running NCCL Tests.

Adding More Compute Pools

To add capacity, simply create another node pool with the compute label:

crusoe kubernetes nodepools create \
  --name slurm-a100-workers \
  --cluster-name my-slurm-cluster \
  --type a100-80gb-sxm-ib.8x \
  --count 4 \
  --ib-partition-id <your-ib-partition-id> \
  --node-labels "slurm.crusoe.ai/compute-node-type=true"

Within a few seconds, a new Slinky NodeSet appears in the slurm namespace and the new nodes register with slurmctld. Run sinfo from a login node to confirm.
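
You can run that check without an interactive session (assuming root SSH access as configured via rootSSHPubKeys):

ssh root@<login-node-endpoint> sinfo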

Updating the Cluster

Most spec fields on SlurmCluster are immutable after creation. The supported updates are:

  • homeVolumeSize — increase only (PVC limitation)
  • loginSet.replicas — scale up or down within the available login node count
  • rootSSHPubKeys — propagates to running login and worker pods within ~1 minute
  • loginSet.firewallRuleSourceRanges — updates the LoadBalancer firewall

Apply changes with kubectl apply -f slurm-cluster.yaml or kubectl edit slurmcluster my-slurm-cluster -n slurm.
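
For a one-off change such as growing the home volume, a merge patch also works — a sketch (remember the size can only increase):

kubectl patch slurmcluster my-slurm-cluster -n slurm \
  --type merge -p '{"spec":{"homeVolumeSize":"20Ti"}}'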

Deleting the Cluster

Delete the SlurmCluster CR. The operator will tear down all components and clean up the LoadBalancer:

kubectl delete slurmcluster my-slurm-cluster -n slurm

danger

Deleting the SlurmCluster does not delete the shared /home volume. It must be manually deleted via the Crusoe Console or CLI.

When the SlurmCluster is deleted, CSO automatically removes the controller and login node pools it created. You then need to delete the compute node pools (or repurpose them for other workloads) and finally the CMK cluster itself:

crusoe kubernetes nodepools delete slurm-h100-workers --cluster-name my-slurm-cluster
crusoe kubernetes clusters delete my-slurm-cluster

Troubleshooting

Cluster stuck in Provisioning or Installing

Look at the conditions:

kubectl describe slurmcluster my-slurm-cluster -n slurm

Common causes:

  • LoginReady / ControllerReady stuck: CSO provisions these node pools automatically when you apply the SlurmCluster CR. If they're stuck, check that the underlying VM provisioning succeeded — crusoe kubernetes nodepools list --cluster-name my-slurm-cluster should show CSO-managed controller and login pools in RUNNING state. Quota or capacity issues at the project level are the most common reason.
  • SlinkyReady / TopographReady failing: the operator's Helm install is failing. Check the operator pod logs: kubectl logs -n slurm deploy/crusoe-slurm-operator.
  • CertManagerReady failing: cert-manager pods aren't healthy. Check kubectl get pods -n cert-manager.

Compute nodes don't appear in sinfo

Verify the labels on your compute nodes:

kubectl get nodes -l slurm.crusoe.ai/compute-node-type=true \
  -L crusoe.ai/nodepool.id,crusoe.ai/nodepool.name

Each compute node must have:

  • slurm.crusoe.ai/compute-node-type=true (set by you on the node pool)
  • crusoe.ai/nodepool.id=<id> (auto-set by CMK)

If labels are missing, the operator's NodeReconciler will not create a NodeSet for them. Re-create the node pool with the correct --node-labels argument.

For a deeper diagnostic dump (recommended when escalating to support), run the slurm-debug tool — see the Advanced: Kubernetes Operations guide.

Next Steps