Crusoe Managed Slurm on CMK

Crusoe Managed Slurm enables high-performance computing workload orchestration on Crusoe Cloud infrastructure. By deploying Slurm on top of Crusoe Managed Kubernetes (CMK), you can leverage Slurm's powerful job scheduling capabilities while benefiting from Kubernetes' container orchestration and Crusoe's GPU-optimized infrastructure.

This guide walks you through the process of setting up a Crusoe Managed Slurm cluster.

note

Crusoe Managed Slurm is currently available by request. Please reach out to Crusoe Cloud Support to learn more.

Prerequisites

Before you begin, ensure you have:

  • Access to the Crusoe CLI
  • Appropriate permissions to create CMK clusters

Supported GPU Types

  • 8x NVIDIA B200 180GB (b200-180gb-sxm-ib.8x)
  • 8x NVIDIA H200 141GB (h200-141gb-sxm-ib.8x)
  • 8x NVIDIA H100 80GB (h100-80gb-sxm-ib.8x)
  • 8x NVIDIA A100 80GB (a100-80gb-sxm-ib.8x)

Support for additional GPU types is coming soon.

Creating a Slurm-Enabled Cluster

Step 1: Create the CMK Cluster with Required Add-ons

Create a new CMK cluster with the Crusoe Slurm Operator and its dependencies using the following command:

crusoe kubernetes clusters create \
--name <name> \
--cluster-version <cluster-version> \
--location <location> \
--add-ons crusoe_csi,nvidia_gpu_operator,nvidia_network_operator,crusoe_managed_slurm

Required Add-ons

The following add-ons must be included for Managed Slurm to function properly:

Add-on                    Description
crusoe_managed_slurm      The Slurm operator for Kubernetes
crusoe_csi                Crusoe Container Storage Interface
nvidia_gpu_operator       NVIDIA GPU support
nvidia_network_operator   NVIDIA networking capabilities
note

The crusoe_managed_slurm add-on is only available on CMK versions above 1.33.4-cmk.26.

Step 2: Create Node Pools

Once your CMK cluster is running, add the required node pools for your Slurm deployment.

Create a Control Plane Node Pool

Create a node pool for the Slurm control plane with the appropriate node labels. It is recommended to use a slice type of c1a.4x or larger:

crusoe kubernetes nodepools create \
--name slurm-control \
--count 2 \
--cluster-name <cluster-name> \
--type c1a.4x \
--node-labels 'slurm.crusoe.ai/controller-node-type=true,slurm.crusoe.ai/login-node-type=true'

Create Worker Node Pools

Create node pools for Slurm workers with your desired instance type and count:

crusoe kubernetes nodepools create \
--name slurm-workers \
--count 2 \
--cluster-name <cluster-name> \
--type <desired-instance-type> \
--node-labels 'slurm.crusoe.ai/compute-node-type=true'
tip

Note the node pool ID from the command output, as you'll need it in Step 4.

Step 3: Configure Cluster Access

Once your CMK cluster is provisioned, configure your local kubectl to interact with the cluster:

crusoe kubernetes clusters get-credentials <cluster-name>

This command retrieves your cluster's kubeconfig and configures your local kubectl context. Verify the connection:

kubectl cluster-info
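
As a quick follow-up check, you can confirm that the node labels applied in Step 2 are present on your nodes:

kubectl get nodes -L slurm.crusoe.ai/controller-node-type,slurm.crusoe.ai/login-node-type,slurm.crusoe.ai/compute-node-type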

Step 4: Deploy the Slurm Cluster

Deploy your Slurm cluster by creating a namespace and then applying two Kubernetes custom resources: the SlurmCluster configuration and the SlurmNodeSet configuration.

Create the Slurm Namespace

Create a file named slurm-namespace.yaml with the following content:

apiVersion: v1
kind: Namespace
metadata:
  name: slurm

Apply the configuration to create the slurm namespace:

kubectl apply -f slurm-namespace.yaml

Configure the Slurm Cluster

Create a file named slurm-cluster.yaml with the following content. Replace the placeholder SSH key in spec.loginSet.rootSSHPublicKey with your public SSH key to enable access to the login nodes:

apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmCluster
metadata:
  generateName: slurm-cluster-
  namespace: slurm
spec:
  containerRegistry: "ghcr.io/crusoecloud/cmk/slurm-containers"
  clusterVersion: "25.11.2-cmk0.0.2"

  # Controller configuration
  controller:
    nodeSelector:
      slurm.crusoe.ai/controller-node-type: "true"

  # Login node configuration
  loginSet:
    replicas: 2 # Recommended to have at least 2 login replicas
    nodeSelector:
      slurm.crusoe.ai/login-node-type: "true"
    rootSSHPublicKey: |
      ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC... # Replace with your public SSH key

  # Shared storage for user home directories
  userHomeVolumeClaimTemplate:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 1Ti

Create the Slurm cluster:

kubectl create -f slurm-cluster.yaml --save-config=true

This deploys the Slurm controller and login pods. The Crusoe Managed Slurm operator automatically installs and manages the following dependencies:

  • cert-manager - Certificate management for Kubernetes
  • Crusoe Load Balancer Controller - Load balancing for cluster services
  • Slinky - Slurm integration components
  • Topograph - Topology management
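
Before moving on, you can watch the controller and login pods come online (pod names will vary):

kubectl get pods -n slurm -w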

Configure the Compute Nodes

Create a file named slurm-node-set.yaml with the following content. Replace <slurm-cluster-generated-name> with the name generated when you created the SlurmCluster, and <worker-nodepool-id> with the node pool ID from Step 2:

apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmNodeSet
metadata:
  name: slurm-worker-node-set
  namespace: slurm
spec:
  clusterReference: <slurm-cluster-generated-name>
  count: 2
  nodePoolID: "<worker-nodepool-id>"
  nodeSelector:
    slurm.crusoe.ai/compute-node-type: "true"

Apply the configuration:

kubectl apply -f slurm-node-set.yaml

This registers your worker nodes with the Slurm cluster and configures them. The nodes will be ready for job scheduling once the status.readyReplicas field of your SlurmNodeSet equals spec.count.
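
One quick way to check this is to read the status directly from the applied manifest:

kubectl get -f slurm-node-set.yaml -o jsonpath='{.status.readyReplicas}'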

Step 5: Access the Cluster

Retrieve the SSH command to access your login node:

kubectl get services -n slurm -o json | jq -r '.items[] | select(.metadata.name | contains("login")) | "ssh root@\(.status.loadBalancer.ingress[0].ip)"'

This command outputs an SSH connection string. Use it to access your login pods:

ssh root@<external-ip>
tip

If you are prompted for a password, you may need to add -i path/to/private-key to specify your SSH key location.

You should now have access to your Slurm cluster and can begin submitting jobs.

Storage Configuration

Slurm requires a shared filesystem to share data across login and compute nodes. The storage class is automatically configured when you create the cluster with the crusoe_csi add-on enabled. This is what backs the /home directory on your login and worker pods.

The managed service creates a crusoe-csi-driver-fs-sc StorageClass that supports:

  • Volume binding mode: WaitForFirstConsumer
  • Volume expansion: Enabled
  • Provisioner: fs.csi.crusoe.ai

You can adjust storage capacity by modifying the userHomeVolumeClaimTemplate.resources.requests.storage value in the SlurmCluster custom resource and reapplying the configuration. If you are editing the slurm-cluster.yaml file directly, you may need to add name: <generatedName> to the metadata. To see these changes, run kubectl get pvc -A.
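
A minimal sketch of that workflow (the 2Ti value below is just an example):

# In slurm-cluster.yaml: set metadata.name to the generated cluster name and
# raise userHomeVolumeClaimTemplate.resources.requests.storage (for example, to 2Ti)
kubectl apply -f slurm-cluster.yaml
kubectl get pvc -A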

Using Your Slurm Cluster

Once connected to the login node, you can submit and manage jobs using standard Slurm commands:

Command   Description
sinfo     View cluster status and node information
squeue    View the job queue
sbatch    Submit a batch job
srun      Run a job interactively
scancel   Cancel a job

For GPU jobs, specify GPU requirements using the --gpus flag:

srun --gpus=1 nvidia-smi
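
The same request works from a batch script; here is a minimal illustrative example (the file name and contents are just an example), saved as gpu-check.batch:

#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --gpus=1
#SBATCH --output="%x_%j.out"

nvidia-smi

Submit it with sbatch gpu-check.batch and check gpu-check_<job_id>.out for the output.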

Monitoring and Troubleshooting

Checking Cluster Status

Use the following commands to verify your cluster is healthy:

sinfo                 # Check node states
scontrol show node    # View detailed node information
scontrol show config  # View Slurm configuration

Common Issues

Issue                     Resolution
Nodes in drain state      Check node reasons with sinfo -R to identify configuration issues
GPU not detected          Verify the GPU operator is running and nodes have the correct labels
Job allocation failures   Check available resources with sinfo and verify job requirements are within cluster capacity
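
From kubectl, the following checks can help narrow down the GPU and label issues above (the GPU operator namespace shown here is an assumption and may differ in your deployment):

kubectl get pods -n gpu-operator                        # GPU operator pods should be Running
kubectl get nodes -L slurm.crusoe.ai/compute-node-type  # worker nodes should carry the compute label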

Running NCCL Tests

To run NCCL tests, SSH into the login pod and switch to the /home directory. Create the following script named nccl_test.batch:

#!/bin/bash

#SBATCH --job-name=nccl_tests
#SBATCH --nodes=<number of nodes>
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=20:00
#SBATCH --output="%x_%j.out"
#SBATCH --exclusive

export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h200-141gb-sxm-ib-cloud-hypervisor.xml
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
export NCCL_IB_MERGE_VFS=0
export NCCL_DEBUG=WARN

export OMPI_MCA_coll_hcoll_enable=0
export PMIX_MCA_gds='^ds12'

export UCX_NET_DEVICES="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"

srun --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2

Submit the test:

sbatch nccl_test.batch

From the login pod, use squeue to see the tasks being run. Once there are no more tasks in the queue, check the output file nccl_tests_<job_id>.out in the /home directory to view the results.
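
For example:

squeue                              # wait until the job leaves the queue
cat /home/nccl_tests_<job_id>.out   # per-size bus bandwidth results from all_reduce_perf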

To interact directly with a worker pod:

srun --pty bash                              # Connect to any worker pod
srun --nodelist=<worker-pod-name> --pty bash # Connect to a specific worker pod

Deleting a Slurm Cluster

To delete the Slurm cluster, run the following commands:

kubectl delete namespace slurm
kubectl delete namespace slinky

This deletes the slurm and slinky namespaces and all the resources associated with them.

Next Steps

For detailed information on Slurm usage and advanced configuration options, refer to the official Slurm documentation.

Support

If you encounter issues during setup or need assistance with your Managed Slurm deployment, please contact Crusoe Support.