Quickstart

This guide walks you through creating a Managed Slurm cluster, adding GPU worker nodes, and running your first job — all from the Crusoe CLI. You can also create Slurm clusters through the Crusoe Cloud Console by navigating to Orchestration > Slurm in the left-hand navigation pane.

Step 1: Create a Slurm Cluster

Create a new Managed Slurm cluster with a single command:

crusoe slurm clusters create \
--name my-slurm-cluster \
--location us-east1-a \
--keyfile ~/.ssh/id_ed25519.pub

This command provisions a complete Slurm environment, including the underlying Kubernetes cluster, the Slurm controller, login nodes, and shared storage. The required add-ons are included automatically.

Required flags:

Flag        Description
--name      Name for your Slurm cluster
--location  Crusoe Cloud location (e.g., us-east1-a)
--keyfile   Path to your SSH public key file for root access to login nodes

Optional flags:

Flag                Default  Description
--login-node-type   c1a.8x   Instance type for login nodes. Only CPU types are supported.
--login-replicas    2        Number of login node replicas. Minimum 1, maximum 10.
--home-volume-size  1Ti      Shared /home volume size. Format: <n>Ti where 1 <= n <= 1000 (e.g., 1Ti, 10Ti, 50Ti).
--subnet-id         (none)   Subnet ID for the cluster.

Note: Cluster creation typically takes around 20 minutes. The command will wait for the operation to complete and display the result.

Step 2: Check Cluster Status

Verify your cluster is running:

crusoe slurm clusters get my-slurm-cluster

Example output:

name: my-slurm-cluster
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
state: RUNNING
location: us-east1-a
login node endpoint: 160.211.64.102
login node type: c1a.8x
login replicas: 2
nodesets: []
home volume size: 1Ti
subnet id: 5efd0079-bf7e-4e0a-b879-b9af83ac3cac
root ssh pub keys: [ssh-ed25519 AAAA...]

Wait until the state field shows RUNNING before proceeding.
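
Rather than re-running the command by hand, a simple shell loop can poll until the cluster reports RUNNING. This is a minimal sketch that assumes the state: line appears in the get output exactly as shown above:

# Poll the cluster state every 60 seconds until it reports RUNNING.
while ! crusoe slurm clusters get my-slurm-cluster | grep -q "state: RUNNING"; do
  echo "Cluster not ready yet; retrying in 60 seconds..."
  sleep 60
done
echo "Cluster is RUNNING."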

To list all your Slurm clusters:

crusoe slurm clusters list

Step 3: Add GPU Worker Nodes

Add a node set to provide GPU compute capacity:

crusoe slurm nodesets create \
--name gpu-workers \
--cluster-name my-slurm-cluster \
--type h100-80gb-sxm-ib.8x \
--count 2 \
--ib-partition-id <ib-partition-id>

Required flags:

Flag                            Description
--name                          Name for the node set
--cluster-name or --cluster-id  The Slurm cluster to attach to
--type                          GPU instance type (see Supported GPU Types)
--count                         Number of worker nodes

Optional flags:

Flag               Description
--ib-partition-id  InfiniBand partition ID for high-speed interconnect
--keyfile          Path to SSH public key file for worker node access
--subnet-id        Subnet for the node pool

Step 4: Verify Node Set Status

Check that your worker nodes are ready:

crusoe slurm nodesets list --cluster-name my-slurm-cluster

Example output:

name         id                                    type                 count  state
gpu-workers  b2c3d4e5-f6a7-8901-bcde-f12345678901  h100-80gb-sxm-ib.8x  2      RUNNING

Wait until the state shows RUNNING. You can also check a specific node set:

crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster

Step 5: Connect to Your Cluster

Use the login node endpoint from Step 2 to SSH into your cluster:

ssh root@<login-node-endpoint>

Tip: If you are prompted for a password, specify your private key explicitly: ssh -i ~/.ssh/id_ed25519 root@<login-node-endpoint>
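
For repeated connections, you can optionally add a host alias to your OpenSSH client configuration. This is a generic sketch; slurm-login is an arbitrary alias and <login-node-endpoint> is the address from Step 2:

# Entry for ~/.ssh/config -- alias name and endpoint are placeholders.
Host slurm-login
    HostName <login-node-endpoint>
    User root
    IdentityFile ~/.ssh/id_ed25519

With this entry in place, ssh slurm-login connects directly.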

Once connected, verify your Slurm cluster is healthy:

sinfo

You should see your worker nodes in an idle state, ready to accept jobs.
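
The exact output depends on your configuration, but with the two-node set from Step 3 a healthy cluster looks roughly like the following (partition and node names are illustrative; yours may differ):

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite      2   idle gpu-workers-[0-1]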

Step 6: Run Your First Job

Interactive GPU Test

Run a quick interactive test to verify GPU access:

srun --gpus=8 nvidia-smi

This allocates a worker node with 8 GPUs and runs nvidia-smi, displaying GPU information.

Batch Job

Create a batch job script named hello-gpu.batch:

#!/bin/bash

#SBATCH --job-name=hello-gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=5:00
#SBATCH --output=hello-gpu_%j.out

srun nvidia-smi

Submit the job:

sbatch hello-gpu.batch
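
On success, sbatch prints the ID assigned to your job (the number shown here is illustrative):

Submitted batch job 1234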

Monitor the job:

squeue
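
While the job is queued or running it appears as a row in the output; the values below are illustrative, and note that squeue truncates long job names:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1234      main hello-gp     root  R       0:05      1 gpu-workers-0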

Once the job completes, check the output:

cat hello-gpu_<job-id>.out
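
If job accounting is enabled on the cluster, the standard Slurm sacct command also summarizes a completed job's state, runtime, and exit code:

sacct -j <job-id> --format=JobID,JobName,State,Elapsed,ExitCode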

Managing Node Sets

Adding Another Node Set

You can attach multiple node sets (for example, to include different GPU types) to the same cluster:

crusoe slurm nodesets create \
--name a100-workers \
--cluster-name my-slurm-cluster \
--type a100-80gb-sxm-ib.8x \
--count 4 \
--ib-partition-id <ib-partition-id>

Listing Node Sets

crusoe slurm nodesets list --cluster-name my-slurm-cluster

Getting Node Set Details

crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster

Deleting a Node Set

crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster

Deleting a Cluster

To delete a Slurm cluster, first remove all node sets, then delete the cluster:

crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
crusoe slurm clusters delete my-slurm-cluster

Danger: Deleting a cluster permanently removes all Slurm state, user data on the shared volume, and job history. This action cannot be undone.
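
If anything on the shared /home volume is worth keeping, copy it off a login node before deleting the cluster. A generic rsync sketch, where the local destination path is a placeholder:

# Pull /home from the login node to a local backup directory before deletion.
rsync -avz root@<login-node-endpoint>:/home/ ./slurm-home-backup/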

Slurm Commands Reference

Once connected to a login node, use standard Slurm commands to manage jobs:

Command  Description
sinfo    View cluster status and node information
squeue   View the job queue
sbatch   Submit a batch job
srun     Run a job interactively
scancel  Cancel a job

For GPU jobs, specify GPU requirements using the --gpus flag:

srun --gpus=1 nvidia-smi           # Request 1 GPU
srun --gpus=8 my-training-script   # Request 8 GPUs (full node)
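
Beyond --gpus, standard Slurm options support finer-grained requests. These are generic Slurm flags, not Crusoe-specific, and the script names are placeholders:

srun --gpus-per-node=8 --nodes=2 my-training-script   # 8 GPUs on each of 2 nodes
srun --gpus=4 --cpus-per-gpu=12 my-training-script    # Bind 12 CPUs to each GPU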

Troubleshooting

Common Issues

Issue                    Resolution
Nodes in drain state     Check node reasons with sinfo -R to identify configuration issues
GPU not detected         Check node set status via crusoe slurm nodesets get <name> --cluster-name <cluster>
Job allocation failures  Check available resources with sinfo and verify job requirements are within cluster capacity
SSH connection refused   Ensure the cluster is in RUNNING state and your SSH key matches the one used during creation
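
Once the underlying issue behind a drained node is fixed, the standard Slurm recovery sequence uses scontrol; the node name below is a placeholder:

sinfo -R                                            # Show the reason each node was drained
scontrol update NodeName=<node-name> State=RESUME   # Return the node to service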

Checking Cluster Health

From a login node:

sinfo                 # Check node states
scontrol show node    # View detailed node information
scontrol show config  # View Slurm configuration

Next Steps