Quickstart
This guide walks you through creating a Managed Slurm cluster, adding GPU worker nodes, and running your first job — all from the Crusoe CLI. You can also create Slurm clusters through the Crusoe Cloud Console by navigating to Orchestration > Slurm in the left-hand navigation pane.
Step 1: Create a Slurm Cluster
Create a new Managed Slurm cluster with a single command:
crusoe slurm clusters create \
--name my-slurm-cluster \
--location us-east1-a \
--keyfile ~/.ssh/id_ed25519.pub
This command provisions a complete Slurm environment, including the underlying Kubernetes cluster, Slurm controller, login nodes, and shared storage. The required add-ons are automatically included.
Required flags:
| Flag | Description |
|---|---|
| --name | Name for your Slurm cluster |
| --location | Crusoe Cloud location (e.g., us-east1-a) |
| --keyfile | Path to your SSH public key file for root access to login nodes |
Optional flags:
| Flag | Default | Description |
|---|---|---|
| --login-node-type | c1a.8x | Instance type for login nodes. Only CPU instance types are supported. |
| --login-replicas | 2 | Number of login node replicas. Minimum 1, maximum 10. |
| --home-volume-size | 1Ti | Shared /home volume size. Format: <n>Ti, where 1 <= n <= 1000 (e.g., 1Ti, 10Ti, 50Ti) |
| --subnet-id | — | Subnet ID for the cluster |
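For example, to create a cluster with four login nodes and a larger shared home volume (flags as documented above; the values are illustrative):
crusoe slurm clusters create \
--name my-slurm-cluster \
--location us-east1-a \
--keyfile ~/.ssh/id_ed25519.pub \
--login-node-type c1a.8x \
--login-replicas 4 \
--home-volume-size 10Ti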
Cluster creation typically takes around 20 minutes. The command will wait for the operation to complete and display the result.
Step 2: Check Cluster Status
Verify your cluster is running:
crusoe slurm clusters get my-slurm-cluster
Example output:
name: my-slurm-cluster
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
state: RUNNING
location: us-east1-a
login node endpoint: 160.211.64.102
login node type: c1a.8x
login replicas: 2
nodesets: []
home volume size: 1Ti
subnet id: 5efd0079-bf7e-4e0a-b879-b9af83ac3cac
root ssh pub keys: [ssh-ed25519 AAAA...]
Wait until the state field shows RUNNING before proceeding.
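To avoid re-running the command by hand, you can poll it with watch, for example once a minute:
watch -n 60 crusoe slurm clusters get my-slurm-cluster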
To list all your Slurm clusters:
crusoe slurm clusters list
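Example output (illustrative; your IDs and values will differ):
name              id                                    location    state
my-slurm-cluster  a1b2c3d4-e5f6-7890-abcd-ef1234567890  us-east1-a  RUNNING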
Step 3: Add GPU Worker Nodes
Add a node set to provide GPU compute capacity:
crusoe slurm nodesets create \
--name gpu-workers \
--cluster-name my-slurm-cluster \
--type h100-80gb-sxm-ib.8x \
--count 2 \
--ib-partition-id <ib-partition-id>
Required flags:
| Flag | Description |
|---|---|
| --name | Name for the node set |
| --cluster-name or --cluster-id | The Slurm cluster to attach to |
| --type | GPU instance type (see Supported GPU Types) |
| --count | Number of worker nodes |
Optional flags:
| Flag | Description |
|---|---|
| --ib-partition-id | InfiniBand partition ID for high-speed interconnect |
| --keyfile | Path to SSH public key file for worker node access |
| --subnet-id | Subnet for the node pool |
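For example, to add a node set whose workers accept a dedicated SSH key (flags as in the tables above; values are illustrative):
crusoe slurm nodesets create \
--name gpu-workers \
--cluster-name my-slurm-cluster \
--type h100-80gb-sxm-ib.8x \
--count 2 \
--ib-partition-id <ib-partition-id> \
--keyfile ~/.ssh/id_ed25519.pub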
Step 4: Verify Node Set Status
Check that your worker nodes are ready:
crusoe slurm nodesets list --cluster-name my-slurm-cluster
Example output:
name id type count state
gpu-workers b2c3d4e5-f6a7-8901-bcde-f12345678901 h100-80gb-sxm-ib.8x 2 RUNNING
Wait until the state shows RUNNING. You can also check a specific node set:
crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster
Step 5: Connect to Your Cluster
Use the login node endpoint from Step 2 to SSH into your cluster:
ssh root@<login-node-endpoint>
If you are prompted for a password, specify your private key explicitly: ssh -i ~/.ssh/id_ed25519 root@<login-node-endpoint>
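Optionally, add a host alias to ~/.ssh/config so the user and key are picked up automatically (the alias my-slurm is arbitrary):
Host my-slurm
    HostName <login-node-endpoint>
    User root
    IdentityFile ~/.ssh/id_ed25519
After that, ssh my-slurm is enough to connect.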
Once connected, verify your Slurm cluster is healthy:
sinfo
You should see your worker nodes in an idle state, ready to accept jobs.
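The output will look roughly like the following; the partition name and node names depend on your cluster configuration:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle gpu-workers-[0-1]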
Step 6: Run Your First Job
Interactive GPU Test
Run a quick interactive test to verify GPU access:
srun --gpus=8 nvidia-smi
This allocates 8 GPUs on a worker node (a full node for the h100-80gb-sxm-ib.8x type) and runs nvidia-smi, displaying GPU information.
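For ad-hoc debugging, you can also request an interactive shell on a worker; --pty attaches your terminal to the task:
srun --gpus=8 --pty bash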
Batch Job
Create a batch job script named hello-gpu.batch:
#!/bin/bash
#SBATCH --job-name=hello-gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=5:00
#SBATCH --output=hello-gpu_%j.out
srun nvidia-smi
Submit the job:
sbatch hello-gpu.batch
Monitor the job:
squeue
Once the job completes, check the output:
cat hello-gpu_<job-id>.out
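To confirm that jobs can span both workers, here is a minimal multi-node sketch (save it as multi-node.batch; it assumes the two-node gpu-workers set from Step 3):
#!/bin/bash
#SBATCH --job-name=multi-node-test
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --time=5:00
#SBATCH --output=multi-node_%j.out

# Launch one task per node; each prints the hostname it ran on
srun --ntasks-per-node=1 hostname
Submit it with sbatch multi-node.batch; the output file should list two distinct hostnames.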
Managing Node Sets
Adding Another Node Set
You can attach multiple node sets to the same cluster, for example to mix different GPU instance types:
crusoe slurm nodesets create \
--name a100-workers \
--cluster-name my-slurm-cluster \
--type a100-80gb-sxm-ib.8x \
--count 4 \
--ib-partition-id <ib-partition-id>
Listing Node Sets
crusoe slurm nodesets list --cluster-name my-slurm-cluster
Getting Node Set Details
crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster
Deleting a Node Set
crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
Deleting a Cluster
To delete a Slurm cluster, first remove all node sets, then delete the cluster:
crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
crusoe slurm clusters delete my-slurm-cluster
Deleting a cluster permanently removes all Slurm state, user data on the shared volume, and job history. This action cannot be undone.
Slurm Commands Reference
Once connected to a login node, use standard Slurm commands to manage jobs:
| Command | Description |
|---|---|
| sinfo | View cluster status and node information |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
For GPU jobs, specify GPU requirements using the --gpus flag:
srun --gpus=1 nvidia-smi # Request 1 GPU
srun --gpus=8 my-training-script # Request 8 GPUs (full node)
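To stop a job, pass the job ID shown by squeue to scancel:
scancel <job-id>        # Cancel a single job
scancel --user=$USER    # Cancel all of your own jobs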
Troubleshooting
Common Issues
| Issue | Resolution |
|---|---|
| Nodes in drain state | Check node reasons with sinfo -R to identify configuration issues |
| GPU not detected | Check node set status via crusoe slurm nodesets get <name> --cluster-name <cluster> |
| Job allocation failures | Check available resources with sinfo and verify job requirements are within cluster capacity |
| SSH connection refused | Ensure the cluster is in RUNNING state and your SSH key matches the one used during creation |
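After the underlying issue is fixed, a drained node can usually be returned to service with scontrol (run from a login node; node names are those reported by sinfo):
sinfo -R                                            # Show drained nodes and their reasons
scontrol update NodeName=<node-name> State=RESUME   # Return the node to service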
Checking Cluster Health
From a login node:
sinfo # Check node states
scontrol show node # View detailed node information
scontrol show config # View Slurm configuration
Next Steps
- User Management — Add users and groups to your cluster
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Slurm Metrics — Monitor cluster health and performance
- Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
- For Slurm command reference, see the official Slurm documentation