Quickstart
This guide walks you through creating a Managed Slurm cluster, adding GPU worker nodes, and running your first job — all from the Crusoe CLI. You can also create Slurm clusters through the Crusoe Cloud Console by navigating to Orchestration > Slurm in the left-hand navigation pane.
Prerequisites
- Make sure you are using the latest crusoe CLI version.
- Contact customer support to confirm that your quotas are high enough to create Slurm clusters and the associated underlying resources.
Step 1: Create a Slurm Cluster
Create a new Managed Slurm cluster with a single command:
crusoe slurm clusters create \
--name my-slurm-cluster \
--location us-southcentral1-a \
--keyfile ~/.ssh/id_ed25519.pub
This command provisions a complete Slurm environment including the underlying Kubernetes cluster, Slurm controller, login nodes, and shared storage. The required add-ons are automatically included.
Required flags:
| Flag | Description |
|---|---|
| --name | Name for your Slurm cluster |
| --location | Crusoe Cloud location (e.g., us-east1-a) |
| --keyfile | Path to your SSH public key file for root access to login nodes |
Optional flags:
| Flag | Default | Description |
|---|---|---|
| --login-node-type | c1a.8x | Instance type for login nodes. Only CPU types are supported. |
| --login-replicas | 2 | Number of login node replicas. Minimum 1, maximum 10. |
| --home-volume-size | 10Ti | Shared /home volume size. Format: <n>Ti where n >= 1 and n <= 1000 (e.g. 1Ti, 10Ti, 50Ti) |
| --subnet-id | — | Subnet ID for the cluster |
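The format rule for --home-volume-size can be checked locally before calling the CLI. The following is a minimal sketch in POSIX shell. `valid_home_volume_size` is a hypothetical helper name, not part of the crusoe CLI, and rejecting leading zeros is an assumption, since the source does not say how the CLI treats them.

```shell
# Hypothetical helper (not part of the crusoe CLI): returns 0 if the argument
# matches <n>Ti with 1 <= n <= 1000, and 1 otherwise.
valid_home_volume_size() {
  n="${1%Ti}"                      # strip the "Ti" suffix
  [ "$n" != "$1" ] || return 1     # the suffix was missing
  case "$n" in
    ''|*[!0-9]*|0*) return 1 ;;    # empty, non-digit, or leading zero (assumed invalid)
  esac
  # The length check keeps very long digit strings out of the numeric tests.
  [ "${#n}" -le 4 ] && [ "$n" -ge 1 ] && [ "$n" -le 1000 ]
}
```

For example, `valid_home_volume_size 50Ti` succeeds, while `valid_home_volume_size 10Gi` and `valid_home_volume_size 1001Ti` fail.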
Cluster creation typically takes about 30 minutes. The command waits for the operation to complete and then displays the result.
Step 2: Check Cluster Status
Verify your cluster is running:
crusoe slurm clusters get my-slurm-cluster
Example output:
name: my-slurm-cluster
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
state: RUNNING
location: us-southcentral1-a
login node endpoint: 160.211.64.102
login node type: c1a.8x
login replicas: 2
nodesets: []
home volume size: 10Ti
subnet id: 5efd0079-bf7e-4e0a-b879-b9af83ac3cac
root ssh pub keys: [ssh-ed25519 AAAA...]
Wait until the state field shows RUNNING before proceeding.
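If you script this check, one option is to read the state line from the command output and poll until it reports RUNNING. The sketch below assumes the output has the same `key: value` layout as the example above, which may differ between CLI versions. `cluster_state` and `wait_for_cluster` are hypothetical helper names, and the 60-second interval is an arbitrary choice.

```shell
# Hypothetical helper: reads `crusoe slurm clusters get` output on stdin and
# prints the value of the "state:" line (assumes the layout shown above).
cluster_state() {
  awk -F': *' '{ key = $1; sub(/^[ \t]+/, "", key) }
               key == "state" { print $2; exit }'
}

# Hypothetical polling loop: checks once a minute until the cluster is RUNNING.
wait_for_cluster() {
  until [ "$(crusoe slurm clusters get "$1" | cluster_state)" = "RUNNING" ]; do
    echo "cluster $1 is not RUNNING yet; checking again in 60s" >&2
    sleep 60
  done
}

# Usage: wait_for_cluster my-slurm-cluster
```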
To list all your Slurm clusters:
crusoe slurm clusters list
Step 3: Add GPU Worker Nodes
Add a node set to provide GPU compute capacity:
crusoe slurm nodesets create \
--name gpu-workers \
--cluster-name my-slurm-cluster \
--type h100-80gb-sxm-ib.8x \
--count 2 \
--ib-partition-id <ib-partition-id>
Required flags:
| Flag | Description |
|---|---|
| --name | Name for the node set |
| --cluster-name or --cluster-id | The Slurm cluster to attach to |
| --type | GPU instance type (see Supported GPU Types) |
| --count | Number of worker nodes |
Optional flags:
| Flag | Description |
|---|---|
| --ib-partition-id | InfiniBand partition ID for high-speed interconnect |
| --keyfile | Path to SSH public key file for worker node access |
| --subnet-id | Subnet for the node set |
Step 4: Verify Node Set Status
Check that your worker nodes are ready:
crusoe slurm nodesets list --cluster-name my-slurm-cluster
Example output:
name id type count state
gpu-workers b2c3d4e5-f6a7-8901-bcde-f12345678901 h100-80gb-sxm-ib.8x 2 RUNNING
Wait until the state shows RUNNING. You can also check a specific node set:
crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster
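With several node sets, it can help to print only the ones that are not ready yet. This sketch assumes the list output is whitespace-separated with a header row and the state in the last column, as in the example output above. `pending_nodesets` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `crusoe slurm nodesets list` output on stdin and
# prints the name of every node set whose last column is not RUNNING.
# Assumes one header row followed by one whitespace-separated row per node set.
pending_nodesets() {
  awk 'NR > 1 && NF > 0 && $NF != "RUNNING" { print $1 }'
}

# Usage:
#   crusoe slurm nodesets list --cluster-name my-slurm-cluster | pending_nodesets
```

Empty output means every listed node set reports RUNNING.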
Step 5: Connect to Your Cluster
Use the login node endpoint from Step 2 to SSH into your cluster:
ssh root@<login-node-endpoint>
If you are prompted for a password, specify your private key explicitly: ssh -i ~/.ssh/id_ed25519 root@<login-node-endpoint>
Once connected, verify your Slurm cluster is healthy:
sinfo
You should see your worker nodes in an idle state, ready to accept jobs.
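To check the idle state from a script rather than by eye, you can ask sinfo for one line per node and filter on the state column. This is a sketch using standard sinfo options (`-N` for node-oriented output, `-h` to drop the header, `-o` for the format). `idle_nodes` is a hypothetical helper name, and it deliberately matches only the plain `idle` state, not flagged variants such as `idle*`.

```shell
# Hypothetical helper: reads "<node> <state>" lines on stdin and prints the
# names of nodes whose state is exactly "idle". sort -u removes the duplicate
# lines that appear when a node belongs to more than one partition.
idle_nodes() {
  sort -u | awk '$2 == "idle" { print $1 }'
}

# Usage on a login node:
#   sinfo -N -h -o '%N %T' | idle_nodes
```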
Step 6: Run Your First Job
Interactive GPU Test
Run a quick interactive test to verify GPU access:
srun --gpus=8 nvidia-smi
This allocates a worker node with 8 GPUs and runs nvidia-smi, displaying GPU information.
Batch Job
Create a batch job script named hello-gpu.batch:
#!/bin/bash
#SBATCH --job-name=hello-gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=5:00
#SBATCH --output=/home/hello-gpu_%j.out
srun nvidia-smi
Always write job output to a path under /home. /home is the shared volume mounted on every login and worker node, so output written there is visible from the login node where you run sbatch. If you use a relative path or any other local path, the file lands on the worker node's local filesystem (typically /root for the root user), and the cat step below will fail with No such file or directory.
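A quick pre-submission check can catch an output path outside /home. The sketch below is a simple prefix test on the first `#SBATCH --output=` line. It assumes the `--output=<path>` form used in the script above, and `output_under_home` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only if the first "#SBATCH --output=" line in
# the given batch script points to a path under /home. Scripts that use the
# short form (-o) or no output directive at all are reported as failures.
output_under_home() {
  out_path="$(sed -n 's/^#SBATCH[[:space:]]*--output=//p' "$1" | head -n 1)"
  case "$out_path" in
    /home/*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage:
#   output_under_home hello-gpu.batch && sbatch hello-gpu.batch
```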
Submit the job:
sbatch hello-gpu.batch
Monitor the job:
squeue
Once the job completes, check the output:
cat /home/hello-gpu_<job-id>.out
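To avoid copying the job ID by hand, sbatch can print it in a machine-readable form with its standard `--parsable` option. The sketch below builds the output path from that value. `job_output_path` is a hypothetical helper name, and the path pattern is the one from the example script above.

```shell
# Hypothetical helper: turns the value printed by `sbatch --parsable` into the
# output path used by hello-gpu.batch. --parsable prints "<job-id>" or
# "<job-id>;<cluster-name>", so anything after the first ";" is dropped.
job_output_path() {
  job_id="${1%%;*}"
  printf '/home/hello-gpu_%s.out\n' "$job_id"
}

# Usage on a login node, once the job has finished:
#   job_id="$(sbatch --parsable hello-gpu.batch)"
#   cat "$(job_output_path "$job_id")"
```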
Managing Node Sets
Adding Another Node Set
You can attach multiple node sets (for example, to include different GPU types) to the same cluster:
crusoe slurm nodesets create \
--name a100-workers \
--cluster-name my-slurm-cluster \
--type a100-80gb-sxm-ib.8x \
--count 4 \
--ib-partition-id <ib-partition-id>
Listing Node Sets
crusoe slurm nodesets list --cluster-name my-slurm-cluster
Getting Node Set Details
crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster
Deleting a Node Set
crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
Deleting a Cluster
To delete a Slurm cluster, first remove all node sets, then delete the cluster:
crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
crusoe slurm clusters delete my-slurm-cluster
Deleting a cluster removes the Slurm controller, login nodes, node sets, and all Slurm state (job history, running jobs, configuration). This action cannot be undone.
The shared /home volume is not deleted automatically. It is preserved so you don't lose data if you accidentally delete a cluster, and so you can recover or migrate the data afterward. To delete the volume, you must remove it manually. If you are having trouble finding the right volume to delete, contact Crusoe Cloud Support for help. Note that the volume will continue to incur storage charges until it is deleted.
Slurm Commands Reference
Once connected to a login node, use standard Slurm commands to manage jobs:
| Command | Description |
|---|---|
| sinfo | View cluster status and node information |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
For GPU jobs, specify GPU requirements using the --gpus flag:
srun --gpus=1 nvidia-smi # Request 1 GPU
srun --gpus=8 my-training-script # Request 8 GPUs (full node)
Troubleshooting
Common Issues
| Issue | Resolution |
|---|---|
| Nodes in drain state | Check node reasons with sinfo -R to identify configuration issues |
| GPU not detected | Check node set status via crusoe slurm nodesets get <name> --cluster-name <cluster> |
| Job allocation failures | Check available resources with sinfo and verify job requirements are within cluster capacity |
| SSH connection refused | Ensure the cluster is in RUNNING state and your SSH key matches the one used during creation |
Checking Cluster Health
From a login node:
sinfo # Check node states
scontrol show node # View detailed node information
scontrol show config # View Slurm configuration
Next Steps
- User Management — Add users and groups to your cluster
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Slurm Metrics — Monitor cluster health and performance
- Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
- For Slurm command reference, see the official Slurm documentation