Quickstart
This guide walks you through creating a Managed Slurm cluster, adding GPU worker nodes, and running your first job — all from the Crusoe CLI. You can also create Slurm clusters through the Crusoe Cloud Console by navigating to Orchestration > Slurm in the left-hand navigation pane.
Step 1: Create a Slurm Cluster
Create a new Managed Slurm cluster with a single command:
crusoe slurm clusters create \
--name my-slurm-cluster \
--location us-east1-a \
--keyfile ~/.ssh/id_ed25519.pub
This command provisions a complete Slurm environment, including the underlying Kubernetes cluster, Slurm controller, login nodes, and shared storage. The required add-ons are automatically included.
Required flags:
| Flag | Description |
|---|---|
| --name | Name for your Slurm cluster |
| --location | Crusoe Cloud location (e.g., us-east1-a) |
| --keyfile | Path to your SSH public key file for root access to login nodes |
Optional flags:
| Flag | Default | Description |
|---|---|---|
| --login-node-type | c1a.8x | Instance type for login nodes. Only CPU instance types are supported. |
| --login-replicas | 2 | Number of login node replicas. Minimum 1, maximum 10. |
| --home-volume-size | 1Ti | Shared /home volume size. Format: <n>Ti, where 1 <= n <= 1000 (e.g., 1Ti, 10Ti, 50Ti) |
| --subnet-id | — | Subnet ID for the cluster |
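For example, to create a cluster with four login nodes and a larger shared home volume (flags as documented above; the values are illustrative):
crusoe slurm clusters create \
--name my-slurm-cluster \
--location us-east1-a \
--keyfile ~/.ssh/id_ed25519.pub \
--login-node-type c1a.8x \
--login-replicas 4 \
--home-volume-size 10Ti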
Cluster creation typically takes around 20 minutes. The command will wait for the operation to complete and display the result.
Step 2: Check Cluster Status
Verify your cluster is running:
crusoe slurm clusters get my-slurm-cluster
Example output:
name: my-slurm-cluster
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
state: RUNNING
location: us-east1-a
login node endpoint: 160.211.64.102
login node type: c1a.8x
login replicas: 2
nodesets: []
home volume size: 1Ti
subnet id: 5efd0079-bf7e-4e0a-b879-b9af83ac3cac
root ssh pub keys: [ssh-ed25519 AAAA...]
Wait until the state field shows RUNNING before proceeding.
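To avoid re-running the command by hand, you can poll it with watch, for example once a minute:
watch -n 60 crusoe slurm clusters get my-slurm-cluster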
To list all your Slurm clusters:
crusoe slurm clusters list
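Example output (illustrative; your IDs and values will differ):
name              id                                    location    state
my-slurm-cluster  a1b2c3d4-e5f6-7890-abcd-ef1234567890  us-east1-a  RUNNING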
Step 3: Add GPU Worker Nodes
Add a node set to provide GPU compute capacity:
crusoe slurm nodesets create \
--name gpu-workers \
--cluster-name my-slurm-cluster \
--type h100-80gb-sxm-ib.8x \
--count 2 \
--ib-partition-id <ib-partition-id>
Required flags:
| Flag | Description |
|---|---|
| --name | Name for the node set |
| --cluster-name or --cluster-id | The Slurm cluster to attach to |
| --type | GPU instance type (see Supported GPU Types) |
| --count | Number of worker nodes |
Optional flags:
| Flag | Description |
|---|---|
| --ib-partition-id | InfiniBand partition ID for high-speed interconnect |
| --keyfile | Path to SSH public key file for worker node access |
| --subnet-id | Subnet for the node pool |
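For example, to add a node set whose workers accept a dedicated SSH key (flags as in the tables above; values are illustrative):
crusoe slurm nodesets create \
--name gpu-workers \
--cluster-name my-slurm-cluster \
--type h100-80gb-sxm-ib.8x \
--count 2 \
--ib-partition-id <ib-partition-id> \
--keyfile ~/.ssh/id_ed25519.pub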
Step 4: Verify Node Set Status
Check that your worker nodes are ready:
crusoe slurm nodesets list --cluster-name my-slurm-cluster
Example output:
name id type count state
gpu-workers b2c3d4e5-f6a7-8901-bcde-f12345678901 h100-80gb-sxm-ib.8x 2 RUNNING
Wait until the state shows RUNNING. You can also check a specific node set:
crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster
Step 5: Connect to Your Cluster
Use the login node endpoint from Step 2 to SSH into your cluster:
ssh root@<login-node-endpoint>
If you are prompted for a password, specify your private key explicitly: ssh -i ~/.ssh/id_ed25519 root@<login-node-endpoint>
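Optionally, add a host alias to ~/.ssh/config so the user and key are picked up automatically (the alias my-slurm is arbitrary):
Host my-slurm
    HostName <login-node-endpoint>
    User root
    IdentityFile ~/.ssh/id_ed25519
After that, ssh my-slurm is enough to connect.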
Once connected, verify your Slurm cluster is healthy:
sinfo
You should see your worker nodes in an idle state, ready to accept jobs.
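The output will look roughly like the following; the partition name and node names depend on your cluster configuration:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle gpu-workers-[0-1]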
Step 6: Run Your First Job
Interactive GPU Test
Run a quick interactive test to verify GPU access:
srun --gpus=8 nvidia-smi
This allocates 8 GPUs on a worker node (a full node for the h100-80gb-sxm-ib.8x type) and runs nvidia-smi, displaying GPU information.
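For ad-hoc debugging, you can also request an interactive shell on a worker; --pty attaches your terminal to the task:
srun --gpus=8 --pty bash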
Batch Job
Create a batch job script named hello-gpu.batch:
#!/bin/bash
#SBATCH --job-name=hello-gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8
#SBATCH --time=5:00
#SBATCH --output=hello-gpu_%j.out
srun nvidia-smi
Submit the job:
sbatch hello-gpu.batch
Monitor the job:
squeue
Once the job completes, check the output:
cat hello-gpu_<job-id>.out
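To confirm that jobs can span both workers, here is a minimal multi-node sketch (save it as multi-node.batch; it assumes the two-node gpu-workers set from Step 3):
#!/bin/bash
#SBATCH --job-name=multi-node-test
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --time=5:00
#SBATCH --output=multi-node_%j.out

# Launch one task per node; each prints the hostname it ran on
srun --ntasks-per-node=1 hostname
Submit it with sbatch multi-node.batch; the output file should list two distinct hostnames.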
Managing Node Sets
Adding Another Node Set
You can attach multiple node sets to the same cluster, for example to mix different GPU instance types:
crusoe slurm nodesets create \
--name a100-workers \
--cluster-name my-slurm-cluster \
--type a100-80gb-sxm-ib.8x \
--count 4 \
--ib-partition-id <ib-partition-id>
Listing Node Sets
crusoe slurm nodesets list --cluster-name my-slurm-cluster
Getting Node Set Details
crusoe slurm nodesets get gpu-workers --cluster-name my-slurm-cluster
Deleting a Node Set
crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
Deleting a Cluster
To delete a Slurm cluster, first remove all node sets, then delete the cluster:
crusoe slurm nodesets delete gpu-workers --cluster-name my-slurm-cluster
crusoe slurm clusters delete my-slurm-cluster
Deleting a cluster permanently removes all Slurm state, user data on the shared volume, and job history. This action cannot be undone.
Slurm Commands Reference
Once connected to a login node, use standard Slurm commands to manage jobs:
| Command | Description |
|---|---|
| sinfo | View cluster status and node information |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
For GPU jobs, specify GPU requirements using the --gpus flag:
srun --gpus=1 nvidia-smi # Request 1 GPU
srun --gpus=8 my-training-script # Request 8 GPUs (full node)
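To stop a job, pass the job ID shown by squeue to scancel:
scancel <job-id>        # Cancel a single job
scancel --user=$USER    # Cancel all of your own jobs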
Troubleshooting
Common Issues
| Issue | Resolution |
|---|---|
| Nodes in drain state | Check node reasons with sinfo -R to identify configuration issues |
| GPU not detected | Check node set status via crusoe slurm nodesets get <name> --cluster-name <cluster> |
| Job allocation failures | Check available resources with sinfo and verify job requirements are within cluster capacity |
| SSH connection refused | Ensure the cluster is in RUNNING state and your SSH key matches the one used during creation |
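After the underlying issue is fixed, a drained node can usually be returned to service with scontrol (run from a login node; node names are those reported by sinfo):
sinfo -R                                            # Show drained nodes and their reasons
scontrol update NodeName=<node-name> State=RESUME   # Return the node to service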
Checking Cluster Health
From a login node:
sinfo # Check node states
scontrol show node # View detailed node information
scontrol show config # View Slurm configuration
Next Steps
- User Management — Add users and groups to your cluster
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Slurm Metrics — Monitor cluster health and performance
- Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
- For Slurm command reference, see the official Slurm documentation