Advanced: Kubernetes Operations
Managed Slurm clusters run on Crusoe Managed Kubernetes (CMK). While the CLI and API handle most operations, you can also interact with the cluster directly via kubectl for advanced configuration, troubleshooting, and user management.
Accessing Your Cluster via kubectl
Configure kubectl to connect to your cluster:
crusoe kubernetes clusters get-credentials <cluster-name>
Verify the connection:
kubectl cluster-info
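Before digging into the Slurm resources, you can also confirm the underlying CMK nodes and pods are visible. These are standard kubectl commands, assuming the cluster's pods live in the slurm namespace alongside the custom resources described below:
kubectl get nodes -o wide # CMK nodes backing the Slurm cluster
kubectl get pods -n slurm # Pods for the Slurm controller, login, and worker nodes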
Viewing Cluster State
Slurm Custom Resources
The Crusoe Slurm Operator (CSO) uses four Custom Resource types to manage your cluster. When you create a Slurm cluster through the Slurm UI or crusoe slurm CLI, the corresponding custom resources are created, controlled, and reconciled by Crusoe in the slurm namespace.
kubectl get slurmclusters -n slurm # Clusters
kubectl get nodesets -n slurm # Node sets
kubectl get slurmusers -n slurm # All users
kubectl get slurmusergroups -n slurm # All groups
For detailed status with conditions:
kubectl describe slurmclusters <name> -n slurm
Use kubectl explain for field-level documentation on any Custom Resource. The CRDs have built-in descriptions for every field:
kubectl explain slurmclusters.spec
kubectl explain slurmusers.spec
kubectl explain slurmusergroups.spec
kubectl explain nodesets.spec
Cluster Phase and Conditions
The SlurmCluster status includes a phase field (possible values: Provisioning, Installing, Ready, NotReady, CreateFailed, Deleting, DeleteFailed) and detailed conditions for each component. Use kubectl describe to see what's happening if your cluster isn't healthy:
kubectl describe slurmclusters <name> -n slurm
Look for conditions like:
- ControllerReady — Slurm controller pod is running
- LoginReady — Login nodes are running
- SlinkyReady — Slinky Helm chart is installed
- CertManagerReady — cert-manager is installed
- LoadBalancerReady — Load balancer controller is installed
- TopographReady — Topology discovery is running
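For scripting or quick checks, the same information can be read non-interactively. This is a minimal sketch that assumes the conditions above follow the standard Kubernetes condition schema, so kubectl wait can match on them:
kubectl get slurmclusters <name> -n slurm -o jsonpath='{.status.phase}' # Print just the phase
kubectl wait slurmclusters/<name> -n slurm --for=condition=ControllerReady --timeout=10m # Block until the controller is up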
Node Set Readiness
The NodeSet status shows a readyReplicas field in "ready/total" format:
kubectl get nodesets -n slurm
NAME                    READY   AGE
slurm-worker-node-set   2/2     1h
Storage Configuration
Managed Slurm uses a shared filesystem for the /home directory, mounted across all login and worker nodes. This is backed by a PersistentVolumeClaim (PVC) using the Crusoe CSI driver.
StorageClass
The operator automatically creates a StorageClass named <cluster-name>-crusoe-csi-driver-fs-sc with:
- Provisioner: fs.csi.crusoe.ai
- Volume binding mode: WaitForFirstConsumer
- Volume expansion: Enabled
Viewing Storage
kubectl get pvc -n slurm # View persistent volume claims
kubectl get sc # View storage classes
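Because volume expansion is enabled on the StorageClass, the shared home volume can be grown (but not shrunk) by raising the requested size on its PVC. This is a sketch of the pattern only; the PVC name is a placeholder, so look it up with the command above first:
kubectl patch pvc <home-pvc-name> -n slurm --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"2Ti"}}}}' # Expansion only, per the "storage size only" rule below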
What the Operator Manages
The Crusoe Slurm Operator (CSO) continuously reconciles certain resources to keep your cluster in a healthy state. Understanding what CSO manages helps you know what's safe to modify and what will be reverted.
Reconciliation Reference
| Resource | Managed by CSO? | Safe to Modify? | Notes |
|---|---|---|---|
| SlurmUser CRs | No — customer-owned | Yes | This is the intended way to manage users |
| SlurmUserGroup CRs | No — customer-owned | Yes | This is the intended way to manage groups |
| Slinky CRDs (Controller, LoginSet, NodeSet) | Yes — full overwrite on every reconcile | No | Changes will be reverted automatically |
| gres-config ConfigMap | Yes — reconciled every cycle | No | Hardcoded to AutoDetect=nvidia |
| pyxis-config ConfigMap | Yes — reconciled every cycle (if Pyxis enabled) | No | Hardcoded default config |
| Auth Secrets (slurm-auth, jwt-auth) | Create-once — never regenerated | Do not modify | Breaking these breaks cluster authentication |
| nsscache ConfigMap | Yes — regenerated on user changes | No | Auto-managed from SlurmUser CRs |
| ssh-keys Secret | Yes — regenerated on user changes | No | Auto-managed from SlurmUser CRs |
| topology.conf ConfigMap | Created by CSO, then managed by Topograph | No | Topograph updates this automatically based on network topology |
| Home PVC | Validated but not overwritten after creation | Storage size only (increase) | Spec is immutable after initial creation |
| StorageClasses | Yes — reconciled every cycle | No | Provisioner and settings are fixed |
| Your own resources | Never | Yes | CSO ignores any resources it doesn't own |
Key takeaway: The resources you should interact with are SlurmUser and SlurmUserGroup CRs (see User Management). Everything else is either managed by CSO or managed via the Crusoe CLI/API.
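If you are unsure whether CSO owns a particular object, its metadata is usually a quick tell, since operator-managed objects typically carry an ownerReference pointing back at the cluster. The object name here comes from the table above, and the check itself is plain kubectl:
kubectl get configmap gres-config -n slurm -o jsonpath='{.metadata.ownerReferences[*].kind}' # Empty output suggests the object is not operator-owned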
Slurm Commands Reference
Once connected to a login node via SSH, use standard Slurm commands:
| Command | Description |
|---|---|
| sinfo | View cluster status and node information |
| sinfo -R | View nodes in drain state with reasons |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
| scontrol show node | View detailed node information |
| scontrol show config | View Slurm configuration |
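As a quick smoke test after logging in, these commands compose naturally; for example, submitting a trivial one-line job and following it through the queue:
sbatch --wrap="hostname" # Submits a minimal job and prints its job ID
squeue -u $USER # Job should move from PD (pending) to R (running) and then complete
cat slurm-<job_id>.out # Default output file name for sbatch jobs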
Running NCCL Tests
To validate multi-node GPU communication, run an NCCL all-reduce test. SSH into a login node and create the following script named nccl_test.batch. This example runs NCCL tests on H200 nodes. Different hardware types will have their own topo files. Note that you can find more test examples on your login node at /opt/examples:
#!/bin/bash
#SBATCH --job-name=nccl_tests
#SBATCH --nodes=<number-of-nodes>
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=20:00
#SBATCH --output="%x_%j.out"
#SBATCH --exclusive
export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h200-141gb-sxm-ib-cloud-hypervisor.xml
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
export NCCL_IB_MERGE_VFS=0
export NCCL_DEBUG=WARN
export OMPI_MCA_coll_hcoll_enable=0
export PMIX_MCA_gds='^ds12'
export UCX_NET_DEVICES="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
srun --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2
If you are not running H200 nodes, update the NCCL_TOPO_FILE path to the topology file for your GPU type.
Submit the test:
sbatch nccl_test.batch
Monitor with squeue, and check the output file nccl_tests_<job_id>.out once complete.
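To dig into the result, all_reduce_perf prints a per-message-size table and a bus bandwidth summary in that output file; a couple of simple ways to follow and skim it:
tail -f nccl_tests_<job_id>.out # Follow the test while it runs
grep -i "bandwidth" nccl_tests_<job_id>.out # Pull out the bandwidth summary lines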
To connect interactively to a worker node:
srun --pty bash # Any available worker
srun --nodelist=<worker-pod-name> --pty bash # A specific worker
Prolog and Epilog Scripts
Prolog and epilog scripts run automatically at the start and end of jobs. Managed Slurm supports two approaches: cluster-wide scripts configured by an admin, and per-job scripts specified by individual users.
Cluster-Wide Prolog/Epilog (Admin-Configured)
An admin places dispatcher scripts at a fixed path on the shared /home filesystem. Slurm calls them as root on every worker node at job start and end. The dispatcher can optionally call a per-user script if the user has created one.
Step 1 — Create the dispatcher scripts
From a login node:
mkdir -p /home/scripts
cat > /home/scripts/prolog.sh << 'EOF'
#!/bin/bash
USER_PROLOG="/home/${SLURM_JOB_USER}/prolog.sh"
if [ -x "$USER_PROLOG" ]; then
    su "$SLURM_JOB_USER" -s /bin/bash -c "$USER_PROLOG"
    EXIT_CODE=$?
    if [ $EXIT_CODE -ne 0 ]; then
        echo "$(date) job=$SLURM_JOB_ID node=$(hostname): prolog.sh failed with exit code $EXIT_CODE" \
            >> "/home/${SLURM_JOB_USER}/prolog-errors.log"
    fi
fi
exit 0
EOF
cat > /home/scripts/epilog.sh << 'EOF'
#!/bin/bash
USER_EPILOG="/home/${SLURM_JOB_USER}/epilog.sh"
if [ -x "$USER_EPILOG" ]; then
    su "$SLURM_JOB_USER" -s /bin/bash -c "$USER_EPILOG"
    EXIT_CODE=$?
    if [ $EXIT_CODE -ne 0 ]; then
        echo "$(date) job=$SLURM_JOB_ID node=$(hostname): epilog.sh failed with exit code $EXIT_CODE" \
            >> "/home/${SLURM_JOB_USER}/epilog-errors.log"
    fi
fi
exit 0
EOF
chmod 755 /home/scripts/prolog.sh /home/scripts/epilog.sh /home/scripts
Step 2 — Add Prolog/Epilog to the Controller CR
Edit the Controller CR and add these lines to spec.extraConf above the injected section markers (see Managing Partitions for details on editing spec.extraConf):
spec:
  extraConf: |
    Prolog=/home/scripts/prolog.sh
    Epilog=/home/scripts/epilog.sh
    # THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
    # ===============================START======================================
    ...
Step 3 — Reconfigure Slurm
scontrol reconfigure
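After the reconfigure, you can read the settings back from the running controller to confirm they were applied:
scontrol show config | grep -iE "^(Prolog|Epilog)" # Should list /home/scripts/prolog.sh and /home/scripts/epilog.sh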
Per-user scripts (optional)
Each user can create their own ~/prolog.sh and ~/epilog.sh. The dispatcher scripts above will call them automatically if they exist and are executable:
cat > ~/prolog.sh << 'EOF'
#!/bin/bash
echo "Job $SLURM_JOB_ID starting on $(hostname)" >> ~/prolog.log
exit 0
EOF
chmod +x ~/prolog.sh
If ~/prolog.sh is absent or not executable, it is silently skipped.
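A per-user epilog follows the same pattern; for example, logging completion and cleaning up a scratch directory (the scratch path is purely illustrative):
cat > ~/epilog.sh << 'EOF'
#!/bin/bash
echo "Job $SLURM_JOB_ID finished on $(hostname)" >> ~/epilog.log
rm -rf "/tmp/scratch-${SLURM_JOB_ID}" # Illustrative per-job scratch cleanup
exit 0
EOF
chmod +x ~/epilog.sh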
Key behaviors:
- Runs once per node (not per task), as root
- User scripts run as the submitting user via su
- SLURM_* environment variables are available
- Script output goes to /var/log/slurm/slurmd.log on each worker, not the job output file — redirect to a file in /home if you need visible output
- Non-zero exit from the cluster-wide script requeues the job (JobRequeue=1 is set by default)
Per-Job Task Prolog/Epilog (User-Configured)
Users can specify prolog and epilog scripts per job using --task-prolog and --task-epilog. No admin setup is required.
In an sbatch script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --task-prolog=/home/alice/setup.sh
#SBATCH --task-epilog=/home/alice/teardown.sh
srun ./myprogram
Or inline with srun:
srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram
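The setup.sh and teardown.sh referenced above are ordinary user-owned scripts. As a hedged sketch (paths and contents are illustrative, not a required layout), a task prolog might stage a scratch directory and export it into the task environment via Slurm's export-line convention for task prolog stdout:
cat > /home/alice/setup.sh << 'EOF'
#!/bin/bash
# Runs once per task, as the submitting user
mkdir -p "/tmp/scratch-${SLURM_JOB_ID}-${SLURM_PROCID}"
# An "export NAME=value" line on stdout is added to the task's environment
echo "export SCRATCH_DIR=/tmp/scratch-${SLURM_JOB_ID}-${SLURM_PROCID}"
EOF
chmod +x /home/alice/setup.sh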
Key behaviors:
- Runs once per task (not per node), as the submitting user
- stdout appears in the job's output file
- Non-zero exit logs a warning but does not requeue the job
- No admin involvement required
Execution Order
When both cluster-wide and per-job scripts are configured, they execute in this order:
Job allocated
↓
Prolog=/home/scripts/prolog.sh (per node, as root)
└─ ~/prolog.sh if present (per node, as user)
↓
--task-prolog (per task, as user)
↓
srun ./myprogram
↓
--task-epilog (per task, as user)
↓
Epilog=/home/scripts/epilog.sh (per node, as root)
└─ ~/epilog.sh if present (per node, as user)
Automatic Hardware Remediation
Managed Slurm clusters run on Crusoe Managed Kubernetes with AutoClusters enabled. AutoClusters automatically detects critical hardware failures — such as a GPU or HCA falling off the bus — and remediates them without manual intervention.
What Happens During Remediation
- Crusoe's monitoring pipeline detects a hardware issue on a worker node
- The affected Slurm node is set to DOWN, which immediately cancels any running job on that node — the job process receives a SIGTERM signal before being terminated
- You have up to 2 minutes to handle the SIGTERM (save checkpoints, flush logs, etc.) before the node is replaced
- The cancelled job is automatically requeued (JobRequeue=1 is enabled by default) and runs on a healthy node
Handling SIGTERM in Your Jobs
When a node goes down, Slurm sends SIGTERM to your job process. You can use trap to catch this signal and perform cleanup before the job is cancelled. For example:
#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --nodes=1
#SBATCH --output=%x-%j.out
trap 'echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] SIGTERM received — saving checkpoint"; save_checkpoint' SIGTERM
run_training &
wait $!
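You can exercise the handler without waiting for a real hardware event by delivering SIGTERM to the batch step yourself; this uses scancel's --signal and --batch options:
scancel --signal=TERM --batch <job_id> # Sends SIGTERM to the batch script so the trap fires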
Next Steps
- Quickstart — Create your first Slurm cluster
- User Management — Add users and groups, manage partitions
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Slurm Metrics — Monitor cluster health and performance
- For Slurm command reference, see the official Slurm documentation
Support
If you encounter issues or need assistance, contact Crusoe Cloud Support.