Advanced: Kubernetes Operations
Managed Slurm clusters run on Crusoe Managed Kubernetes (CMK). While the CLI and API handle most operations, you can also interact with the cluster directly via kubectl for advanced configuration, troubleshooting, and user management.
Accessing Your Cluster via kubectl
Configure kubectl to connect to your cluster:
crusoe kubernetes clusters get-credentials <cluster-name>
Verify the connection:
kubectl cluster-info
Viewing Cluster State
Slurm Custom Resources
The Crusoe Slurm Operator (CSO) uses four Custom Resource types to manage your cluster. When you create a Slurm cluster through the Slurm UI or the crusoe slurm CLI, Crusoe creates, controls, and reconciles these resources in the slurm namespace.
kubectl get slurmclusters -n slurm # Clusters
kubectl get nodesets -n slurm # Node sets
kubectl get slurmusers -n slurm # All users
kubectl get slurmusergroups -n slurm # All groups
For detailed status with conditions:
kubectl describe slurmclusters <name> -n slurm
Use kubectl explain for field-level documentation on any Custom Resource. The CRDs have built-in descriptions for every field:
kubectl explain slurmclusters.spec
kubectl explain slurmusers.spec
kubectl explain slurmusergroups.spec
kubectl explain nodesets.spec
Cluster Phase and Conditions
The SlurmCluster status includes a phase field (possible values: Provisioning, Installing, Ready, NotReady, CreateFailed, Deleting, DeleteFailed) and detailed conditions for each component. Use kubectl describe to see what's happening if your cluster isn't healthy:
kubectl describe slurmclusters <name> -n slurm
Look for conditions like:
- ControllerReady: Slurm controller pod is running
- LoginReady: Login nodes are running
- SlinkyReady: Slinky Helm chart is installed
- CertManagerReady: cert-manager is installed
- LoadBalancerReady: Load balancer controller is installed
- TopographReady: Topology discovery is running
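If you script against the cluster, the phase values above can drive a simple wait loop. The following is a minimal sketch, not an official tool: the function names classify_phase and wait_for_ready, the default of 60 attempts, and the 10-second delay are illustrative assumptions. It reads only the status.phase field described above.

```shell
# classify_phase PHASE
# Maps a SlurmCluster phase to an exit status:
#   0 = Ready
#   2 = CreateFailed or DeleteFailed
#   1 = any other phase (still changing, or NotReady)
classify_phase() {
  case "$1" in
    Ready) return 0 ;;
    CreateFailed|DeleteFailed) return 2 ;;
    *) return 1 ;;
  esac
}

# wait_for_ready NAME [ATTEMPTS] [DELAY_SECONDS]
# Polls status.phase until the cluster is Ready, has failed, or attempts run out.
# The defaults (60 attempts, 10 seconds apart) are arbitrary example values.
wait_for_ready() {
  name=$1
  attempts=${2:-60}
  delay=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get slurmclusters "$name" -n slurm -o jsonpath='{.status.phase}')
    echo "phase: ${phase:-unknown}"
    rc=0
    classify_phase "$phase" || rc=$?
    if [ "$rc" -ne 1 ]; then return "$rc"; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_ready <name> returns 0 once the phase is Ready and 2 if the cluster enters CreateFailed or DeleteFailed, at which point kubectl describe shows the individual conditions.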
Node Set Readiness
The NodeSet status shows a readyReplicas field in "ready/total" format:
kubectl get nodesets -n slurm
NAME READY AGE
slurm-worker-node-set 2/2 1h
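To flag node sets that are not fully ready, you can filter this output. The sketch below assumes the three-column layout shown above (NAME, READY, AGE); the function name not_ready_nodesets is illustrative and not part of any Crusoe tooling.

```shell
# not_ready_nodesets
# Reads `kubectl get nodesets` output on stdin and prints each node set
# whose READY column shows fewer ready replicas than the total.
not_ready_nodesets() {
  awk '
    $1 == "NAME" { next }   # skip the header row
    NF < 2       { next }   # skip blank or malformed lines
    {
      split($2, r, "/")     # "2/2" -> r[1] = ready, r[2] = total
      if (r[1] + 0 < r[2] + 0)
        print $1 ": " r[1] " of " r[2] " ready"
    }'
}

# Usage against a live cluster:
#   kubectl get nodesets -n slurm | not_ready_nodesets
```

The function prints nothing when every node set is fully ready, so an empty result can be treated as healthy.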
Storage Configuration
Managed Slurm uses a shared filesystem for the /home directory, mounted across all login and worker nodes. This is backed by a PersistentVolumeClaim (PVC) using the Crusoe CSI driver.
StorageClass
The operator automatically creates a StorageClass named <cluster-name>-crusoe-csi-driver-fs-sc with:
- Provisioner: fs.csi.crusoe.ai
- Volume binding mode: WaitForFirstConsumer
- Volume expansion: Enabled
Viewing Storage
kubectl get pvc -n slurm # View persistent volume claims
kubectl get sc # View storage classes
What the Operator Manages
The Crusoe Slurm Operator (CSO) continuously reconciles certain resources to keep your cluster in a healthy state. Understanding what CSO manages helps you know what's safe to modify and what will be reverted.
Reconciliation Reference
| Resource | Managed by CSO? | Safe to Modify? | Notes |
|---|---|---|---|
| SlurmUser CRs | No, customer-owned | Yes | This is the intended way to manage users |
| SlurmUserGroup CRs | No, customer-owned | Yes | This is the intended way to manage groups |
| SlurmClusterHealthCheck (-custom) | Created once, never overwritten | Yes | Add your own health/prolog/epilog checks here (see Node Health Checks) |
| SlurmClusterHealthCheck (-defaults) | Yes, seeded and owned by CSO | No | Crusoe-provided node health checks |
| Slinky CRDs (Controller, LoginSet, NodeSet) | Yes, full overwrite on every reconcile | No | Changes will be reverted automatically |
| gres-config ConfigMap | Yes, reconciled every cycle | No | Hardcoded to AutoDetect=nvidia |
| plugstack-config ConfigMap | Yes, reconciled every cycle | No | Loads the Pyxis SPANK plugin via plugstack.conf |
| Auth Secrets (slurm-auth, jwt-auth) | Created once, never regenerated | Do not modify | Breaking these breaks cluster authentication |
| nsscache ConfigMap | Yes, regenerated on user changes | No | Auto-managed from SlurmUser CRs |
| ssh-keys Secret | Yes, regenerated on user changes | No | Auto-managed from SlurmUser CRs |
| topology.conf ConfigMap | Created by CSO, then managed by Topograph | No | Topograph updates this automatically based on network topology |
| Home PVC | Validated but not overwritten after creation | Storage size only (increase) | Spec is immutable after initial creation |
| StorageClasses | Yes, reconciled every cycle | No | Provisioner and settings are fixed |
| Your own resources | Never | Yes | CSO ignores any resources it doesn't own |
Key takeaway: The resources you should interact with are SlurmUser and SlurmUserGroup CRs (see User Management) and the SlurmClusterHealthCheck -custom object (see Node Health Checks). Everything else is either managed by CSO or managed via the Crusoe CLI/API.
Slurm Commands Reference
Once connected to a login node via SSH, use standard Slurm commands:
| Command | Description |
|---|---|
| sinfo | View cluster status and node information |
| sinfo -R | View nodes in drain state with reasons |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
| scontrol show node | View detailed node information |
| scontrol show config | View Slurm configuration |
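These commands compose well with standard shell tools. As a rough sketch, the helper below summarizes how many nodes are in each state; the function name node_state_counts is illustrative. It expects one "node state" pair per line, which sinfo can produce with its -h (no header), -N (node-oriented), and -o (format) options. A node that belongs to several partitions may be counted more than once.

```shell
# node_state_counts
# Reads "<node> <state>" lines on stdin and prints "<state> <count>",
# sorted by state name.
node_state_counts() {
  awk 'NF >= 2 { count[$2]++ }
       END     { for (s in count) print s, count[s] }' | sort
}

# Usage on a login node:
#   sinfo -h -N -o "%N %T" | node_state_counts
```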
Running NCCL Tests
To validate multi-node GPU communication, run an NCCL all-reduce test. SSH into a login node and create the following script, named nccl_test.batch. This example targets H200 nodes; other hardware types have their own topology files. More test examples are available on your login node at /opt/examples:
#!/bin/bash
#SBATCH --job-name=nccl_tests
#SBATCH --nodes=<number-of-nodes>
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=20:00
#SBATCH --output="%x_%j.out"
#SBATCH --exclusive
export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h200-141gb-sxm-ib-cloud-hypervisor.xml
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
export NCCL_IB_MERGE_VFS=0
export NCCL_DEBUG=WARN
export OMPI_MCA_coll_hcoll_enable=0
export PMIX_MCA_gds='^ds12'
export UCX_NET_DEVICES="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
srun --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2
Update the NCCL_TOPO_FILE path to match your GPU type. The example above is for H200 nodes.
Submit the test:
sbatch nccl_test.batch
Monitor with squeue, and check the output file nccl_tests_<job_id>.out once complete.
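To pull the headline number out of the output file, you can extract the summary line that nccl-tests prints at the end of a run. This sketch assumes the summary is formatted as "# Avg bus bandwidth : <value>" (reported in GB/s); the function name avg_busbw is illustrative.

```shell
# avg_busbw [FILE...]
# Prints the average bus bandwidth reported by an nccl-tests run.
# Reads the named files, or stdin when no file is given.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ {
             gsub(/[ \t]/, "", $2)   # strip padding around the value
             print $2
           }' "$@"
}

# Usage:
#   avg_busbw nccl_tests_<job_id>.out
```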
To connect interactively to a worker node:
srun --pty bash # Any available worker
srun --nodelist=<worker-pod-name> --pty bash # A specific worker
Health Checks, Prolog, and Epilog
Managed Slurm ships a built-in node health-check suite (periodic, prolog, and epilog) and lets you add your own health, prolog, and epilog checks through the SlurmClusterHealthCheck custom resource. See Node Health Checks.
Automatic Hardware Remediation
Managed Slurm clusters run on Crusoe Managed Kubernetes with AutoClusters enabled. AutoClusters automatically detects critical hardware failures — such as a GPU or HCA falling off the bus — and remediates them without manual intervention.
What Happens During Remediation
- Crusoe's monitoring pipeline detects a hardware issue on a worker node
- The affected Slurm node is set to DOWN, which immediately cancels any running job on that node — the job process receives a SIGTERM signal before being terminated
- You have up to 2 minutes to handle the SIGTERM (save checkpoints, flush logs, etc.) before the node is replaced
- The cancelled job is automatically requeued (JobRequeue=1 is enabled by default) and runs on a healthy node
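Because a requeued job starts its batch script again from the top, the script needs a way to tell a first run from a restart. Slurm sets the SLURM_RESTART_COUNT environment variable on requeued jobs. The sketch below uses it to decide whether to resume from a checkpoint; the --resume-from flag, the CHECKPOINT_DIR variable, and the checkpoints/latest path are hypothetical placeholders for whatever your training code expects.

```shell
# resume_args
# Prints extra arguments for a hypothetical training command.
# On a first run (SLURM_RESTART_COUNT unset or 0) it prints nothing;
# after a requeue it points at the most recent checkpoint.
resume_args() {
  if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    # CHECKPOINT_DIR and the "latest" name are assumptions for illustration
    echo "--resume-from ${CHECKPOINT_DIR:-$HOME/checkpoints}/latest"
  fi
}

# Usage inside a batch script (run_training is a placeholder):
#   run_training $(resume_args)
```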
Handling SIGTERM in Your Jobs
When a node goes down, Slurm sends SIGTERM to your job process. You can use trap to catch this signal and perform cleanup before the job is cancelled. For example:
#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --nodes=1
#SBATCH --output=%x-%j.out
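# save_checkpoint and run_training are placeholders for your own commands.
# The handler runs when the batch script receives SIGTERM.
trap 'echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] SIGTERM received, saving checkpoint"; save_checkpoint' SIGTERM
# Run the workload in the background and wait on it: bash defers trap handlers
# while a foreground command is running, but the wait builtin is interrupted immediately.
run_training &
wait $!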
Next Steps
- Quickstart — Create your first Slurm cluster
- User Management — Add users and groups, manage partitions
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Node Health Checks — Built-in health checks and adding your own health/prolog/epilog checks
- Slurm Metrics — Monitor cluster health and performance
- For Slurm command reference, see the official Slurm documentation
Support
If you encounter issues or need assistance, contact Crusoe Cloud Support.