Advanced: Kubernetes Operations

Managed Slurm clusters run on Crusoe Managed Kubernetes (CMK). While the CLI and API handle most operations, you can also interact with the cluster directly via kubectl for advanced configuration, troubleshooting, and user management.

Accessing Your Cluster via kubectl

Configure kubectl to connect to your cluster:

crusoe kubernetes clusters get-credentials <cluster-name>

Verify the connection:

kubectl cluster-info

Viewing Cluster State

Slurm Custom Resources

The Crusoe Slurm Operator (CSO) uses four Custom Resource types to manage your cluster. When you create a Slurm cluster through the Slurm UI or the crusoe slurm CLI, Crusoe creates, controls, and reconciles these resources in the slurm namespace.

kubectl get slurmclusters -n slurm # Clusters
kubectl get nodesets -n slurm # Node sets
kubectl get slurmusers -n slurm # All users
kubectl get slurmusergroups -n slurm # All groups

For detailed status with conditions:

kubectl describe slurmclusters <name> -n slurm

Tip: Use kubectl explain for field-level documentation on any Custom Resource. The CRDs have built-in descriptions for every field:

kubectl explain slurmclusters.spec
kubectl explain slurmusers.spec
kubectl explain slurmusergroups.spec
kubectl explain nodesets.spec

Cluster Phase and Conditions

The SlurmCluster status includes a phase field (possible values: Provisioning, Installing, Ready, NotReady, CreateFailed, Deleting, DeleteFailed) and detailed conditions for each component. Use kubectl describe to see what's happening if your cluster isn't healthy:

kubectl describe slurmclusters <name> -n slurm

Look for conditions like:

  • ControllerReady — Slurm controller pod is running
  • LoginReady — Login nodes are running
  • SlinkyReady — Slinky Helm chart is installed
  • CertManagerReady — cert-manager is installed
  • LoadBalancerReady — Load balancer controller is installed
  • TopographReady — Topology discovery is running
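
The phase and condition checks above can also be scripted. The following is a minimal sketch, not an official tool. It assumes the phase is exposed at .status.phase and that conditions follow the usual Kubernetes shape (.status.conditions[] with type and status fields); confirm both with kubectl explain slurmclusters.status before relying on it. The function names are illustrative.

```shell
#!/bin/bash
# Sketch only: the field paths .status.phase and .status.conditions are assumptions.

slurm_phase() {
  # Prints the current phase of the named SlurmCluster.
  kubectl get slurmclusters "$1" -n slurm -o jsonpath='{.status.phase}'
}

unready_conditions() {
  # Prints "TYPE STATUS" for every condition whose status is not True.
  kubectl get slurmclusters "$1" -n slurm \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}' |
    awk '$2 != "True" { print $1, $2 }'
}

wait_for_slurm_ready() {
  # Usage: wait_for_slurm_ready NAME [ATTEMPTS] [INTERVAL_SECONDS]
  # Returns 0 on Ready, 1 on a failure phase, 2 on timeout.
  local name="$1" attempts="${2:-60}" interval="${3:-10}" phase="" i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(slurm_phase "$name")"
    case "$phase" in
      Ready)
        echo "cluster $name is Ready"
        return 0 ;;
      CreateFailed|DeleteFailed)
        echo "cluster $name is in phase $phase" >&2
        unready_conditions "$name" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $name (last phase: ${phase:-unknown})" >&2
  return 2
}
```

For example, wait_for_slurm_ready <name> 30 20 polls every 20 seconds for up to 10 minutes, and prints any condition that is not True if the cluster lands in a failure phase.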

Node Set Readiness

The NodeSet status includes a readyReplicas field, which kubectl get displays in the READY column in "ready/total" format:

kubectl get nodesets -n slurm
NAME                    READY   AGE
slurm-worker-node-set   2/2     1h
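
To flag node sets that are not fully ready, the READY value can be split on the slash and compared. This is a small sketch that assumes the three-column NAME READY AGE layout shown above; the function names are illustrative.

```shell
#!/bin/bash
# Sketch only: assumes rows of the form "NAME READY AGE", as in the sample output.

nodeset_is_ready() {
  # Succeeds when READY is "n/n" with n greater than zero.
  local ready="${1%%/*}" total="${1##*/}"
  [ "$total" -gt 0 ] && [ "$ready" -eq "$total" ]
}

report_nodesets() {
  # Reads node set rows on stdin; returns 1 if any node set is not fully ready.
  local name ready age rc=0
  while read -r name ready age; do
    if nodeset_is_ready "$ready"; then
      echo "$name: all replicas ready ($ready)"
    else
      echo "$name: NOT fully ready ($ready)"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: kubectl get nodesets -n slurm --no-headers | report_nodesets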

Storage Configuration

Managed Slurm uses a shared filesystem for the /home directory, mounted across all login and worker nodes. This is backed by a PersistentVolumeClaim (PVC) using the Crusoe CSI driver.

StorageClass

The operator automatically creates a StorageClass named <cluster-name>-crusoe-csi-driver-fs-sc with:

  • Provisioner: fs.csi.crusoe.ai
  • Volume binding mode: WaitForFirstConsumer
  • Volume expansion: Enabled
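
To confirm these settings on a live cluster, the StorageClass fields can be read directly. A minimal sketch: provisioner, volumeBindingMode, and allowVolumeExpansion are standard StorageClass fields, and the function name is illustrative.

```shell
#!/bin/bash
# Sketch only: compares a cluster's StorageClass against the settings listed above.

check_storage_class() {
  # Usage: check_storage_class CLUSTER_NAME
  local cluster="$1"
  local sc="${cluster}-crusoe-csi-driver-fs-sc"
  local actual
  actual="$(kubectl get sc "$sc" \
    -o jsonpath='{.provisioner} {.volumeBindingMode} {.allowVolumeExpansion}')" || return 2
  if [ "$actual" = "fs.csi.crusoe.ai WaitForFirstConsumer true" ]; then
    echo "$sc matches the documented settings"
  else
    echo "$sc differs: $actual" >&2
    return 1
  fi
}
```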

Viewing Storage

kubectl get pvc -n slurm # View persistent volume claims
kubectl get sc # View storage classes

What the Operator Manages

The Crusoe Slurm Operator (CSO) continuously reconciles certain resources to keep your cluster in a healthy state. Understanding what CSO manages helps you know what's safe to modify and what will be reverted.

Reconciliation Reference

| Resource | Managed by CSO? | Safe to Modify? | Notes |
| --- | --- | --- | --- |
| SlurmUser CRs | No (customer-owned) | Yes | This is the intended way to manage users |
| SlurmUserGroup CRs | No (customer-owned) | Yes | This is the intended way to manage groups |
| SlurmClusterHealthCheck (-custom) | Created once, never overwritten | Yes | Add your own health/prolog/epilog checks here (see Node Health Checks) |
| SlurmClusterHealthCheck (-defaults) | Yes (seeded and owned by CSO) | No | Crusoe-provided node health checks |
| Slinky CRDs (Controller, LoginSet, NodeSet) | Yes (full overwrite on every reconcile) | No | Changes will be reverted automatically |
| gres-config ConfigMap | Yes (reconciled every cycle) | No | Hardcoded to AutoDetect=nvidia |
| plugstack-config ConfigMap | Yes (reconciled every cycle) | No | Loads the Pyxis SPANK plugin via plugstack.conf |
| Auth Secrets (slurm-auth, jwt-auth) | Create-once, never regenerated | Do not modify | Breaking these breaks cluster authentication |
| nsscache ConfigMap | Yes (regenerated on user changes) | No | Auto-managed from SlurmUser CRs |
| ssh-keys Secret | Yes (regenerated on user changes) | No | Auto-managed from SlurmUser CRs |
| topology.conf ConfigMap | Created by CSO, then managed by Topograph | No | Topograph updates this automatically based on network topology |
| Home PVC | Validated but not overwritten after creation | Storage size only (increase) | Spec is immutable after initial creation |
| StorageClasses | Yes (reconciled every cycle) | No | Provisioner and settings are fixed |
| Your own resources | Never | Yes | CSO ignores any resources it doesn't own |

Key takeaway: The resources you should interact with are SlurmUser and SlurmUserGroup CRs (see User Management) and the SlurmClusterHealthCheck -custom object (see Node Health Checks). Everything else is either managed by CSO or managed via the Crusoe CLI/API.
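
The one permitted change to the Home PVC is a storage size increase. A hedged sketch of how that could be done with a standard kubectl merge patch follows. The PVC name and target size are placeholders (look up the real name with kubectl get pvc -n slurm), and the helper names are illustrative rather than part of any Crusoe tooling.

```shell
#!/bin/bash
# Sketch only: PVC name and size are supplied by the caller.

pvc_size_patch() {
  # Prints a JSON merge patch requesting the given size, for example 2Ti.
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

expand_home_pvc() {
  # Usage: expand_home_pvc PVC_NAME NEW_SIZE
  # Kubernetes rejects requests that shrink a volume, so NEW_SIZE must be larger.
  local pvc="$1" size="$2"
  kubectl patch pvc "$pvc" -n slurm --type merge -p "$(pvc_size_patch "$size")"
}
```

After patching, kubectl get pvc -n slurm shows the new capacity once the expansion completes.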

Slurm Commands Reference

Once connected to a login node via SSH, use standard Slurm commands:

| Command | Description |
| --- | --- |
| sinfo | View cluster status and node information |
| sinfo -R | View nodes in drain state with reasons |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
| scontrol show node | View detailed node information |
| scontrol show config | View Slurm configuration |

Running NCCL Tests

To validate multi-node GPU communication, run an NCCL all-reduce test. SSH into a login node and create the following script, named nccl_test.batch. This example targets H200 nodes; other hardware types use their own topology files. More test examples are available on your login node in /opt/examples:

#!/bin/bash

#SBATCH --job-name=nccl_tests
#SBATCH --nodes=<number-of-nodes>
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=20:00
#SBATCH --output="%x_%j.out"
#SBATCH --exclusive

export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h200-141gb-sxm-ib-cloud-hypervisor.xml
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
export NCCL_IB_MERGE_VFS=0
export NCCL_DEBUG=WARN

export OMPI_MCA_coll_hcoll_enable=0
export PMIX_MCA_gds='^ds12'

export UCX_NET_DEVICES="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"

srun --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2

Note: Update the NCCL_TOPO_FILE path to match your GPU type. The example above is for H200 nodes.

Submit the test:

sbatch nccl_test.batch

Monitor with squeue, and check the output file nccl_tests_<job_id>.out once complete.
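
The submit-and-monitor steps can be combined into one helper. This is a sketch under stated assumptions: it relies on sbatch --parsable printing the job ID (optionally followed by ";cluster") and on squeue -h -j printing nothing once the job has left the queue. The output file name follows the job name and --output pattern of the NCCL script above; the function name is illustrative.

```shell
#!/bin/bash
# Sketch only: submits a batch script, waits for it to leave the queue,
# then prints the expected output file name.

submit_and_wait() {
  # Usage: submit_and_wait SCRIPT [POLL_INTERVAL_SECONDS]
  local script="$1" interval="${2:-15}" jobid
  jobid="$(sbatch --parsable "$script")" || return 1
  jobid="${jobid%%;*}"   # drop the optional ";cluster" suffix
  while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
    sleep "$interval"
  done
  echo "nccl_tests_${jobid}.out"
}
```

Usage: cat "$(submit_and_wait nccl_test.batch)"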

To connect interactively to a worker node:

srun --pty bash # Any available worker
srun --nodelist=<worker-pod-name> --pty bash # A specific worker

Health Checks, Prolog, and Epilog

Managed Slurm ships a built-in node health-check suite (periodic, prolog, and epilog) and lets you add your own health, prolog, and epilog checks through the SlurmClusterHealthCheck custom resource. See Node Health Checks.

Automatic Hardware Remediation

Managed Slurm clusters run on Crusoe Managed Kubernetes with AutoClusters enabled. AutoClusters automatically detects critical hardware failures — such as a GPU or HCA falling off the bus — and remediates them without manual intervention.

What Happens During Remediation

  1. Crusoe's monitoring pipeline detects a hardware issue on a worker node
  2. The affected Slurm node is set to DOWN, which immediately cancels any running job on that node — the job process receives a SIGTERM signal before being terminated
  3. You have up to 2 minutes to handle the SIGTERM (save checkpoints, flush logs, etc.) before the node is replaced
  4. The cancelled job is automatically requeued (JobRequeue=1 is enabled by default) and runs on a healthy node

Handling SIGTERM in Your Jobs

When a node goes down, Slurm sends SIGTERM to your job process. You can use trap to catch this signal and perform cleanup before the job is cancelled. For example:

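#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --nodes=1
#SBATCH --output=%x-%j.out

# run_training and save_checkpoint are placeholders for your own commands.
# On SIGTERM, log a UTC timestamp and save a checkpoint.
trap 'echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] SIGTERM received, saving checkpoint"; save_checkpoint' SIGTERM

# Run training in the background and wait on it, so the shell can run the
# trap as soon as the signal arrives instead of after training exits.
run_training &
wait $!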
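
The trap-and-wait pattern can be tried locally, without Slurm, before relying on it in a real job. The sketch below is a stand-in, not Crusoe's remediation flow: sleep 30 plays the role of the training process, the echo plays the role of checkpointing, and the function sends SIGTERM itself after one second.

```shell
#!/bin/bash
# Sketch only: local demonstration of handling SIGTERM while a background
# workload is running. No Slurm commands are involved.

demo_sigterm_trap() {
  local job_pid
  bash -c '
    on_term() {
      echo "SIGTERM received, saving checkpoint"
      kill "$work_pid"   # stop the stand-in workload
      exit 0
    }
    trap on_term TERM
    sleep 30 &           # stand-in for the training process
    work_pid=$!
    wait "$work_pid"     # wait is interruptible, so the trap runs promptly
  ' &
  job_pid=$!
  sleep 1                # give the child time to install its trap
  kill -TERM "$job_pid"  # simulate the signal sent when the node goes down
  wait "$job_pid"
}
```

Running the workload in the background matters: bash defers a trap until the current foreground command finishes, so a foreground training process would delay the handler until training exits, while wait returns as soon as the signal arrives.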

Support

If you encounter issues or need assistance, contact Crusoe Cloud Support.