
Advanced: Kubernetes Operations

Managed Slurm clusters run on Crusoe Managed Kubernetes (CMK). While the CLI and API handle most operations, you can also interact with the cluster directly via kubectl for advanced configuration, troubleshooting, and user management.

Accessing Your Cluster via kubectl

Configure kubectl to connect to your cluster:

crusoe kubernetes clusters get-credentials <cluster-name>

Verify the connection:

kubectl cluster-info

Viewing Cluster State

Slurm Custom Resources

The Crusoe Slurm Operator (CSO) uses four Custom Resource types to manage your cluster. When you create a Slurm cluster through the Slurm UI or the crusoe slurm CLI, the corresponding custom resources are created, controlled, and reconciled by Crusoe in the slurm namespace.

kubectl get slurmclusters -n slurm        # Clusters
kubectl get nodesets -n slurm             # Node sets
kubectl get slurmusers -n slurm           # All users
kubectl get slurmusergroups -n slurm      # All groups

For detailed status with conditions:

kubectl describe slurmclusters <name> -n slurm
tip

Use kubectl explain for field-level documentation on any Custom Resource. The CRDs have built-in descriptions for every field:

kubectl explain slurmclusters.spec
kubectl explain slurmusers.spec
kubectl explain slurmusergroups.spec
kubectl explain nodesets.spec

Cluster Phase and Conditions

The SlurmCluster status includes a phase field (possible values: Provisioning, Installing, Ready, NotReady, CreateFailed, Deleting, DeleteFailed) and detailed conditions for each component. Use kubectl describe to see what's happening if your cluster isn't healthy:

kubectl describe slurmclusters <name> -n slurm

Look for conditions like:

  • ControllerReady — Slurm controller pod is running
  • LoginReady — Login nodes are running
  • SlinkyReady — Slinky Helm chart is installed
  • CertManagerReady — cert-manager is installed
  • LoadBalancerReady — Load balancer controller is installed
  • TopographReady — Topology discovery is running
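
If you are scripting health checks, individual status fields can be pulled with jsonpath. The field paths below assume the status layout described above (a phase string plus a standard Kubernetes conditions list with type/status pairs); confirm them with kubectl explain slurmclusters.status:

kubectl get slurmclusters <name> -n slurm -o jsonpath='{.status.phase}'
kubectl get slurmclusters <name> -n slurm -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'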

Node Set Readiness

The NodeSet status shows a readyReplicas field in "ready/total" format:

kubectl get nodesets -n slurm
NAME                    READY   AGE
slurm-worker-node-set   2/2     1h
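
To watch readiness change in real time, for example while worker nodes are provisioning, add kubectl's watch flag:

kubectl get nodesets -n slurm -w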

Storage Configuration

Managed Slurm uses a shared filesystem for the /home directory, mounted across all login and worker nodes. This is backed by a PersistentVolumeClaim (PVC) using the Crusoe CSI driver.

StorageClass

The operator automatically creates a StorageClass named <cluster-name>-crusoe-csi-driver-fs-sc with:

  • Provisioner: fs.csi.crusoe.ai
  • Volume binding mode: WaitForFirstConsumer
  • Volume expansion: Enabled

Viewing Storage

kubectl get pvc -n slurm         # View persistent volume claims
kubectl get sc                   # View storage classes
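
Because volume expansion is enabled on the StorageClass, the shared /home volume can be grown (never shrunk) by raising the requested size on the PVC. The PVC name and target size below are placeholders; find the actual claim name with kubectl get pvc -n slurm:

kubectl patch pvc <home-pvc-name> -n slurm --type merge -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'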

What the Operator Manages

The Crusoe Slurm Operator (CSO) continuously reconciles certain resources to keep your cluster in a healthy state. Understanding what CSO manages helps you know what's safe to modify and what will be reverted.

Reconciliation Reference

| Resource | Managed by CSO? | Safe to Modify? | Notes |
| --- | --- | --- | --- |
| SlurmUser CRs | No — customer-owned | Yes | This is the intended way to manage users |
| SlurmUserGroup CRs | No — customer-owned | Yes | This is the intended way to manage groups |
| Slinky CRDs (Controller, LoginSet, NodeSet) | Yes — full overwrite on every reconcile | No | Changes will be reverted automatically |
| gres-config ConfigMap | Yes — reconciled every cycle | No | Hardcoded to AutoDetect=nvidia |
| pyxis-config ConfigMap | Yes — reconciled every cycle (if Pyxis enabled) | No | Hardcoded default config |
| Auth Secrets (slurm-auth, jwt-auth) | Create-once — never regenerated | Do not modify | Breaking these breaks cluster authentication |
| nsscache ConfigMap | Yes — regenerated on user changes | No | Auto-managed from SlurmUser CRs |
| ssh-keys Secret | Yes — regenerated on user changes | No | Auto-managed from SlurmUser CRs |
| topology.conf ConfigMap | Created by CSO, then managed by Topograph | No | Topograph updates this automatically based on network topology |
| Home PVC | Validated but not overwritten after creation | Storage size only (increase) | Spec is immutable after initial creation |
| StorageClasses | Yes — reconciled every cycle | No | Provisioner and settings are fixed |
| Your own resources | Never | Yes | CSO ignores any resources it doesn't own |
info

Key takeaway: The resources you should interact with are SlurmUser and SlurmUserGroup CRs (see User Management). Everything else is either managed by CSO or managed via the Crusoe CLI/API.

Slurm Commands Reference

Once connected to a login node via SSH, use standard Slurm commands:

| Command | Description |
| --- | --- |
| sinfo | View cluster status and node information |
| sinfo -R | View nodes in drain state with reasons |
| squeue | View the job queue |
| sbatch | Submit a batch job |
| srun | Run a job interactively |
| scancel | Cancel a job |
| scontrol show node | View detailed node information |
| scontrol show config | View Slurm configuration |

Running NCCL Tests

To validate multi-node GPU communication, run an NCCL all-reduce test. SSH into a login node and create the following script named nccl_test.batch. This example runs NCCL tests on H200 nodes. Different hardware types will have their own topo files. Note that you can find more test examples on your login node at /opt/examples:

#!/bin/bash

#SBATCH --job-name=nccl_tests
#SBATCH --nodes=<number-of-nodes>
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=20:00
#SBATCH --output="%x_%j.out"
#SBATCH --exclusive

export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h200-141gb-sxm-ib-cloud-hypervisor.xml
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
export NCCL_IB_MERGE_VFS=0
export NCCL_DEBUG=WARN

export OMPI_MCA_coll_hcoll_enable=0
export PMIX_MCA_gds='^ds12'

export UCX_NET_DEVICES="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"

srun --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2
note

Update the NCCL_TOPO_FILE path to match your GPU type. The example above is for H200 nodes.

Submit the test:

sbatch nccl_test.batch

Monitor with squeue, and check the output file nccl_tests_<job_id>.out once complete.
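
The all_reduce_perf output ends with a bandwidth summary line; assuming the standard nccl-tests output format, you can pull it with:

grep -i "avg bus bandwidth" nccl_tests_<job_id>.out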

To connect interactively to a worker node:

srun --pty bash                              # Any available worker
srun --nodelist=<worker-pod-name> --pty bash # A specific worker
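
If the interactive session needs GPUs allocated, the standard Slurm resource flags apply, for example:

srun --gpus-per-node=8 --pty bash            # Interactive shell with 8 GPUs allocated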

Prolog and Epilog Scripts

Prolog and epilog scripts run automatically at the start and end of jobs. Managed Slurm supports two approaches: cluster-wide scripts configured by an admin, and per-job scripts specified by individual users.

Cluster-Wide Prolog/Epilog (Admin-Configured)

An admin places dispatcher scripts at a fixed path on the shared /home filesystem. Slurm calls them as root on every worker node at job start and end. The dispatcher can optionally call a per-user script if the user has created one.

Step 1 — Create the dispatcher scripts

From a login node:

mkdir -p /home/scripts

cat > /home/scripts/prolog.sh << 'EOF'
#!/bin/bash
USER_PROLOG="/home/${SLURM_JOB_USER}/prolog.sh"
if [ -x "$USER_PROLOG" ]; then
  su "$SLURM_JOB_USER" -s /bin/bash -c "$USER_PROLOG"
  EXIT_CODE=$?
  if [ $EXIT_CODE -ne 0 ]; then
    echo "$(date) job=$SLURM_JOB_ID node=$(hostname): prolog.sh failed with exit code $EXIT_CODE" \
      >> "/home/${SLURM_JOB_USER}/prolog-errors.log"
  fi
fi
exit 0
EOF

cat > /home/scripts/epilog.sh << 'EOF'
#!/bin/bash
USER_EPILOG="/home/${SLURM_JOB_USER}/epilog.sh"
if [ -x "$USER_EPILOG" ]; then
  su "$SLURM_JOB_USER" -s /bin/bash -c "$USER_EPILOG"
  EXIT_CODE=$?
  if [ $EXIT_CODE -ne 0 ]; then
    echo "$(date) job=$SLURM_JOB_ID node=$(hostname): epilog.sh failed with exit code $EXIT_CODE" \
      >> "/home/${SLURM_JOB_USER}/epilog-errors.log"
  fi
fi
exit 0
EOF

chmod 755 /home/scripts/prolog.sh /home/scripts/epilog.sh /home/scripts

Step 2 — Add Prolog/Epilog to the Controller CR

Edit the Controller CR and add these lines to spec.extraConf above the injected section markers (see Managing Partitions for details on editing spec.extraConf):

spec:
  extraConf: |
    Prolog=/home/scripts/prolog.sh
    Epilog=/home/scripts/epilog.sh

    # THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
    # ===============================START======================================
    ...

Step 3 — Reconfigure Slurm

scontrol reconfigure
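
To confirm the new settings are active, check the running configuration from a login node:

scontrol show config | grep -i -E 'prolog|epilog'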

Per-user scripts (optional)

Each user can create their own ~/prolog.sh and ~/epilog.sh. The dispatcher scripts above will call them automatically if they exist and are executable:

cat > ~/prolog.sh << 'EOF'
#!/bin/bash
echo "Job $SLURM_JOB_ID starting on $(hostname)" >> ~/prolog.log
exit 0
EOF
chmod +x ~/prolog.sh

If ~/prolog.sh is absent or not executable, it is silently skipped.

Key behaviors:

  • Runs once per node (not per task), as root
  • User scripts run as the submitting user via su
  • SLURM_* environment variables are available
  • Script output goes to /var/log/slurm/slurmd.log on each worker, not the job output file — redirect to a file in /home if you need visible output
  • Non-zero exit from the cluster-wide script requeues the job (JobRequeue=1 is set by default)

Per-Job Task Prolog/Epilog (User-Configured)

Users can specify prolog and epilog scripts per job using --task-prolog and --task-epilog. No admin setup is required.

In an sbatch script, set the equivalent input environment variables; --task-prolog and --task-epilog are srun options rather than sbatch options, and srun picks these variables up automatically:

#!/bin/bash
#SBATCH --nodes=2

export SLURM_TASK_PROLOG=/home/alice/setup.sh
export SLURM_TASK_EPILOG=/home/alice/teardown.sh

srun ./myprogram

Or inline with srun:

srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram
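
The referenced setup.sh is an ordinary script owned by the user. A minimal, purely illustrative version might stage a node-local scratch directory for the task:

#!/bin/bash
# Illustrative per-task setup: create node-local scratch for this job
mkdir -p "/tmp/${USER}/job-${SLURM_JOB_ID}"
echo "task prolog ran on $(hostname) for job ${SLURM_JOB_ID}"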

Key behaviors:

  • Runs once per task (not per node), as the submitting user
  • stdout appears in the job's output file
  • Non-zero exit logs a warning but does not requeue the job
  • No admin involvement required

Execution Order

When both cluster-wide and per-job scripts are configured, they execute in this order:

Job allocated

Prolog=/home/scripts/prolog.sh (per node, as root)
└─ ~/prolog.sh if present (per node, as user)

--task-prolog (per task, as user)

srun ./myprogram

--task-epilog (per task, as user)

Epilog=/home/scripts/epilog.sh (per node, as root)
└─ ~/epilog.sh if present (per node, as user)

Automatic Hardware Remediation

Managed Slurm clusters run on Crusoe Managed Kubernetes with AutoClusters enabled. AutoClusters automatically detects critical hardware failures — such as a GPU or HCA falling off the bus — and remediates them without manual intervention.

What Happens During Remediation

  1. Crusoe's monitoring pipeline detects a hardware issue on a worker node
  2. The affected Slurm node is set to DOWN, which immediately cancels any running job on that node — the job process receives a SIGTERM signal before being terminated
  3. You have up to 2 minutes to handle the SIGTERM (save checkpoints, flush logs, etc.) before the node is replaced
  4. The cancelled job is automatically requeued (JobRequeue=1 is enabled by default) and runs on a healthy node

Handling SIGTERM in Your Jobs

When a node goes down, Slurm sends SIGTERM to your job process. You can use trap to catch this signal and perform cleanup before the job is cancelled. For example:

#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --nodes=1
#SBATCH --output=%x-%j.out

# Catch SIGTERM and checkpoint before the node is replaced
# (save_checkpoint and run_training stand in for your own commands)
trap 'echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] SIGTERM received — saving checkpoint"; save_checkpoint' SIGTERM

# Run the workload in the background and wait on it so the shell
# can run the trap handler as soon as the signal arrives
run_training &
wait $!

Support

If you encounter issues or need assistance, contact Crusoe Cloud Support.