Advanced: Kubernetes Operations

Managed Slurm clusters run on Crusoe Managed Kubernetes (CMK). While the CLI and API handle most operations, you can also interact with the cluster directly via kubectl for advanced configuration, troubleshooting, and user management.

Accessing Your Cluster via kubectl

Configure kubectl to connect to your cluster:

crusoe kubernetes clusters get-credentials <cluster-name>

Verify the connection:

kubectl cluster-info

Viewing Cluster State

Slurm Custom Resources

The Crusoe Slurm Operator (CSO) uses four Custom Resource types to manage your cluster. When you create a Slurm cluster through the Slurm UI or the crusoe slurm CLI, Crusoe creates, controls, and reconciles these Custom Resources (CRs) in the slurm namespace.

kubectl get slurmclusters -n slurm     # Clusters
kubectl get nodesets -n slurm          # Node sets
kubectl get slurmusers -n slurm        # All users
kubectl get slurmusergroups -n slurm   # All groups

For detailed status with conditions:

kubectl describe slurmclusters <name> -n slurm

tip

Use kubectl explain for field-level documentation on any Custom Resource. The CRDs have built-in descriptions for every field:

kubectl explain slurmclusters.spec
kubectl explain slurmusers.spec
kubectl explain slurmusergroups.spec
kubectl explain nodesets.spec

Cluster Phase and Conditions

The SlurmCluster status includes a phase field (possible values: Provisioning, Installing, Ready, NotReady, CreateFailed, Deleting, DeleteFailed) and detailed conditions for each component. Use kubectl describe to see what's happening if your cluster isn't healthy:

kubectl describe slurmclusters <name> -n slurm

Look for conditions like:

ControllerReady — Slurm controller pod is running
LoginReady — Login nodes are running
SlinkyReady — Slinky Helm chart is installed
CertManagerReady — cert-manager is installed
LoadBalancerReady — Load balancer controller is installed
TopographReady — Topology discovery is running
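
As a rough sketch of how to spot unhealthy components quickly, the helper below reduces a list of type=status pairs to the conditions that are not True. It assumes the SlurmCluster CR follows the usual Kubernetes convention of exposing conditions under .status.conditions with type and status fields; the function name unhealthy_conditions is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print every condition whose status is not "True".
# Reads lines of the form "<Type>=<Status>" on stdin.
unhealthy_conditions() {
    awk -F= 'NF && $2 != "True" { print $1 " (" $2 ")" }'
}

# Possible usage, assuming conditions live under .status.conditions:
#   kubectl get slurmclusters <name> -n slurm \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

An empty result would mean every reported condition is True.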

Node Set Readiness

kubectl get nodesets reports each NodeSet's readyReplicas status in the READY column, in "ready/total" format:

kubectl get nodesets -n slurm

NAME                    READY   AGE
slurm-worker-node-set   2/2     1h
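
To flag node sets that are not fully ready, the output above can be filtered with a small helper. This is a minimal sketch that assumes the three-column layout shown (NAME, READY, AGE); the function name degraded_nodesets is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list node sets whose READY column is not "n/n".
# Reads `kubectl get nodesets -n slurm` output (with header) on stdin.
degraded_nodesets() {
    awk 'NR > 1 {
        split($2, r, "/")    # r[1] = ready replicas, r[2] = total replicas
        if (r[1] != r[2]) print $1 ": " r[1] " of " r[2] " ready"
    }'
}

# Possible usage:
#   kubectl get nodesets -n slurm | degraded_nodesets
```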

Storage Configuration

Managed Slurm uses a shared filesystem for the /home directory, mounted across all login and worker nodes. This is backed by a PersistentVolumeClaim (PVC) using the Crusoe CSI driver.

StorageClass

The operator automatically creates a StorageClass named <cluster-name>-crusoe-csi-driver-fs-sc with:

Provisioner: fs.csi.crusoe.ai
Volume binding mode: WaitForFirstConsumer
Volume expansion: Enabled

Viewing Storage

kubectl get pvc -n slurm         # View persistent volume claims
kubectl get sc                    # View storage classes

What the Operator Manages

The Crusoe Slurm Operator (CSO) continuously reconciles certain resources to keep your cluster in a healthy state. Understanding what CSO manages helps you know what's safe to modify and what will be reverted.

Reconciliation Reference

Resource	Managed by CSO?	Safe to Modify?	Notes
SlurmUser CRs	No — customer-owned	Yes	This is the intended way to manage users
SlurmUserGroup CRs	No — customer-owned	Yes	This is the intended way to manage groups
Slinky CRs (Controller, LoginSet, NodeSet)	Yes — full overwrite on every reconcile	No	Changes will be reverted automatically
gres-config ConfigMap	Yes — reconciled every cycle	No	Hardcoded to `AutoDetect=nvidia`
pyxis-config ConfigMap	Yes — reconciled every cycle (if Pyxis enabled)	No	Hardcoded default config
Auth Secrets (slurm-auth, jwt-auth)	Create-once — never regenerated	Do not modify	Breaking these breaks cluster authentication
nsscache ConfigMap	Yes — regenerated on user changes	No	Auto-managed from SlurmUser CRs
ssh-keys Secret	Yes — regenerated on user changes	No	Auto-managed from SlurmUser CRs
topology.conf ConfigMap	Created by CSO, then managed by Topograph	No	Topograph updates this automatically based on network topology
Home PVC	Validated but not overwritten after creation	Storage size only (increase)	Spec is immutable after initial creation
StorageClasses	Yes — reconciled every cycle	No	Provisioner and settings are fixed
Your own resources	Never	Yes	CSO ignores any resources it doesn't own
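
Since the Home PVC row allows a storage size increase, one way to request it could be a standard Kubernetes merge patch. The sketch below only builds the patch body; it assumes ordinary volume expansion semantics apply to this PVC, and the function name, PVC name, and size are placeholders rather than values from this guide.

```shell
#!/bin/bash
# Hypothetical helper: build a merge patch that requests a new PVC size.
pvc_resize_patch() {
    printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}

# Possible usage (PVC name and size are placeholders):
#   kubectl patch pvc <pvc-name> -n slurm --type merge -p "$(pvc_resize_patch 2Ti)"
```

The PVC name can be looked up with kubectl get pvc -n slurm.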

info

Key takeaway: The resources you should interact with are SlurmUser and SlurmUserGroup CRs (see User Management). Everything else is either managed by CSO or managed via the Crusoe CLI/API.

Slurm Commands Reference

Once connected to a login node via SSH, use standard Slurm commands:

Command	Description
`sinfo`	View cluster status and node information
`sinfo -R`	List the reasons nodes are down, drained, or failing
`squeue`	View the job queue
`sbatch`	Submit a batch job
`srun`	Run a job interactively
`scancel`	Cancel a job
`scontrol show node`	View detailed node information
`scontrol show config`	View Slurm configuration
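
For a quick summary of cluster health, the sinfo output can be condensed into a count of nodes per state. The helper below is a small sketch; the function name count_node_states is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count nodes per state.
# Reads "<node> <state>" lines on stdin.
count_node_states() {
    awk 'NF >= 2 { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# Possible usage (sort -u drops nodes listed once per partition):
#   sinfo -N -h -o "%N %T" | sort -u | count_node_states
```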

Running NCCL Tests

To validate multi-node GPU communication, run an NCCL all-reduce test. SSH into a login node and create the following script, named nccl_test.batch. This example targets H200 nodes; other hardware types have their own topology files. More test examples are available on your login node in /opt/examples:

#!/bin/bash

#SBATCH --job-name=nccl_tests
#SBATCH --nodes=<number-of-nodes>
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=20:00
#SBATCH --output="%x_%j.out"
#SBATCH --exclusive

export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h200-141gb-sxm-ib-cloud-hypervisor.xml
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"
export NCCL_IB_MERGE_VFS=0
export NCCL_DEBUG=WARN

export OMPI_MCA_coll_hcoll_enable=0
export PMIX_MCA_gds='^ds12'

export UCX_NET_DEVICES="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1"

srun --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 2G -e 32G -f 2

note

Update the NCCL_TOPO_FILE path to match your GPU type. The example above is for H200 nodes.

Submit the test:

sbatch nccl_test.batch

Monitor with squeue, and check the output file nccl_tests_<job_id>.out once complete.
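
nccl-tests prints an "Avg bus bandwidth" summary line (in GB/s) at the end of a run. As a minimal sketch, the helper below pulls that number out of the output file; the function name avg_bus_bandwidth is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the average bus bandwidth reported at the end
# of an nccl-tests run. Takes the path to the job's output file.
avg_bus_bandwidth() {
    awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Possible usage:
#   avg_bus_bandwidth nccl_tests_<job_id>.out
```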

To connect interactively to a worker node:

srun --pty bash                              # Any available worker
srun --nodelist=<worker-pod-name> --pty bash # A specific worker

Prolog and Epilog Scripts

Prolog and epilog scripts run automatically at the start and end of jobs. Managed Slurm supports two approaches: cluster-wide scripts configured by an admin, and per-job scripts specified by individual users.

Cluster-Wide Prolog/Epilog (Admin-Configured)

An admin places dispatcher scripts at a fixed path on the shared /home filesystem. Slurm calls them as root on every worker node at job start and end. The dispatcher can optionally call a per-user script if the user has created one.

Step 1 — Create the dispatcher scripts

From a login node:

mkdir -p /home/scripts

cat > /home/scripts/prolog.sh << 'EOF'
#!/bin/bash
USER_PROLOG="/home/${SLURM_JOB_USER}/prolog.sh"
if [ -x "$USER_PROLOG" ]; then
    su "$SLURM_JOB_USER" -s /bin/bash -c "$USER_PROLOG"
    EXIT_CODE=$?
    if [ $EXIT_CODE -ne 0 ]; then
        echo "$(date) job=$SLURM_JOB_ID node=$(hostname): prolog.sh failed with exit code $EXIT_CODE" \
            >> "/home/${SLURM_JOB_USER}/prolog-errors.log"
    fi
fi
exit 0
EOF

cat > /home/scripts/epilog.sh << 'EOF'
#!/bin/bash
USER_EPILOG="/home/${SLURM_JOB_USER}/epilog.sh"
if [ -x "$USER_EPILOG" ]; then
    su "$SLURM_JOB_USER" -s /bin/bash -c "$USER_EPILOG"
    EXIT_CODE=$?
    if [ $EXIT_CODE -ne 0 ]; then
        echo "$(date) job=$SLURM_JOB_ID node=$(hostname): epilog.sh failed with exit code $EXIT_CODE" \
            >> "/home/${SLURM_JOB_USER}/epilog-errors.log"
    fi
fi
exit 0
EOF

chmod 755 /home/scripts/prolog.sh /home/scripts/epilog.sh /home/scripts

Step 2 — Add Prolog/Epilog to the Controller CR

Edit the Controller CR and add these lines to spec.extraConf above the injected section markers (see Managing Partitions for details on editing spec.extraConf):

spec:
  extraConf: |
    Prolog=/home/scripts/prolog.sh
    Epilog=/home/scripts/epilog.sh

    # THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
    # ===============================START======================================
    ...

Step 3 — Reconfigure Slurm

scontrol reconfigure
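
To confirm that the new settings were picked up, the output of scontrol show config can be filtered for the Prolog and Epilog keys. This is a sketch only; the function name is hypothetical, and the optional [n] suffix in the pattern is a precaution in case your Slurm version prints indexed entries.

```shell
#!/bin/bash
# Hypothetical helper: print the Prolog and Epilog settings from
# `scontrol show config` output read on stdin. Related keys such as
# PrologFlags or EpilogMsgTime are deliberately not matched.
prolog_epilog_settings() {
    awk -F' *= *' '$1 ~ /^(Prolog|Epilog)(\[[0-9]+\])?$/ { print $1 "=" $2 }'
}

# Possible usage, from a login node:
#   scontrol show config | prolog_epilog_settings
```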

Per-user scripts (optional)

Each user can create their own ~/prolog.sh and ~/epilog.sh. The dispatcher scripts above will call them automatically if they exist and are executable:

cat > ~/prolog.sh << 'EOF'
#!/bin/bash
echo "Job $SLURM_JOB_ID starting on $(hostname)" >> ~/prolog.log
exit 0
EOF
chmod +x ~/prolog.sh

If ~/prolog.sh is absent or not executable, it is silently skipped.
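
Because a skipped script produces no message, it can help to check a hook before submitting a job. The sketch below mirrors the dispatcher's -x test; the function name check_hook is hypothetical.

```shell
#!/bin/bash
# Hypothetical preflight check mirroring the dispatcher's `-x` test.
# Prints "missing", "not executable", or "ok" for the given path.
check_hook() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -x "$1" ]; then
        echo "not executable"
    else
        echo "ok"
    fi
}

# Possible usage:
#   check_hook ~/prolog.sh
#   check_hook ~/epilog.sh
```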

Key behaviors:

Runs once per node (not per task), as root
User scripts run as the submitting user via su
SLURM_* environment variables are available
Script output goes to /var/log/slurm/slurmd.log on each worker, not the job output file — redirect to a file in /home if you need visible output
Non-zero exit from the cluster-wide script requeues the job (JobRequeue=1 is set by default); the dispatcher scripts above always exit 0, so a failing per-user script is only logged

Per-Job Task Prolog/Epilog (User-Configured)

Users can specify prolog and epilog scripts per job using the srun options --task-prolog and --task-epilog. No admin setup is required.

In an sbatch script, pass them on the srun line (sbatch does not accept these options as #SBATCH directives):

#!/bin/bash
#SBATCH --nodes=2

srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram

Or directly on the command line with srun:

srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram

Key behaviors:

Runs once per task (not per node), as the submitting user
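Slurm parses the task prolog's stdout: lines of the form print <text> appear in the job's output file, and lines of the form export NAME=value set environment variables for the task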
Non-zero exit logs a warning but does not requeue the job
No admin involvement required
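
Slurm reads a task prolog's stdout line by line and interprets lines that begin with export, unset, or print. The following is a minimal sketch of what a script such as /home/alice/setup.sh could contain; the function name and the JOB_SCRATCH variable are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Sketch of a task prolog. Slurm parses this script's stdout:
#   export NAME=value  -> sets NAME in the task's environment
#   print <text>       -> writes <text> to the task's stdout
task_prolog_lines() {
    # JOB_SCRATCH is a hypothetical variable name.
    echo "export JOB_SCRATCH=/tmp/job-${SLURM_JOB_ID:-unknown}"
    echo "print task ${SLURM_PROCID:-0} starting on $(hostname)"
}

task_prolog_lines
```

With this in place, each task would see JOB_SCRATCH in its environment, and the "starting on" line would show up in the job's output once per task.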

Execution Order

When both cluster-wide and per-job scripts are configured, they execute in this order:

Job allocated
  ↓
Prolog=/home/scripts/prolog.sh   (per node, as root)
  └─ ~/prolog.sh if present      (per node, as user)
  ↓
--task-prolog                     (per task, as user)
  ↓
srun ./myprogram
  ↓
--task-epilog                     (per task, as user)
  ↓
Epilog=/home/scripts/epilog.sh   (per node, as root)
  └─ ~/epilog.sh if present      (per node, as user)

Automatic Hardware Remediation

Managed Slurm clusters run on Crusoe Managed Kubernetes with AutoClusters enabled. AutoClusters automatically detects critical hardware failures — such as a GPU or HCA falling off the bus — and remediates them without manual intervention.

What Happens During Remediation

Crusoe's monitoring pipeline detects a hardware issue on a worker node
The affected Slurm node is set to DOWN, which immediately cancels any running job on that node — the job process receives a SIGTERM signal before being terminated
You have up to 2 minutes to handle the SIGTERM (save checkpoints, flush logs, etc.) before the node is replaced
The cancelled job is automatically requeued (JobRequeue=1 is enabled by default) and runs on a healthy node
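
To check whether a job was requeued after a remediation event, one option is to read the Restarts counter that scontrol show job prints. The helper below is a sketch; the function name job_restarts is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the Restarts counter from
# `scontrol show job <job_id>` output read on stdin.
job_restarts() {
    # Put each Key=Value pair on its own line, then pick out Restarts.
    tr ' ' '\n' | awk -F= '$1 == "Restarts" { print $2 }'
}

# Possible usage:
#   scontrol show job <job_id> | job_restarts
```

A value greater than 0 would indicate that the job has been restarted at least once.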

Handling SIGTERM in Your Jobs

When a node goes down, Slurm sends SIGTERM to your job process. You can use trap to catch this signal and perform cleanup before the job is cancelled. For example:

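#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --nodes=1
#SBATCH --output=%x-%j.out

# run_training and save_checkpoint are placeholders for your own commands.
on_sigterm() {
    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] SIGTERM received, saving checkpoint"
    save_checkpoint
}
trap on_sigterm SIGTERM

# Run the workload in the background and wait on it: bash runs the trap as
# soon as the signal arrives instead of after a foreground command exits.
run_training &
wait $!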

Next Steps

Quickstart — Create your first Slurm cluster
User Management — Add users and groups, manage partitions
Managing Partitions — Create and manage partitions in your Slurm cluster
Slurm Metrics — Monitor cluster health and performance
For Slurm command reference, see the official Slurm documentation

Support

If you encounter issues or need assistance, contact Crusoe Cloud Support.
