Node Health Checks
Managed Slurm ships a built-in node health-check suite that runs automatically at three points in the job lifecycle:
- Periodic (HealthCheckProgram): every 5 minutes on idle nodes
- Pre-job (prolog): on each allocated node before a job starts
- Post-job (epilog): on each allocated node after a job ends
Crusoe wires these dispatchers into slurm.conf for you. To add your own checks, use the SlurmClusterHealthCheck (SCHC) custom resource described below. Do not set your own Prolog, Epilog, or HealthCheckProgram in the Controller's spec.extraConf — they would conflict with the managed dispatchers.
When a check fails it either drains the node (no new jobs are scheduled; the reason is visible in sinfo -R and scontrol show node) or logs a warning (no scheduling impact). A drain during the prolog also requeues the in-flight job. Check results are surfaced in each worker pod's logfile container (view with kubectl logs); prolog and epilog output is additionally written to /var/log/prolog/<job-id>.log and /var/log/epilog/<job-id>.log on the node where they run.
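The places mentioned above can be inspected with a few commands. The following is a minimal sketch, not a supported tool: the function names show_node_health and job_log_paths are hypothetical, the caller has to supply the worker pod name, and it assumes sinfo, scontrol, and kubectl are all reachable from the same shell, which may not hold in your environment.

```shell
# Hypothetical helper: gather health-check evidence for one node.
# Usage: show_node_health <node-name> <worker-pod>
show_node_health() {
  local node="$1" pod="$2"
  # Drain reasons for every drained or down node.
  sinfo -R
  # Full record for the node, including its State and Reason fields.
  scontrol show node "$node"
  # Recent check results from the worker pod's logfile container.
  kubectl logs -n slurm "$pod" -c logfile --tail=50
}

# Print the per-job prolog and epilog log paths on the node where they ran.
# Usage: job_log_paths <job-id>
job_log_paths() {
  local job_id="$1"
  printf '/var/log/prolog/%s.log\n/var/log/epilog/%s.log\n' "$job_id" "$job_id"
}
```

For example, job_log_paths 1234 prints the two files to read on the node that ran job 1234.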
Prerequisites
- A running Managed Slurm cluster (see Quickstart)
- kubectl configured with access to the CMK cluster backing the Slurm cluster. Since the name of the backing CMK cluster matches the name of your Slurm cluster, the following command will give kubectl the correct credentials: crusoe kubernetes clusters get-credentials <slurm-cluster-name>
Get Your Cluster Name
Run the following command to find the SlurmCluster name used in the health-check object names:
kubectl get slurmclusters -n slurm
Example output:
NAMESPACE NAME AGE
slurm my-slurm-cluster 4h25m
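If you script against these objects, the names can be derived from the cluster name. This is a small sketch under the assumption that the slurm namespace contains exactly one SlurmCluster; both function names are hypothetical.

```shell
# Look up the name of the first SlurmCluster in the slurm namespace.
# Assumes exactly one cluster exists there.
first_slurm_cluster() {
  kubectl get slurmclusters -n slurm -o jsonpath='{.items[0].metadata.name}'
}

# Print the two health-check object names for a given cluster name.
# Usage: schc_object_names <cluster>
schc_object_names() {
  local cluster="$1"
  printf '%s-defaults\n%s-custom\n' "$cluster" "$cluster"
}
```

For the example output above, schc_object_names my-slurm-cluster prints my-slurm-cluster-defaults and my-slurm-cluster-custom.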
Limitations
- Add your own checks to the <cluster>-custom object only. The <cluster>-defaults object is Crusoe-managed and may be overwritten.
- Each object holds at most 50 scripts, and each script's source is limited to 16 KB.
- Script names must match NN-name (a two-digit prefix followed by lowercase letters, digits, and hyphens) and be unique within the object. The prefix only orders your scripts relative to each other; they always run after the built-in checks.
- To drain a node from a check, call scontrol directly (see Adding a Check).
- Changes take a few seconds to propagate to all nodes and apply on the next run of the affected phase.
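The name and size limits above can be checked locally before you edit the object. This is a sketch only: the pattern is one reading of the NN-name rule, 16 KB is assumed to mean 16384 bytes, and the function names are hypothetical. The cluster's own validation is authoritative.

```shell
# Assumed interpretation of the 16 KB source limit.
MAX_SOURCE_BYTES=16384

# Return 0 if the name looks like NN-name, for example 60-scratch-mount.
valid_check_name() {
  # LC_ALL=C keeps [a-z] limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9]{2}-[a-z0-9-]+$'
}

# Return 0 if the script file fits within the assumed size limit.
# Usage: valid_check_size <path-to-script>
valid_check_size() {
  local size
  size=$(( $(wc -c < "$1") )) || return 1
  [ "$size" -le "$MAX_SOURCE_BYTES" ]
}
```

A call such as valid_check_name 60-scratch-mount succeeds, while scratch-mount and 60-Scratch are rejected.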
What the Built-in Checks Cover
| Phase | When it runs | What it checks (examples) |
|---|---|---|
| Periodic (checks) | Every 5 min on idle nodes | /home and /data NFS mount, writability, and free space; GPU count vs. registered Slurm GRES; DCGM GPU health; NVIDIA driver persistence mode; InfiniBand port link state; system load, available memory/swap, local /tmp and /dev/shm space; kernel errors in dmesg (MCE, EDAC, disk I/O, NFS timeouts) |
| Pre-job (prolog) | Before a job, on each allocated node | required SLURM_* job variables present; residual GPU memory below threshold; CUDA_VISIBLE_DEVICES count matches allocated GPUs; DCGM level-1 diagnostics |
| Post-job (epilog) | After a job, on each allocated node | ECC double-bit errors recorded during the job; prune stale Docker containers; reset GPUs left with residual memory; kill leftover job processes and clean up orphaned shared memory; then re-run the periodic checks |
For the full list of built-in checks — including the exact failure trigger, whether it warns or drains the node, and when each is skipped — see the Default Check Reference at the bottom of this page.
Managing Your Checks
The health-check suite is managed by the SlurmClusterHealthCheck CRD (short name schc). Each cluster has two objects in the slurm namespace:
- <cluster>-defaults: the Crusoe-provided checks. Managed by CSO; do not edit.
- <cluster>-custom: empty by default and never overwritten by CSO. Add your own checks here.
Your scripts run automatically after the built-in scripts for the matching phase. Each script runs independently, so a crash or unhandled error in one is logged as a warning and does not affect the other checks.
Use kubectl explain slurmclusterhealthchecks.spec to see all available fields and their descriptions directly from the cluster.
Adding a Check
Add a script by editing the -custom object:
kubectl edit schc <cluster>-custom -n slurm
apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmClusterHealthCheck
metadata:
  name: <cluster>-custom
  namespace: slurm
spec:
  scripts:
    # Periodic check: runs every 5 minutes on idle nodes
    - name: "60-scratch-mount"
      type: checks
      enabled: true
      source: |
        #!/usr/bin/env bash
        if ! mountpoint -q /mnt/scratch; then
          echo "$(date -u +%FT%TZ) scratch_mount FAIL host=$(hostname -s) reason=not-mounted" >&2
          scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="scratch: /mnt/scratch not mounted"
        fi
        exit 0
    # Pre-job check: runs before every job on each allocated node
    - name: "60-dataset-present"
      type: prolog
      enabled: true
      source: |
        #!/usr/bin/env bash
        if [[ ! -r /data/shared/dataset.bin ]]; then
          echo "$(date -u +%FT%TZ) dataset_present FAIL host=$(hostname -s) job=${SLURM_JOB_ID} reason=dataset-unreadable" >&2
          scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="prolog: shared dataset unreadable"
          exit 1 # drains the node and requeues the job on a healthy node
        fi
        exit 0
    # Post-job check: runs after every job on each allocated node
    - name: "60-scratch-cleanup"
      type: epilog
      enabled: true
      source: |
        #!/usr/bin/env bash
        rm -rf "/mnt/scratch/job-${SLURM_JOB_ID}" || \
          echo "$(date -u +%FT%TZ) scratch_cleanup WARN host=$(hostname -s) reason=cleanup-failed" >&2
        exit 0 # epilog must always exit 0
Field reference:
- name: script filename in the form NN-name (a two-digit prefix, then lowercase letters, digits, and hyphens). The prefix orders your scripts relative to each other within a phase; any number works, since your scripts always run after the built-in checks.
- type: one of checks, prolog, or epilog.
- enabled: set to false to keep a script defined but exclude it from execution.
- source: the shell script. Write to stderr for log output (surfaced in the worker pod's logfile container via kubectl logs).
Available environment variables:
- Periodic (checks) scripts run with no job context: no SLURM_JOB_* variables are set.
- prolog and epilog scripts run with the standard Slurm job environment, including SLURM_JOB_ID, SLURM_JOB_USER, SLURM_JOB_UID, SLURM_JOB_NODELIST, and, for GPU jobs, CUDA_VISIBLE_DEVICES and SLURM_JOB_GPUS.
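Because SLURM_JOB_ID is only present in the prolog and epilog phases, a logging helper shared between scripts can add the job field only when it exists. The sketch below follows the log-line layout used in the examples above; the function name check_log is hypothetical and would need to be pasted into each script's source.

```shell
# Hypothetical logging helper for custom checks.
# Usage: check_log <check-name> <PASS|WARN|FAIL> <reason>
check_log() {
  local name="$1" status="$2" reason="$3" line
  line="$(date -u +%FT%TZ) ${name} ${status} host=$(hostname -s)"
  # SLURM_JOB_ID is set for prolog and epilog scripts, not for periodic checks.
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    line="${line} job=${SLURM_JOB_ID}"
  fi
  # Log output goes to stderr.
  echo "${line} reason=${reason}" >&2
}
```

In a periodic script, check_log scratch_mount FAIL not-mounted writes a line without a job field; in a prolog or epilog script the same call includes job=<id>.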
Draining and exit codes:
- To drain the node, call scontrol directly: scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="<short reason>" (keep the reason under 250 characters).
- checks and prolog scripts should exit 0 after draining; the drain is signalled by the scontrol call, not the exit code. In a prolog script, exit 1 instead if you also want the job requeued.
- epilog scripts must always exit 0; Slurm treats a non-zero epilog exit as a node-fatal event.
- For a warn-only check, just log to stderr and exit 0.
Changes are picked up automatically across all nodes within seconds of saving (no scontrol reconfigure needed) and take effect on the next run of that phase.
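The drain call and the reason-length guidance can be wrapped in one small function. This is a sketch, assuming that truncating to 249 characters is an acceptable way to stay under the limit; the name drain_self is hypothetical.

```shell
# Hypothetical helper: drain the local node with a bounded reason string.
# Usage: drain_self "<short reason>"
drain_self() {
  local reason
  # Truncate so the reason stays under 250 characters.
  reason=$(printf '%.249s' "$1")
  scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="$reason"
}
```

A checks script would call drain_self and then exit 0. A prolog script would follow it with exit 1 only when the job should also be requeued, and an epilog script would always follow it with exit 0.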
Viewing Your Checks
List the health-check objects for your cluster:
kubectl get schc -n slurm
To see the scripts currently defined in your custom object:
kubectl get schc <cluster>-custom -n slurm -o yaml
The scripts are mounted on each worker pod under /opt/crusoe/healthcheck, so you can inspect what's running on a node directly:
kubectl exec -n slurm <worker-pod> -c slurmd -- ls /opt/crusoe/healthcheck/
Removing a Check
Edit the -custom object and delete the script's entry from spec.scripts (or set enabled: false to keep it defined but inactive):
kubectl edit schc <cluster>-custom -n slurm
The script is removed from the affected nodes within seconds.
Per-Job Task Prolog/Epilog (User-Configured)
Users can specify prolog and epilog scripts per job using --task-prolog and --task-epilog. No admin setup is required.
In an sbatch script:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --task-prolog=/home/alice/setup.sh
#SBATCH --task-epilog=/home/alice/teardown.sh
srun ./myprogram
Or inline with srun:
srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram
Key behaviors:
- Runs once per task (not per node), as the submitting user
- stdout appears in the job's output file
- Non-zero exit logs a warning but does not requeue the job
- No admin involvement required
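As an illustration, a script such as /home/alice/setup.sh might look like the sketch below. It relies on upstream Slurm behavior documented for srun --task-prolog, where stdout lines of the form export NAME=value set variables in the task's environment; whether Managed Slurm changes this is not stated here, so treat it as an assumption. The SCRATCH_DIR variable and the directory layout are hypothetical.

```shell
# Hypothetical body of a --task-prolog script, wrapped in a function.
task_prolog() {
  local dir
  dir="${TMPDIR:-/tmp}/${USER:-user}-job-${SLURM_JOB_ID:-unknown}"
  # Create a per-job working directory; runs as the submitting user.
  mkdir -p "$dir" || return 1
  # An "export NAME=value" line on stdout sets NAME for the task being started.
  echo "export SCRATCH_DIR=${dir}"
}
```

The script file itself would end with a plain call to task_prolog, and a matching --task-epilog script could remove the directory again.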
Execution Order
When both the managed health-check prolog/epilog and per-job task scripts are configured, each allocated node runs them in this order:
| Step | What runs | Runs as | Scope |
|---|---|---|---|
| 1 | Prolog dispatcher — built-in ambient checks, then your checks scripts | root | Per node |
| 2 | Prolog dispatcher — built-in pre-job checks, then your prolog scripts | root | Per node |
| 3 | --task-prolog (if set on the job) | Submitting user | Per task |
| 4 | The job itself (srun ./myprogram) | Submitting user | Per task |
| 5 | --task-epilog (if set on the job) | Submitting user | Per task |
| 6 | Epilog dispatcher — built-in cleanup, then your epilog scripts | root | Per node |
| 7 | Epilog dispatcher — built-in ambient re-check, then your checks scripts | root | Per node |
Default Check Reference
The built-in checks shipped in the <cluster>-defaults object, grouped by phase. GPU and InfiniBand checks skip cleanly on nodes without that hardware. Thresholds shown are the defaults. To change a default check (disable it, adjust a threshold, or change its failure behavior), contact our support team.
Periodic Checks
Run every 5 minutes on idle nodes.
| Check | What it verifies | On failure | Skipped when |
|---|---|---|---|
| 10-filesystem | /home is NFS-mounted and writable | Warn: not NFS-mounted, or the write probe fails | Never |
| 11-filesystem-data | /data is NFS-mounted and writable | Warn: not NFS-mounted, or the write probe fails | /data is absent on the node |
| 12-filesystem-space | /home has at least 50 GB free | Warn: free space below threshold, or df fails | Never |
| 13-filesystem-data-space | /data has at least 50 GB free | Warn: free space below threshold, or df fails | /data is absent on the node |
| 15-hardware | GPU count (from Slurm GRES and nvidia-smi) matches the expected count (8) | Drain: GPU count below expected; Warn: GRES unavailable or the nvidia-smi query fails | GPU portion skipped on non-GPU nodes |
| 20-gpu-dcgm | DCGM reports all GPUs healthy | Drain: DCGM overall health is a failure; Warn: DCGM reports a warning-level event, or DCGM is unreachable | Non-GPU node |
| 25-network-ib | All InfiniBand ports are in the Active link state | Drain: any port is not Active | The node has no InfiniBand ports |
| 30-driver | NVIDIA persistence mode is enabled | Drain: persistence mode is not enabled | Non-GPU node |
| 35-load | 1-minute load average is within 2× the CPU count | Warn: load above threshold | Never |
| 40-fs-local | /tmp and /dev/shm have at least 1 GB free and 10,000 free inodes | Drain: free space below threshold; Warn: free inodes below threshold | Never |
| 45-memory | At least 2 GB of memory is available and swap is not in use | Warn: low available memory, or swap in active use | Never |
| 50-dmesg | Kernel ring buffer is free of fatal hardware errors | Drain: machine-check exception, uncorrectable memory (EDAC) error, disk I/O error, or a recent NFS "server not responding"; Warn: correctable memory (EDAC) error | The ring buffer is empty |
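To make one of these thresholds concrete, the 35-load rule (1-minute load within 2× the CPU count) can be expressed as a short comparison. This is an illustrative sketch, not the built-in script; the function name is hypothetical and it assumes a load exactly equal to the limit still passes.

```shell
# Return 0 when the load average is within 2x the CPU count.
# Usage: load_within_limit <load-average> <cpu-count>
load_within_limit() {
  local load="$1" cpus="$2"
  # awk performs the floating-point comparison; exit status 0 means within limit.
  awk -v l="$load" -v c="$cpus" 'BEGIN { exit !(l <= 2 * c) }'
}
```

On a Linux node the inputs could come from the first field of /proc/loadavg and from nproc. With 8 CPUs, a load of 15.9 passes and a load of 16.5 does not.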
Pre-job Checks (Prolog)
Run on each allocated node before the job starts. A failure drains the node and requeues the job.
| Check | What it verifies | On failure | Skipped when |
|---|---|---|---|
| 10-job-env | Required job variables (SLURM_JOB_ID, SLURM_JOB_USER, SLURM_JOB_UID) are set | Drain + requeue: any variable is missing | Never |
| 20-gpu-residual | Assigned GPUs hold less than 100 MiB of residual memory | Drain + requeue: any assigned GPU is over threshold | Non-GPU node, or no GPUs assigned to the job |
| 25-cuda-visible | CUDA_VISIBLE_DEVICES count matches the GPUs allocated to the job | Drain + requeue: visible count is below the allocated count | Non-GPU node, no GPUs assigned, or the allocated count cannot be determined |
| 30-dcgm-diag | DCGM level-1 diagnostic passes on the assigned GPUs | Drain + requeue: the diagnostic fails or times out | Non-GPU node, no GPUs assigned, or a diagnostic is already running on the node |
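The comparison behind a check like 25-cuda-visible can be sketched as counting the comma-separated entries in CUDA_VISIBLE_DEVICES and comparing that to the allocated GPU count. This is not the built-in implementation; the function names are hypothetical, and how the allocated count is obtained is left to the caller.

```shell
# Count comma-separated entries in a CUDA_VISIBLE_DEVICES-style string.
# Usage: count_visible_gpus "0,1,2,3"
count_visible_gpus() {
  # An empty string means no visible devices.
  [ -z "$1" ] && { echo 0; return; }
  printf '%s\n' "$1" | awk -F',' '{ print NF }'
}

# Return 0 when at least as many GPUs are visible as were allocated.
# Usage: visible_covers_allocated "<visible-list>" <allocated-count>
visible_covers_allocated() {
  local visible
  visible=$(count_visible_gpus "$1")
  [ "$visible" -ge "$2" ]
}
```

A custom prolog script could call visible_covers_allocated "$CUDA_VISIBLE_DEVICES" with its own expected count and drain the node when the call fails.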
Post-job Checks (Epilog)
Run on each allocated node after the job ends. Epilog never fails the job via its exit code — health failures drain the node via scontrol instead.
| Check | What it does | On failure | Skipped when |
|---|---|---|---|
| 10-dcgm-stats | Checks job-scoped DCGM stats for ECC double-bit errors | Drain: one or more ECC double-bit errors were recorded during the job | Non-GPU node, no job ID, or stats are unavailable |
| 20-containers | Prunes stopped Docker containers idle for more than 1 hour | Warn: the prune command fails | Never |
| 30-gpu-reset | Resets GPUs that still hold residual memory after the job | Warn: the reset fails | Non-GPU node, or no GPUs assigned |
| 40-processes-ipc | Kills leftover job processes and removes orphaned shared-memory segments | Warn: lingering processes were found and killed (informational) | No job context, or the job user is a system account (UID < 1000) |
Next Steps
- Quickstart — Create your first Slurm cluster
- User Management — Add users and groups, manage partitions
- Managing Partitions — Create and manage partitions in your Slurm cluster
- Slurm Metrics — Monitor cluster health and performance
- Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
- For Slurm command reference, see the official Slurm documentation