Node Health Checks

Managed Slurm ships a built-in node health-check suite that runs automatically at three points in the job lifecycle:

  • Periodic — every 5 minutes on idle nodes (HealthCheckProgram)
  • Pre-job (prolog) — on each allocated node before a job starts
  • Post-job (epilog) — on each allocated node after a job ends

Crusoe wires the dispatchers for all three phases into slurm.conf for you. To add your own checks, use the SlurmClusterHealthCheck (SCHC) custom resource described below. Do not set your own Prolog, Epilog, or HealthCheckProgram in the Controller's spec.extraConf, because they would conflict with the managed dispatchers.

When a check fails, it either drains the node (no new jobs are scheduled, and the reason is visible in sinfo -R and scontrol show node) or logs a warning (no scheduling impact). A drain during the prolog also requeues the in-flight job. Check results are surfaced in each worker pod's logfile container (view with kubectl logs). Prolog and epilog output is additionally written to /var/log/prolog/<job-id>.log and /var/log/epilog/<job-id>.log on the node where the script runs.
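The helpers below sketch one way to trace a drained node back to the check that drained it. The container name logfile is taken from the description above, and the pod name, node name, and job ID are placeholders; treat all of them as assumptions to verify against your cluster.

```shell
#!/usr/bin/env bash
# Sketch: trace a drained node back to the failing check.
# Assumptions: namespace "slurm" and a worker-pod container named "logfile".

# List every drained or down node together with its reason.
list_drain_reasons() {
  sinfo -R
}

# Show the full state of one node, including its Reason field.
show_node() {
  scontrol show node "$1"
}

# Print recent health-check output from one worker pod.
check_logs() {
  kubectl logs -n slurm "$1" -c logfile --tail=200
}

# Build the per-job log path written during the prolog or epilog phase.
# $1 = prolog|epilog, $2 = Slurm job ID
phase_log_path() {
  printf '/var/log/%s/%s.log\n' "$1" "$2"
}
```

A typical sequence would be list_drain_reasons, then check_logs <worker-pod>, then reading the file named by phase_log_path prolog <job-id> on the node that ran the prolog.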

Prerequisites

  • A running Managed Slurm cluster (see Quickstart)
  • kubectl configured with access to the CMK cluster backing the Slurm cluster. Since the name of the backing CMK cluster matches the name of your Slurm cluster, the following command will give kubectl the correct credentials:
    crusoe kubernetes clusters get-credentials <slurm-cluster-name>

Get Your Cluster Name

Run the following command to find the SlurmCluster name used in the health-check object names:

kubectl get slurmclusters -n slurm

Example output:

NAMESPACE   NAME               AGE
slurm       my-slurm-cluster   4h25m

Limitations

  • Add your own checks to the <cluster>-custom object only. The <cluster>-defaults object is Crusoe-managed and may be overwritten.
  • Each object holds at most 50 scripts, and each script's source is limited to 16 KB.
  • Script names must match NN-name (a two-digit prefix followed by lowercase letters, digits, and hyphens) and be unique within the object. The prefix only orders your scripts relative to each other; they always run after the built-in checks.
  • To drain a node from a check, call scontrol directly (see Adding a Check).
  • Changes take a few seconds to propagate to all nodes and apply on the next run of the affected phase.
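The naming and size limits above can be checked locally before editing the object. This is a minimal sketch, not a Crusoe tool: the function names are hypothetical, and it assumes "16 KB" means 16 × 1024 bytes, which the page does not state.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight validation for a custom check script.
# Assumption: the 16 KB source limit is 16 * 1024 bytes.

MAX_SOURCE_BYTES=16384

# Succeeds when $1 is two digits, a hyphen, then one or more lowercase
# letters, digits, or hyphens (for example 60-scratch-mount).
# The body runs in a subshell so the locale change stays local.
valid_script_name() (
  LC_ALL=C
  case "$1" in
    [0-9][0-9]-?*) ;;
    *) exit 1 ;;
  esac
  # Reject any character outside the allowed set after the NN- prefix.
  case "${1#???}" in
    *[!a-z0-9-]*) exit 1 ;;
  esac
  exit 0
)

# Succeeds when the file at $1 fits within the source size limit.
valid_script_size() {
  vs_size=$(( $(wc -c < "$1") ))
  [ "$vs_size" -le "$MAX_SOURCE_BYTES" ]
}
```

For example, valid_script_name 60-scratch-mount && valid_script_size ./scratch-mount.sh succeeds only when both limits are met.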

What the Built-in Checks Cover

| Phase | When it runs | What it checks (examples) |
| --- | --- | --- |
| Periodic (checks) | Every 5 min on idle nodes | /home and /data NFS mount, writability, and free space; GPU count vs. registered Slurm GRES; DCGM GPU health; NVIDIA driver persistence mode; InfiniBand port link state; system load, available memory/swap, local /tmp and /dev/shm space; kernel errors in dmesg (MCE, EDAC, disk I/O, NFS timeouts) |
| Pre-job (prolog) | Before a job, on each allocated node | Required SLURM_* job variables present; residual GPU memory below threshold; CUDA_VISIBLE_DEVICES count matches allocated GPUs; DCGM level-1 diagnostics |
| Post-job (epilog) | After a job, on each allocated node | ECC double-bit errors recorded during the job; prune stale Docker containers; reset GPUs left with residual memory; kill leftover job processes and clean up orphaned shared memory; then re-run the periodic checks |

For the full list of built-in checks — including the exact failure trigger, whether it warns or drains the node, and when each is skipped — see the Default Check Reference at the bottom of this page.

Managing Your Checks

The health-check suite is managed by the SlurmClusterHealthCheck CRD (short name schc). Each cluster has two objects in the slurm namespace:

  • <cluster>-defaults — the Crusoe-provided checks. Managed by CSO; do not edit.
  • <cluster>-custom — empty by default and never overwritten by CSO. Add your own checks here.

Your scripts run automatically after the built-in scripts for the matching phase. Each script runs independently, so a crash or unhandled error in one is logged as a warning and does not affect the other checks.

Tip: Use kubectl explain slurmclusterhealthchecks.spec to see all available fields and their descriptions directly from the cluster.

Adding a Check

Add a script by editing the -custom object:

kubectl edit schc <cluster>-custom -n slurm

apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmClusterHealthCheck
metadata:
  name: <cluster>-custom
  namespace: slurm
spec:
  scripts:
    # Periodic check: runs every 5 minutes on idle nodes
    - name: "60-scratch-mount"
      type: checks
      enabled: true
      source: |
        #!/usr/bin/env bash
        if ! mountpoint -q /mnt/scratch; then
          echo "$(date -u +%FT%TZ) scratch_mount FAIL host=$(hostname -s) reason=not-mounted" >&2
          scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="scratch: /mnt/scratch not mounted"
        fi
        exit 0

    # Pre-job check: runs before every job on each allocated node
    - name: "60-dataset-present"
      type: prolog
      enabled: true
      source: |
        #!/usr/bin/env bash
        if [[ ! -r /data/shared/dataset.bin ]]; then
          echo "$(date -u +%FT%TZ) dataset_present FAIL host=$(hostname -s) job=${SLURM_JOB_ID} reason=dataset-unreadable" >&2
          scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="prolog: shared dataset unreadable"
          exit 1 # drains the node and requeues the job on a healthy node
        fi
        exit 0

    # Post-job check: runs after every job on each allocated node
    - name: "60-scratch-cleanup"
      type: epilog
      enabled: true
      source: |
        #!/usr/bin/env bash
        rm -rf "/mnt/scratch/job-${SLURM_JOB_ID}" || \
          echo "$(date -u +%FT%TZ) scratch_cleanup WARN host=$(hostname -s) reason=cleanup-failed" >&2
        exit 0 # epilog must always exit 0

Field reference:

  • name — script filename in the form NN-name (a two-digit prefix, then lowercase letters, digits, and hyphens). The prefix orders your scripts relative to each other within a phase; any number works, since your scripts always run after the built-in checks.
  • type — one of checks, prolog, or epilog.
  • enabled — set to false to keep a script defined but exclude it from execution.
  • source — the shell script. Write to stderr for log output (surfaced in the worker pod's logfile container via kubectl logs).

Available environment variables:

  • Periodic (checks) scripts run with no job context — no SLURM_JOB_* variables are set.
  • prolog and epilog scripts run with the standard Slurm job environment, including SLURM_JOB_ID, SLURM_JOB_USER, SLURM_JOB_UID, SLURM_JOB_NODELIST, and — for GPU jobs — CUDA_VISIBLE_DEVICES and SLURM_JOB_GPUS.
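Because periodic scripts run without a job context, a helper shared across phases has to tolerate unset SLURM_JOB_* variables. The sketch below shows one way to do that with default expansions; the function names job_context and log_line are hypothetical, and the "none" placeholder is a choice made for this example.

```shell
#!/usr/bin/env bash
# Sketch: a log helper that works in checks, prolog, and epilog scripts.
# Unset job variables expand to "none" rather than an empty string.

# Print the job-related fields available in the current phase.
job_context() {
  printf 'job=%s user=%s gpus=%s\n' \
    "${SLURM_JOB_ID:-none}" \
    "${SLURM_JOB_USER:-none}" \
    "${SLURM_JOB_GPUS:-none}"
}

# Write one log line to stderr.
# $1 = check name, $2 = status (for example FAIL or WARN), rest = extra fields
log_line() {
  ll_name=$1
  ll_status=$2
  shift 2
  ll_host=$(hostname -s 2>/dev/null || uname -n)
  echo "$(date -u +%FT%TZ) $ll_name $ll_status host=$ll_host $(job_context) $*" >&2
}
```

A prolog script could then call log_line dataset_present FAIL reason=dataset-unreadable and get the job ID and user in the line without referencing the variables directly.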

Draining and exit codes:

  • To drain the node, call scontrol directly: scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="<short reason>" (keep the reason under 250 characters).
  • checks and prolog scripts should exit 0 after draining — the drain is signalled by the scontrol call, not the exit code. In a prolog script, exit 1 instead if you also want the job requeued.
  • epilog scripts must always exit 0 — Slurm treats a non-zero epilog exit as a node-fatal event.
  • For a warn-only check, just log to stderr and exit 0.

Changes are picked up automatically across all nodes within seconds of saving (no scontrol reconfigure needed) and take effect on the next run of that phase.

Viewing Your Checks

List the health-check objects for your cluster:

kubectl get schc -n slurm

To see the scripts currently defined in your custom object:

kubectl get schc <cluster>-custom -n slurm -o yaml
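When the full YAML is more than needed, a jsonpath query can print a one-line summary per script. The sketch below assumes the field layout shown in Adding a Check (spec.scripts with name, type, and enabled); the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print one line per custom script as name<TAB>type<TAB>enabled.
# $1 = SlurmCluster name, as listed by: kubectl get slurmclusters -n slurm

list_custom_checks() {
  kubectl get schc "$1-custom" -n slurm \
    -o jsonpath='{range .spec.scripts[*]}{.name}{"\t"}{.type}{"\t"}{.enabled}{"\n"}{end}'
}
```

For example, list_custom_checks my-slurm-cluster queries the my-slurm-cluster-custom object.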

The scripts are mounted on each worker pod under /opt/crusoe/healthcheck, so you can inspect what's running on a node directly:

kubectl exec -n slurm <worker-pod> -c slurmd -- ls /opt/crusoe/healthcheck/

Removing a Check

Edit the -custom object and delete the script's entry from spec.scripts (or set enabled: false to keep it defined but inactive):

kubectl edit schc <cluster>-custom -n slurm

The script is removed from the affected nodes within seconds.
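For automation, where an interactive kubectl edit is awkward, the enabled flag can in principle be flipped with a JSON patch. This is a sketch rather than a documented workflow: it assumes the object accepts JSON patches like other custom resources, and it addresses the script by its zero-based position in spec.scripts, which you would need to look up first. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: disable one custom check non-interactively.

# Build a JSON patch that sets enabled=false on the script at index $1.
# The "add" op sets the field whether or not it is already present.
disable_patch() {
  printf '[{"op":"add","path":"/spec/scripts/%s/enabled","value":false}]\n' "$1"
}

# Apply the patch. $1 = SlurmCluster name, $2 = zero-based script index.
disable_check() {
  kubectl patch schc "$1-custom" -n slurm --type=json -p "$(disable_patch "$2")"
}
```

For example, disable_check my-slurm-cluster 0 would disable the first entry in the custom object's script list.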

Per-Job Task Prolog/Epilog (User-Configured)

Users can attach prolog and epilog scripts to an individual job step with the srun options --task-prolog and --task-epilog. No admin setup is required.

In an sbatch script, pass the options to srun (they are srun options, not #SBATCH directives):

#!/bin/bash
#SBATCH --nodes=2

srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram

Or directly from the command line:

srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram

Key behaviors:

  • Runs once per task (not per node), as the submitting user
  • stdout appears in the job's output file
  • Non-zero exit logs a warning but does not requeue the job
  • No admin involvement required

Execution Order

When both the managed health-check prolog/epilog and per-job task scripts are configured, each allocated node runs them in this order:

| Step | What runs | Runs as | Scope |
| --- | --- | --- | --- |
| 1 | Prolog dispatcher: built-in periodic (ambient) checks, then your checks scripts | root | Per node |
| 2 | Prolog dispatcher: built-in pre-job checks, then your prolog scripts | root | Per node |
| 3 | --task-prolog (if set on the job) | Submitting user | Per task |
| 4 | The job itself (srun ./myprogram) | Submitting user | Per task |
| 5 | --task-epilog (if set on the job) | Submitting user | Per task |
| 6 | Epilog dispatcher: built-in cleanup, then your epilog scripts | root | Per node |
| 7 | Epilog dispatcher: built-in periodic (ambient) re-check, then your checks scripts | root | Per node |

Default Check Reference

These are the built-in checks shipped in the <cluster>-defaults object, grouped by phase. GPU and InfiniBand checks skip cleanly on nodes without that hardware. Thresholds shown are the defaults. To change a default check (disable it, adjust a threshold, or change its failure behavior), contact our support team.

Periodic Checks

Run every 5 minutes on idle nodes.

| Check | What it verifies | On failure | Skipped when |
| --- | --- | --- | --- |
| 10-filesystem | /home is NFS-mounted and writable | Warn: not NFS-mounted, or the write probe fails | |
| 11-filesystem-data | /data is NFS-mounted and writable | Warn: not NFS-mounted, or the write probe fails | /data is absent on the node |
| 12-filesystem-space | /home has at least 50 GB free | Warn: free space below threshold, or df fails | |
| 13-filesystem-data-space | /data has at least 50 GB free | Warn: free space below threshold, or df fails | /data is absent on the node |
| 15-hardware | GPU count (from Slurm GRES and nvidia-smi) matches the expected count (8) | Drain: GPU count below expected; Warn: GRES unavailable or the nvidia-smi query fails | GPU portion skipped on non-GPU nodes |
| 20-gpu-dcgm | DCGM reports all GPUs healthy | Drain: DCGM overall health is a failure; Warn: DCGM reports a warning-level event, or DCGM is unreachable | Non-GPU node |
| 25-network-ib | All InfiniBand ports are in the Active link state | Drain: any port is not Active | The node has no InfiniBand ports |
| 30-driver | NVIDIA persistence mode is enabled | Drain: persistence mode is not enabled | Non-GPU node |
| 35-load | 1-minute load average is within 2× the CPU count | Warn: load above threshold | |
| 40-fs-local | /tmp and /dev/shm have at least 1 GB free and 10,000 free inodes | Drain: free space below threshold; Warn: free inodes below threshold | |
| 45-memory | At least 2 GB of memory is available and swap is not in use | Warn: low available memory, or swap in active use | |
| 50-dmesg | Kernel ring buffer is free of fatal hardware errors | Drain: machine-check exception, uncorrectable memory (EDAC) error, disk I/O error, or a recent NFS "server not responding"; Warn: correctable memory (EDAC) error | The ring buffer is empty |
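A custom check can apply the same kind of free-space threshold to a path the built-in checks do not cover. The sketch below is not the built-in implementation; the function names are hypothetical, and it treats 1 GB as 1024 × 1024 KB, which is an assumption about how the thresholds are measured.

```shell
#!/usr/bin/env bash
# Sketch: free-space test for a custom path, in the style of 12-filesystem-space.

# Print the space available on the filesystem holding $1, in whole GB.
# Returns non-zero when df cannot report on the path.
free_gb() {
  fg_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
  [ -n "$fg_kb" ] || return 1
  echo $(( fg_kb / 1024 / 1024 ))
}

# Succeeds when $1 (free GB) is below $2 (minimum GB).
below_threshold() {
  [ "$1" -lt "$2" ]
}
```

A warn-only check could run gb=$(free_gb /mnt/scratch) and, if below_threshold "$gb" 50 succeeds, log to stderr and exit 0.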

Pre-job Checks (Prolog)

Run on each allocated node before the job starts. A failure drains the node and requeues the job.

| Check | What it verifies | On failure | Skipped when |
| --- | --- | --- | --- |
| 10-job-env | Required job variables (SLURM_JOB_ID, SLURM_JOB_USER, SLURM_JOB_UID) are set | Drain + requeue: any variable is missing | |
| 20-gpu-residual | Assigned GPUs hold less than 100 MiB of residual memory | Drain + requeue: any assigned GPU is over threshold | Non-GPU node, or no GPUs assigned to the job |
| 25-cuda-visible | CUDA_VISIBLE_DEVICES count matches the GPUs allocated to the job | Drain + requeue: visible count is below the allocated count | Non-GPU node, no GPUs assigned, or the allocated count cannot be determined |
| 30-dcgm-diag | DCGM level-1 diagnostic passes on the assigned GPUs | Drain + requeue: the diagnostic fails or times out | Non-GPU node, no GPUs assigned, or a diagnostic is already running on the node |
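To illustrate the kind of test a residual-memory check performs, the sketch below parses nvidia-smi output and reports GPUs holding 100 MiB or more. It is not the built-in 20-gpu-residual implementation: the function names are hypothetical, and unlike the built-in check it looks at every GPU on the node rather than only those assigned to the job.

```shell
#!/usr/bin/env bash
# Sketch: find GPUs with residual memory at or above a threshold.

THRESHOLD_MIB=100

# Read "index, memory.used" CSV lines on stdin and print the index of each
# GPU whose used memory (MiB) is at or above the threshold.
gpus_over_threshold() {
  awk -F', *' -v max="$THRESHOLD_MIB" '$2 + 0 >= max { print $1 }'
}

# Query used memory per GPU; with nounits the value is a bare MiB number.
query_gpu_memory() {
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
}
```

On a GPU node, query_gpu_memory | gpus_over_threshold prints nothing when every GPU is below the threshold.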

Post-job Checks (Epilog)

Run on each allocated node after the job ends. Epilog never fails the job via its exit code — health failures drain the node via scontrol instead.

| Check | What it does | On failure | Skipped when |
| --- | --- | --- | --- |
| 10-dcgm-stats | Checks job-scoped DCGM stats for ECC double-bit errors | Drain: one or more ECC double-bit errors were recorded during the job | Non-GPU node, no job ID, or stats are unavailable |
| 20-containers | Prunes stopped Docker containers idle for more than 1 hour | Warn: the prune command fails | |
| 30-gpu-reset | Resets GPUs that still hold residual memory after the job | Warn: the reset fails | Non-GPU node, or no GPUs assigned |
| 40-processes-ipc | Kills leftover job processes and removes orphaned shared-memory segments | Warn: lingering processes were found and killed (informational) | No job context, or the job user is a system account (UID < 1000) |

Next Steps