Node Health Checks

Managed Slurm ships a built-in node health-check suite that runs automatically at three points in the job lifecycle:

Periodic — every 5 minutes on idle nodes (HealthCheckProgram)
Pre-job (prolog) — on each allocated node before a job starts
Post-job (epilog) — on each allocated node after a job ends

Crusoe wires a dispatcher for each of these phases into slurm.conf for you. To add your own checks, use the SlurmClusterHealthCheck (SCHC) custom resource described below. Do not set your own Prolog, Epilog, or HealthCheckProgram in the Controller's spec.extraConf; they would conflict with the managed dispatchers.

When a check fails it either drains the node (no new jobs are scheduled; the reason is visible in sinfo -R and scontrol show node) or logs a warning (no scheduling impact). A drain during the prolog also requeues the in-flight job. Check results are surfaced in each worker pod's logfile container (view with kubectl logs); prolog and epilog output is additionally written to /var/log/prolog/<job-id>.log and /var/log/epilog/<job-id>.log on the node where they run.
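
As a rough sketch, the commands named above can be wrapped in one helper to collect the evidence for a single node. The function name `show_check_failure` is hypothetical, and passing `logfile` to `-c` assumes the container is literally named that way; substitute your own node and worker pod names.

```shell
# Sketch: gather health-check evidence for one node.
# Usage: show_check_failure <node-name> <worker-pod>
show_check_failure() {
  # Drain reason as recorded by Slurm
  sinfo -R --nodes="$1"
  scontrol show node "$1" | grep -i 'Reason='

  # Most recent check output from the worker pod
  # (assumes the container is named "logfile")
  kubectl logs -n slurm "$2" -c logfile --tail=50
}
```

Run the `sinfo` and `scontrol` parts from a host where the Slurm client commands are available, and the `kubectl` part from a machine with access to the backing cluster.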

Prerequisites

A running Managed Slurm cluster (see Quickstart)
kubectl configured with access to the CMK cluster backing the Slurm cluster. Since the name of the backing CMK cluster matches the name of your Slurm cluster, the following command will give kubectl the correct credentials:
```
crusoe kubernetes clusters get-credentials <slurm-cluster-name>
```

Get Your Cluster Name

Run the following command to find the SlurmCluster name used in the health-check object names:

```
kubectl get slurmclusters -n slurm
```

Example output:

```
NAMESPACE   NAME                        AGE
slurm       my-slurm-cluster            4h25m
```
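
The cluster name is the prefix of the two health-check objects, `<cluster>-defaults` and `<cluster>-custom`. A minimal sketch of deriving them in a script follows; both function names are hypothetical, and `cluster_name` assumes the namespace holds exactly one SlurmCluster.

```shell
# Print the name of the first SlurmCluster in the slurm namespace.
cluster_name() {
  kubectl get slurmclusters -n slurm -o jsonpath='{.items[0].metadata.name}'
}

# Print the two health-check object names for a given cluster name.
schc_object_names() {
  printf '%s-defaults\n%s-custom\n' "$1" "$1"
}
```

For the example output above, `schc_object_names my-slurm-cluster` prints `my-slurm-cluster-defaults` and `my-slurm-cluster-custom`, one per line.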

Limitations

Add your own checks to the <cluster>-custom object only. The <cluster>-defaults object is Crusoe-managed and may be overwritten.
Each object holds at most 50 scripts, and each script's source is limited to 16 KB.
Script names must match NN-name (a two-digit prefix followed by lowercase letters, digits, and hyphens) and be unique within the object. The prefix only orders your scripts relative to each other; they always run after the built-in checks.
To drain a node from a check, call scontrol directly (see Adding a Check).
Changes take a few seconds to propagate to all nodes and apply on the next run of the affected phase.
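
The naming and size limits can be checked locally before a script is pasted into the custom object. The sketch below is one way to do that; the function names are hypothetical, the regular expression is an interpretation of the NN-name rule rather than the pattern the CRD actually enforces, and 16 KB is assumed to mean 16384 bytes.

```shell
# Sketch: pre-flight checks for a custom health-check script.

# Succeeds if the name is two digits, a hyphen, then lowercase letters,
# digits, or hyphens (assumed pattern, see lead-in).
valid_script_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9]{2}-[a-z0-9-]+$'
}

# Succeeds if the file is at most 16384 bytes.
valid_script_size() {
  size="$(wc -c < "$1")"
  [ "$((size))" -le 16384 ]
}
```

For example, `valid_script_name 60-scratch-mount` succeeds, while `valid_script_name 60_Scratch` fails.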

What the Built-in Checks Cover

| Phase | When it runs | What it checks (examples) |
| --- | --- | --- |
| Periodic (`checks`) | Every 5 min on idle nodes | `/home` and `/data` NFS mount, writability, and free space; GPU count vs. registered Slurm GRES; DCGM GPU health; NVIDIA driver persistence mode; InfiniBand port link state; system load, available memory/swap, local `/tmp` and `/dev/shm` space; kernel errors in `dmesg` (MCE, EDAC, disk I/O, NFS timeouts) |
| Pre-job (`prolog`) | Before a job, on each allocated node | Required `SLURM_*` job variables present; residual GPU memory below threshold; `CUDA_VISIBLE_DEVICES` count matches allocated GPUs; DCGM level-1 diagnostics |
| Post-job (`epilog`) | After a job, on each allocated node | ECC double-bit errors recorded during the job; prune stale Docker containers; reset GPUs left with residual memory; kill leftover job processes and clean up orphaned shared memory; then re-run the periodic checks |

For the full list of built-in checks — including the exact failure trigger, whether it warns or drains the node, and when each is skipped — see the Default Check Reference at the bottom of this page.

Managing Your Checks

The health-check suite is managed by the SlurmClusterHealthCheck CRD (short name schc). Each cluster has two objects in the slurm namespace:

<cluster>-defaults — the Crusoe-provided checks. Managed by CSO; do not edit.
<cluster>-custom — empty by default and never overwritten by CSO. Add your own checks here.

Your scripts run automatically after the built-in scripts for the matching phase. Each script runs independently, so a crash or unhandled error in one is logged as a warning and does not affect the other checks.
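
To make the isolation behavior concrete, here is a simplified model of a dispatcher loop. It is not Crusoe's implementation: the function name `run_phase`, the directory layout, and the log format are all hypothetical, and it ignores the prolog-specific handling of `exit 1` described under Draining and exit codes.

```shell
# Illustrative model only: run every NN-name script in a directory in
# lexical order, and treat a failing script as a warning so that the
# remaining scripts still run.
run_phase() {
  dir="$1"
  for script in "$dir"/[0-9][0-9]-*; do
    [ -f "$script" ] || continue   # pattern matched nothing
    if ! sh "$script"; then
      echo "WARN $(basename "$script") exited non-zero" >&2
    fi
  done
  return 0
}
```

Because the glob expands in sorted order, a script named `20-...` runs before one named `30-...`, which is the role the two-digit prefix plays.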

Tip: Use kubectl explain slurmclusterhealthchecks.spec to see all available fields and their descriptions directly from the cluster.

Adding a Check

Add a script by editing the -custom object:

```
kubectl edit schc <cluster>-custom -n slurm
```

The example below defines one script for each phase:

```
apiVersion: slurm.crusoe.ai/v1alpha1
kind: SlurmClusterHealthCheck
metadata:
  name: <cluster>-custom
  namespace: slurm
spec:
  scripts:
    # Periodic check: runs every 5 minutes on idle nodes
    - name: "60-scratch-mount"
      type: checks
      enabled: true
      source: |
        #!/usr/bin/env bash
        if ! mountpoint -q /mnt/scratch; then
          echo "$(date -u +%FT%TZ) scratch_mount FAIL host=$(hostname -s) reason=not-mounted" >&2
          scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="scratch: /mnt/scratch not mounted"
        fi
        exit 0

    # Pre-job check: runs before every job on each allocated node
    - name: "60-dataset-present"
      type: prolog
      enabled: true
      source: |
        #!/usr/bin/env bash
        if [[ ! -r /data/shared/dataset.bin ]]; then
          echo "$(date -u +%FT%TZ) dataset_present FAIL host=$(hostname -s) job=${SLURM_JOB_ID} reason=dataset-unreadable" >&2
          scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="prolog: shared dataset unreadable"
          exit 1  # non-zero exit also requeues the job so it can run on a healthy node
        fi
        exit 0

    # Post-job check: runs after every job on each allocated node
    - name: "60-scratch-cleanup"
      type: epilog
      enabled: true
      source: |
        #!/usr/bin/env bash
        rm -rf "/mnt/scratch/job-${SLURM_JOB_ID}" || \
          echo "$(date -u +%FT%TZ) scratch_cleanup WARN host=$(hostname -s) reason=cleanup-failed" >&2
        exit 0  # epilog must always exit 0
```

Field reference:

name — script filename in the form NN-name (a two-digit prefix, then lowercase letters, digits, and hyphens). The prefix orders your scripts relative to each other within a phase; any number works, since your scripts always run after the built-in checks.
type — one of checks, prolog, or epilog.
enabled — set to false to keep a script defined but exclude it from execution.
source — the shell script. Write to stderr for log output (surfaced in the worker pod's logfile container via kubectl logs).

Available environment variables:

Periodic (checks) scripts run with no job context — no SLURM_JOB_* variables are set.
prolog and epilog scripts run with the standard Slurm job environment, including SLURM_JOB_ID, SLURM_JOB_USER, SLURM_JOB_UID, SLURM_JOB_NODELIST, and — for GPU jobs — CUDA_VISIBLE_DEVICES and SLURM_JOB_GPUS.
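
A custom script can use these variables to adapt to the phase it runs in. The helpers below are a minimal sketch with hypothetical names; `has_job_context` assumes, per the list above, that `SLURM_JOB_ID` is set only for prolog and epilog scripts.

```shell
# Sketch: helpers a custom script might use to read its job context.

# Succeeds when the script is running for a job (prolog or epilog),
# fails when SLURM_JOB_ID is unset or empty.
has_job_context() {
  [ -n "${SLURM_JOB_ID:-}" ]
}

# Print how many devices CUDA_VISIBLE_DEVICES lists (0 when unset or empty).
visible_gpu_count() {
  if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo 0
  else
    # One comma-separated entry per line, then count the non-empty lines
    printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
  fi
}
```

With `CUDA_VISIBLE_DEVICES=0,1,2,3`, `visible_gpu_count` prints `4`.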

Draining and exit codes:

To drain the node, call scontrol directly: scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="<short reason>" (keep the reason under 250 characters).
checks and prolog scripts should exit 0 after draining — the drain is signalled by the scontrol call, not the exit code. In a prolog script, exit 1 instead if you also want the job requeued.
epilog scripts must always exit 0: Slurm drains a node whose epilog exits non-zero, whether or not your script intended a drain.
For a warn-only check, just log to stderr and exit 0.
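
The drain call can be factored into a small helper that also enforces the reason-length limit. This is a sketch under the rules listed above; the function name `drain_self` is hypothetical, and the truncation assumes an ASCII reason string.

```shell
# Sketch: drain the local node with a reason kept under 250 characters.
drain_self() {
  # printf '%.249s' keeps at most the first 249 characters of the reason
  reason="$(printf '%.249s' "$1")"
  scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="$reason"
}
```

A `checks` script could then call `drain_self "scratch: /mnt/scratch not mounted"` and finish with `exit 0`.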

Changes are picked up automatically across all nodes within seconds of saving (no scontrol reconfigure needed) and take effect on the next run of that phase.

Viewing Your Checks

List the health-check objects for your cluster:

```
kubectl get schc -n slurm
```

To see the scripts currently defined in your custom object:

```
kubectl get schc <cluster>-custom -n slurm -o yaml
```

The scripts are mounted on each worker pod under /opt/crusoe/healthcheck, so you can inspect what's running on a node directly:

```
kubectl exec -n slurm <worker-pod> -c slurmd -- ls /opt/crusoe/healthcheck/
```

Removing a Check

Edit the -custom object and delete the script's entry from spec.scripts (or set enabled: false to keep it defined but inactive):

```
kubectl edit schc <cluster>-custom -n slurm
```

The script is removed from the affected nodes within seconds.

Per-Job Task Prolog/Epilog (User-Configured)

Users can attach prolog and epilog scripts to their own job steps with the srun options --task-prolog and --task-epilog. No admin setup is required.

In an sbatch script, pass the options on the srun line (they are srun options, not #SBATCH directives):

```
#!/bin/bash
#SBATCH --nodes=2

srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram
```

Or directly from the command line:

```
srun --task-prolog=/home/alice/setup.sh --task-epilog=/home/alice/teardown.sh ./myprogram
```

Key behaviors:

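Runs once per task (not per node), as the submitting user
Task-prolog stdout is parsed by Slurm: a line starting with print is written to the task's stdout (and so to the job's output file), and a line of the form export NAME=value sets an environment variable for the task
Non-zero exit logs a warning but does not requeue the job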
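
As an illustration, a task prolog such as /home/alice/setup.sh might look like the sketch below. The contents are hypothetical and are written as a function so they can be tried outside Slurm. It relies on standard Slurm behavior in which the task prolog's stdout is read line by line: `export NAME=value` sets a variable for the task, and `print ...` writes the rest of the line to the task's stdout.

```shell
# Hypothetical body of a task prolog. SLURM_JOB_ID and SLURM_PROCID come
# from the task environment that Slurm provides.
task_prolog() {
  # Give each task its own scratch path through an environment variable
  echo "export SCRATCH_DIR=/tmp/job-${SLURM_JOB_ID}-task-${SLURM_PROCID}"
  # Emit one line into the task's stdout
  echo "print task ${SLURM_PROCID} starting"
}
```

In a real script file the two `echo` lines would sit at the top level, below a shebang line.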

Execution Order

When both the managed health-check prolog/epilog and per-job task scripts are configured, each allocated node runs them in this order:

| Step | What runs | Runs as | Scope |
| --- | --- | --- | --- |
| 1 | Prolog dispatcher: built-in periodic checks, then your `checks` scripts | root | Per node |
| 2 | Prolog dispatcher: built-in pre-job checks, then your `prolog` scripts | root | Per node |
| 3 | `--task-prolog` (if set on the job) | Submitting user | Per task |
| 4 | The job itself (`srun ./myprogram`) | Submitting user | Per task |
| 5 | `--task-epilog` (if set on the job) | Submitting user | Per task |
| 6 | Epilog dispatcher: built-in cleanup, then your `epilog` scripts | root | Per node |
| 7 | Epilog dispatcher: built-in periodic checks again, then your `checks` scripts | root | Per node |

Default Check Reference

The built-in checks shipped in the <cluster>-defaults object, grouped by phase. GPU and InfiniBand checks skip cleanly on nodes without that hardware. Thresholds shown are the defaults. To change a default check (disable it, adjust its threshold, or change its failure behavior), contact our support team.

Periodic Checks

Run every 5 minutes on idle nodes.

| Check | What it verifies | On failure | Skipped when |
| --- | --- | --- | --- |
| `10-filesystem` | `/home` is NFS-mounted and writable | Warn: not NFS-mounted, or the write probe fails | Never |
| `11-filesystem-data` | `/data` is NFS-mounted and writable | Warn: not NFS-mounted, or the write probe fails | `/data` is absent on the node |
| `12-filesystem-space` | `/home` has at least 50 GB free | Warn: free space below threshold, or `df` fails | Never |
| `13-filesystem-data-space` | `/data` has at least 50 GB free | Warn: free space below threshold, or `df` fails | `/data` is absent on the node |
| `15-hardware` | GPU count (from Slurm GRES and `nvidia-smi`) matches the expected count (8) | Drain: GPU count below expected. Warn: GRES unavailable or the `nvidia-smi` query fails | GPU portion skipped on non-GPU nodes |
| `20-gpu-dcgm` | DCGM reports all GPUs healthy | Drain: DCGM overall health is a failure. Warn: DCGM reports a warning-level event, or DCGM is unreachable | Non-GPU node |
| `25-network-ib` | All InfiniBand ports are in the `Active` link state | Drain: any port is not `Active` | The node has no InfiniBand ports |
| `30-driver` | NVIDIA persistence mode is enabled | Drain: persistence mode is not enabled | Non-GPU node |
| `35-load` | 1-minute load average is within 2× the CPU count | Warn: load above threshold | Never |
| `40-fs-local` | `/tmp` and `/dev/shm` have at least 1 GB free and 10,000 free inodes | Drain: free space below threshold. Warn: free inodes below threshold | Never |
| `45-memory` | At least 2 GB of memory is available and swap is not in use | Warn: low available memory, or swap in active use | Never |
| `50-dmesg` | Kernel ring buffer is free of fatal hardware errors | Drain: machine-check exception, uncorrectable memory (EDAC) error, disk I/O error, or a recent NFS "server not responding". Warn: correctable memory (EDAC) error | The ring buffer is empty |
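
The same threshold pattern can be reused in a custom check for a path the built-in checks do not cover. The sketch below mirrors the shape of the free-space checks but is not their actual code; the function name `has_free_gb` is hypothetical, and it treats 1 GB as 1024 × 1024 KB.

```shell
# Sketch: succeed when <path> has at least <min_gb> GB free.
# Returns 0 if enough space, 1 if below the threshold, 2 if df fails.
has_free_gb() {
  path="$1"
  min_gb="$2"
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available"
  free_kb="$(df -Pk "$path" 2>/dev/null | awk 'NR == 2 { print $4 }')"
  [ -n "$free_kb" ] || return 2
  [ "$free_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

In a `checks` script this could back a warn-only test: call `has_free_gb /mnt/scratch 50`, write a WARN line to stderr if it fails, and `exit 0` either way.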

Pre-job Checks (Prolog)

Run on each allocated node before the job starts. A failure drains the node and requeues the job.

| Check | What it verifies | On failure | Skipped when |
| --- | --- | --- | --- |
| `10-job-env` | Required job variables (`SLURM_JOB_ID`, `SLURM_JOB_USER`, `SLURM_JOB_UID`) are set | Drain + requeue: any variable is missing | Never |
| `20-gpu-residual` | Assigned GPUs hold less than 100 MiB of residual memory | Drain + requeue: any assigned GPU is over threshold | Non-GPU node, or no GPUs assigned to the job |
| `25-cuda-visible` | `CUDA_VISIBLE_DEVICES` count matches the GPUs allocated to the job | Drain + requeue: visible count is below the allocated count | Non-GPU node, no GPUs assigned, or the allocated count cannot be determined |
| `30-dcgm-diag` | DCGM level-1 diagnostic passes on the assigned GPUs | Drain + requeue: the diagnostic fails or times out | Non-GPU node, no GPUs assigned, or a diagnostic is already running on the node |

Post-job Checks (Epilog)

Run on each allocated node after the job ends. Epilog never fails the job via its exit code — health failures drain the node via scontrol instead.

| Check | What it does | On failure | Skipped when |
| --- | --- | --- | --- |
| `10-dcgm-stats` | Checks job-scoped DCGM stats for ECC double-bit errors | Drain: one or more ECC double-bit errors were recorded during the job | Non-GPU node, no job ID, or stats are unavailable |
| `20-containers` | Prunes stopped Docker containers idle for more than 1 hour | Warn: the prune command fails | Never |
| `30-gpu-reset` | Resets GPUs that still hold residual memory after the job | Warn: the reset fails | Non-GPU node, or no GPUs assigned |
| `40-processes-ipc` | Kills leftover job processes and removes orphaned shared-memory segments | Warn: lingering processes were found and killed (informational) | No job context, or the job user is a system account (UID < 1000) |

Next Steps

Quickstart — Create your first Slurm cluster
User Management — Add users and groups, manage partitions
Managing Partitions — Create and manage partitions in your Slurm cluster
Slurm Metrics — Monitor cluster health and performance
Advanced: Kubernetes Operations — Direct kubectl access and CRD-level configuration
For Slurm command reference, see the official Slurm documentation
