
Burn-in validation

Every Crusoe node runs up to 30 validation tests before it becomes available for you to use. Tests cover three domains (node, GPU, and fabric) and escalate in this order: sanity → validation → performance → stress.

If a node fails a test, it doesn't enter the fleet. Crusoe triages the failure, repairs or replaces the affected hardware, and re-validates the node before deployment.
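
The gating rule above can be sketched as a small runner. This is an illustrative sketch only, not Crusoe's implementation: the `BurnInTest` and `run_burn_in` names are hypothetical, and stopping at the first failure is an assumption (the text only says that a failing node does not enter the fleet).

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage order taken from the text: sanity → validation → performance → stress.
STAGE_ORDER = ["sanity", "validation", "performance", "stress"]


@dataclass
class BurnInTest:
    name: str                # hypothetical label for the test
    stage: str               # one of STAGE_ORDER
    run: Callable[[], bool]  # returns True on pass, False on failure


def run_burn_in(tests: List[BurnInTest]) -> dict:
    """Run tests in escalating stage order and report fleet readiness."""
    # sorted() is stable, so tests within one stage keep their input order.
    ordered = sorted(tests, key=lambda t: STAGE_ORDER.index(t.stage))
    executed = []
    for test in ordered:
        executed.append(test.name)
        if not test.run():
            # Assumption: stop at the first failure and hand off to triage.
            return {"fleet_ready": False, "failed": test.name, "executed": executed}
    return {"fleet_ready": True, "failed": None, "executed": executed}
```

Running the cheap sanity checks first means a node with a basic fault is caught before any long performance or stress run is spent on it.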

Test domains

| DOMAIN | APPLIES TO | PURPOSE |
| --- | --- | --- |
| Node | All nodes | CPU, memory, storage, networking, and PCIe. |
| NVIDIA GPU | NVIDIA nodes | DCGM diagnostics, ECC, PCIe link, and power. |
| AMD GPU | AMD nodes | AGFHC diagnostics, HBM, compute, and performance. |
| Fabric | InfiniBand and RoCE nodes | RDMA bandwidth and collective communications. |

Node tests

The following tests are run on every node regardless of GPU configuration:

Sanity

| TEST | WHAT IT CHECKS |
| --- | --- |
| node/global/sanity/boot | Node boots successfully; basic system services are running. |
| node/global/sanity/pcie | Node's expected GPUs, NICs, and NVMe controllers are enumerated and visible. |
| node/global/sanity/registry | Node internet connectivity is established and validated. |

Validation

| TEST | WHAT IT CHECKS |
| --- | --- |
| node/global/validation/nic | Node's Ethernet interface configuration, link state, and driver settings are validated. |

Performance

| TEST | WHAT IT CHECKS |
| --- | --- |
| node/global/perf/cpu | CPU throughput (sysbench). |
| node/global/perf/memory | Host memory bandwidth (sysbench). |
| node/global/perf/storage-block-bw | Block storage sequential read/write throughput. |
| node/global/perf/storage-block-iops | Block storage random IOPS. |
| node/global/perf/storage-block-lat | Block storage I/O latency. |
| node/global/perf/storage-ephemeral-bw | Local NVMe sequential read/write throughput. |
| node/global/perf/storage-ephemeral-iops | Local NVMe random IOPS. |
| node/global/perf/storage-ephemeral-lat | Local NVMe I/O latency. |

Stress

| TEST | WHAT IT CHECKS |
| --- | --- |
| node/global/stress/cpu | Sustained CPU stability and thermal behavior under load. |
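
The test IDs listed above appear to follow a `domain/scope/stage/name` pattern, with `perf` standing for the performance stage. That structure is inferred from the listed IDs rather than stated anywhere on this page, so the field names and the `parse_test_id` helper below are hypothetical. A minimal sketch for splitting an ID into its parts, for example to group results by stage:

```python
from typing import NamedTuple


class BurnInTestId(NamedTuple):
    domain: str  # e.g. "node", "gpu", "fabric"
    scope: str   # e.g. "global", "nv", "amd"
    stage: str   # "sanity", "validation", "performance", or "stress"
    name: str    # e.g. "storage-block-iops"


# The IDs abbreviate the performance stage as "perf".
STAGE_ALIASES = {"perf": "performance"}


def parse_test_id(test_id: str) -> BurnInTestId:
    """Split a test ID of the assumed form domain/scope/stage/name."""
    parts = test_id.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 path segments, got {len(parts)}: {test_id!r}")
    domain, scope, stage, name = parts
    return BurnInTestId(domain, scope, STAGE_ALIASES.get(stage, stage), name)
```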

NVIDIA GPU tests

The following tests are run on all NVIDIA GPU nodes:

| TEST | WHAT IT CHECKS |
| --- | --- |
| gpu/nv/sanity/health | GPU presence, driver health, ECC errors, thermal throttling events, and prior XID errors, checked with DCGMI level 2. |
| gpu/nv/validation/health | GPU memory, compute, and subsystem validation using DCGMI level 4. |
| gpu/nv/validation/ecc | ECC is enabled with no uncorrectable errors. |
| gpu/nv/validation/pcie | PCIe link width and speed match rated values for the SKU type; any link below spec is flagged critical. |
| gpu/nv/validation/power | GPU reaches and sustains rated TDP under continuous compute load (DCGMI). |
| gpu/nv/stress/power | GPU stability under pulsed load (20 ramp-up/ramp-down cycles); catches marginal power-delivery hardware errors that steady-state testing misses. |
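
To illustrate the kind of comparison gpu/nv/validation/pcie describes, here is a minimal sketch that reads the negotiated link state from Linux sysfs and compares it with rated values. It is not the actual test: the function names are hypothetical, the rated speed and width must come from the caller (this page does not list per-SKU values), and the exact text of `current_link_speed` can vary by kernel version.

```python
import re
from pathlib import Path
from typing import Tuple


def parse_gts(speed: str) -> float:
    """Extract the GT/s figure from a sysfs string such as '16.0 GT/s PCIe'."""
    match = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*GT/s", speed)
    if match is None:
        raise ValueError(f"unrecognized link speed: {speed!r}")
    return float(match.group(1))


def link_meets_spec(cur_speed: str, cur_width: str,
                    rated_gts: float, rated_width: int) -> bool:
    """Return True when the negotiated link is at or above the rated values."""
    return parse_gts(cur_speed) >= rated_gts and int(cur_width) >= rated_width


def read_link(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> Tuple[str, str]:
    """Read the negotiated speed and width for a PCI address like '0000:17:00.0'."""
    dev = Path(sysfs_root) / bdf
    speed = (dev / "current_link_speed").read_text().strip()
    width = (dev / "current_link_width").read_text().strip()
    return speed, width
```

Some GPUs lower their link speed when idle to save power, so a real check would likely sample the link while the device is under load.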

AMD GPU tests

The following tests are run on all AMD GPU nodes:

| TEST | WHAT IT CHECKS |
| --- | --- |
| gpu/amd/validation/agfhc | AGFHC Level 5: full-node scan of compute units, HBM controllers, and memory subsystem. |
| gpu/amd/stress/hbm | HBM sustained bandwidth; catches memory controller issues that survive compute-only stress tests. |
| gpu/amd/stress/compute | Sustained arithmetic load across AMD GPUs. |
| gpu/amd/stress/full | Combined compute, memory, and thermal load on a single GPU. |
| system/amd/stress/mixed | Mixed workload across GPU, CPU, and host memory simultaneously; matches real training load profiles. |
| gpu/amd/stress/hpl | ROCm HPL benchmark; exercises the full HIP stack, memory bandwidth, and interconnect. |

Fabric tests

The following tests are run on all InfiniBand and RoCE-enabled nodes:

Global (all fabric nodes)

| TEST | WHAT IT CHECKS |
| --- | --- |
| fabric/global/sanity/link | InfiniBand or RoCE links are up and not in a degraded state. |
| fabric/global/perf/ethernet | Frontend Ethernet throughput between node pairs, measured with iperf3; validates the NIC and switch fabric. |

NVIDIA InfiniBand

The following tests are layered to isolate local hardware failures from fabric failures:

| TEST | WHAT IT CHECKS |
| --- | --- |
| fabric/nv/perf/nvlink | Single node: NVLink bandwidth using local IP as RDMA target. |
| fabric/nv/perf/gpu-p2p | Single node: GPU-to-GPU bandwidth using NCCL. |
| fabric/nv/perf/rdma-bw | Multi-node: InfiniBand fabric bandwidth, error rates, and GPU-Direct RDMA. |
| fabric/nv/stress/collective | Multi-node: NCCL all-reduce over InfiniBand backend network. |

AMD RoCE

| TEST | WHAT IT CHECKS |
| --- | --- |
| fabric/amd/perf/rdma-bw | Multi-node: GPU-Direct RDMA bandwidth across all 8 NICs per node; includes fault attribution to avoid pulling healthy nodes. |
| fabric/amd/stress/collective | Multi-node: RCCL all-reduce over RoCE; required to pass before distributed training is considered fleet-ready. |
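
All-reduce results like those produced by the collective tests above are commonly reported as "bus bandwidth", a convention defined in the nccl-tests performance notes: the algorithm bandwidth (message size divided by time) is scaled by 2(n-1)/n for n ranks so that the figure is comparable across cluster sizes. Whether Crusoe's tests report this exact metric is an assumption; the sketch below only shows the arithmetic, and the function name is hypothetical.

```python
def allreduce_bus_bandwidth(message_bytes: int, seconds: float, ranks: int) -> float:
    """Return all-reduce bus bandwidth in GB/s using the nccl-tests convention."""
    if ranks < 2:
        raise ValueError("all-reduce bus bandwidth needs at least 2 ranks")
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    # Algorithm bandwidth: bytes moved per second from one rank's point of view.
    algbw_gbps = message_bytes / seconds / 1e9
    # Scale by 2(n-1)/n, the all-reduce correction factor.
    return algbw_gbps * 2 * (ranks - 1) / ranks
```

For example, an 8 GB all-reduce that completes in one second across 16 ranks corresponds to 8 × 2 × 15/16 = 15 GB/s of bus bandwidth.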

Cluster handoff

Beyond automated burn-in, Crusoe validates cluster deployments with real distributed workloads before handoff, running multi-node PyTorch jobs to confirm the full software stack behaves correctly under customer-representative load.
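
As a rough idea of what the smallest multi-node PyTorch check might look like, the sketch below has every rank contribute `rank + 1` to an all-reduce and verifies the sum. It is not Crusoe's handoff workload, which this page does not describe; the script structure and the `expected_sum` helper are illustrative assumptions. It relies on a launcher such as `torchrun` to set `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`.

```python
import os


def expected_sum(world_size: int) -> int:
    """Sum of 1..world_size, the value every rank should see after all-reduce."""
    return world_size * (world_size + 1) // 2


def main() -> None:
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    use_gpu = torch.cuda.is_available()
    if use_gpu:
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    # "nccl" for GPU runs (ROCm builds of PyTorch map this name to RCCL),
    # "gloo" as a CPU fallback.
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    value = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(value, op=dist.ReduceOp.SUM)

    ok = value.item() == float(expected_sum(world_size))
    print(f"rank {rank}/{world_size}: all-reduce {'ok' if ok else 'MISMATCH'}")
    dist.destroy_process_group()
    if not ok:
        raise SystemExit(1)


# Only run when started by a launcher that sets RANK (torchrun does).
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

A launch along the lines of `torchrun --nnodes=2 --nproc-per-node=8 check.py` with suitable rendezvous options would start one process per GPU; the file name and node counts here are placeholders.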
