Burn-in validation

Every Crusoe node must pass up to 30 validation tests before it becomes available to you. Tests cover three domains (node, GPU, and fabric) and escalate through four tiers: sanity → validation → performance → stress.

If a node fails a test, it doesn't enter the fleet. Crusoe triages the failure, repairs or replaces the hardware, and re-validates the node before deployment.
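The test names on this page appear to follow a `domain/scope/tier/name` layout (for example, `node/global/sanity/boot`). The sketch below parses a name of that shape, orders tests by tier, and applies the all-tests-must-pass rule. It is an illustration only: `ParsedTestId`, `parse_test_id`, `order_by_tier`, and `admit_to_fleet` are hypothetical names, not part of any Crusoe tooling.

```python
from __future__ import annotations

from dataclasses import dataclass

# Tier order taken from the escalation pattern described above.
# "perf" is how the performance tier is spelled in the test names on this page.
TIER_ORDER = ["sanity", "validation", "perf", "stress"]


@dataclass(frozen=True)
class ParsedTestId:
    domain: str  # e.g. "node", "gpu", "fabric"
    scope: str   # e.g. "global", "nv", "amd"
    tier: str    # one of TIER_ORDER
    name: str    # e.g. "boot", "rdma-bw"


def parse_test_id(raw: str) -> ParsedTestId:
    """Split 'domain/scope/tier/name' into its four parts."""
    parts = raw.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 path segments, got {len(parts)}: {raw!r}")
    domain, scope, tier, name = parts
    if tier not in TIER_ORDER:
        raise ValueError(f"unknown tier {tier!r} in {raw!r}")
    return ParsedTestId(domain, scope, tier, name)


def order_by_tier(raw_ids: list[str]) -> list[str]:
    """Sort test names by tier; the sort is stable, so ties keep input order."""
    return sorted(raw_ids, key=lambda r: TIER_ORDER.index(parse_test_id(r).tier))


def admit_to_fleet(results: dict[str, bool]) -> bool:
    """A node is admitted only if every test it ran passed."""
    return all(results.values())
```

Running the cheaper tiers first means a node that cannot boot or enumerate its devices is caught before time is spent on long stress runs.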

Test domains

DOMAIN	APPLIES TO	PURPOSE
Node	All nodes	CPU, memory, storage, networking, and PCIe.
NVIDIA GPU	NVIDIA nodes	DCGM diagnostics, ECC, PCIe link, and power.
AMD GPU	AMD nodes	AGFHC diagnostics, HBM, compute, and performance.
Fabric	InfiniBand and RoCE nodes	RDMA bandwidth and collective communications.
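As a rough illustration of the "Applies to" column, the sketch below selects the domains that would run on a given node. The function name, the vendor strings, and the `has_rdma_fabric` flag are assumptions made for this example; the page does not describe how Crusoe actually selects tests.

```python
from __future__ import annotations

from typing import Optional


def applicable_domains(gpu_vendor: Optional[str], has_rdma_fabric: bool) -> list[str]:
    """Return the test domains from the table above that apply to one node.

    gpu_vendor: "nvidia", "amd", or None for a node without GPUs (hypothetical values).
    has_rdma_fabric: True if the node is attached to InfiniBand or RoCE.
    """
    domains = ["Node"]  # node tests apply to all nodes
    if gpu_vendor == "nvidia":
        domains.append("NVIDIA GPU")
    elif gpu_vendor == "amd":
        domains.append("AMD GPU")
    elif gpu_vendor is not None:
        raise ValueError(f"unknown GPU vendor: {gpu_vendor!r}")
    if has_rdma_fabric:
        domains.append("Fabric")
    return domains
```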

Node tests

The following tests are run on every node regardless of GPU configuration:

Sanity

TEST	WHAT IT CHECKS
`node/global/sanity/boot`	Node boots successfully; basic system services are running.
`node/global/sanity/pcie`	Node's expected GPUs, NICs, and NVMe controllers are enumerated and visible.
`node/global/sanity/registry`	Node internet connectivity is established and validated.

Validation

TEST	WHAT IT CHECKS
`node/global/validation/nic`	Ethernet interface configuration, link state, and driver settings are valid.

Performance

TEST	WHAT IT CHECKS
`node/global/perf/cpu`	CPU throughput (sysbench).
`node/global/perf/memory`	Host memory bandwidth (sysbench).
`node/global/perf/storage-block-bw`	Block storage sequential read/write throughput.
`node/global/perf/storage-block-iops`	Block storage random IOPS.
`node/global/perf/storage-block-lat`	Block storage I/O latency.
`node/global/perf/storage-ephemeral-bw`	Local NVMe sequential read/write throughput.
`node/global/perf/storage-ephemeral-iops`	Local NVMe random IOPS.
`node/global/perf/storage-ephemeral-lat`	Local NVMe I/O latency.

Stress

TEST	WHAT IT CHECKS
`node/global/stress/cpu`	Sustained CPU stability and thermal behavior under load.

NVIDIA GPU tests

The following tests are run on all NVIDIA GPU nodes:

TEST	WHAT IT CHECKS
`gpu/nv/sanity/health`	GPU presence, driver health, ECC errors, thermal throttling events, and prior XID errors, using DCGMI level 2 diagnostics.
`gpu/nv/validation/health`	GPU memory, compute, and subsystem validation, using DCGMI level 4 diagnostics.
`gpu/nv/validation/ecc`	ECC is enabled and no uncorrectable errors are reported.
`gpu/nv/validation/pcie`	PCIe link width and speed match rated values for the SKU type; any link below spec is flagged critical.
`gpu/nv/validation/power`	GPU reaches and sustains rated TDP under continuous compute load (DCGMI).
`gpu/nv/stress/power`	GPU stability under pulsed load (20 ramp-up/ramp-down cycles); catches marginal power delivery hardware errors that steady-state testing misses.
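To make the PCIe link check concrete, here is a minimal sketch that flags GPUs whose current link generation or width is below the maximum the device reports. It uses the `nvidia-smi --query-gpu` fields `pcie.link.gen.current`, `pcie.link.gen.max`, `pcie.link.width.current`, and `pcie.link.width.max`. Two assumptions to note: the actual test compares against rated values for the SKU, which this sketch approximates with the device-reported maximum, and the helper names are hypothetical. GPUs can also lower their link generation when idle, so a reading like this is most meaningful while the GPU is under load.

```python
from __future__ import annotations

QUERY_FIELDS = [
    "index",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
]


def build_query_command() -> list[str]:
    """Build the nvidia-smi argument list; pass it to subprocess.run on a GPU node."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def find_degraded_links(csv_text: str) -> list[int]:
    """Return indices of GPUs whose link gen or width is below the reported max."""
    degraded = []
    for line in csv_text.strip().splitlines():
        index, gen_cur, gen_max, width_cur, width_max = (
            int(field.strip()) for field in line.split(",")
        )
        if gen_cur < gen_max or width_cur < width_max:
            degraded.append(index)
    return degraded
```

For example, given output where GPU 1 trained at x8 instead of x16, `find_degraded_links` would return `[1]`.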

AMD GPU tests

The following tests are run on all AMD GPU nodes:

TEST	WHAT IT CHECKS
`gpu/amd/validation/agfhc`	AGFHC Level 5—full-node scan of compute units, HBM controllers, and memory subsystem.
`gpu/amd/stress/hbm`	HBM sustained bandwidth; catches memory controller issues that survive compute-only stress tests.
`gpu/amd/stress/compute`	Sustained arithmetic load across AMD GPUs.
`gpu/amd/stress/full`	Combined compute, memory, and thermal load on a single GPU.
`system/amd/stress/mixed`	Mixed workload across GPU, CPU, and host memory simultaneously; matches real training load profiles.
`gpu/amd/stress/hpl`	ROCm HPL benchmark—exercises full HIP stack, memory bandwidth, and interconnect.

Fabric tests

The following tests are run on all InfiniBand and RoCE-enabled nodes:

Global (all fabric nodes)

TEST	WHAT IT CHECKS
`fabric/global/sanity/link`	InfiniBand or RoCE links are up and not in a degraded state.
`fabric/global/perf/ethernet`	Frontend Ethernet throughput between node pairs, measured with iperf3; validates the NIC and switch fabric.
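A throughput check of this kind can be sketched by reading the JSON report that `iperf3 -J` produces for a TCP run, where the receiver-side rate appears under `end.sum_received.bits_per_second`. The pass threshold below (`floor_gbps`) and the function names are hypothetical; the page does not state which thresholds Crusoe applies.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gbit/s from an `iperf3 -J` TCP report."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def meets_floor(gbps: float, floor_gbps: float) -> bool:
    """Pass if measured throughput is at or above the (hypothetical) floor."""
    return gbps >= floor_gbps
```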

NVIDIA InfiniBand

The following tests are layered to isolate local hardware failures from fabric failures:

TEST	WHAT IT CHECKS
`fabric/nv/perf/nvlink`	Single node: NVLink bandwidth using local IP as RDMA target.
`fabric/nv/perf/gpu-p2p`	Single node: GPU-to-GPU bandwidth using NCCL.
`fabric/nv/perf/rdma-bw`	Multi-node: InfiniBand fabric bandwidth, error rates, and GPU-Direct RDMA.
`fabric/nv/stress/collective`	Multi-node: NCCL all-reduce over InfiniBand backend network.
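One way to read the layering: if a single-node test fails, the problem is on the node itself, and only when the single-node tests pass does a multi-node failure point at the fabric. The sketch below encodes that reading. The decision rule and the returned labels are assumptions drawn from the table above, not a description of Crusoe's actual triage logic.

```python
from __future__ import annotations

SINGLE_NODE_TESTS = ("fabric/nv/perf/nvlink", "fabric/nv/perf/gpu-p2p")
MULTI_NODE_TESTS = ("fabric/nv/perf/rdma-bw", "fabric/nv/stress/collective")


def attribute_failure(results: dict[str, bool]) -> str:
    """Classify a result set as 'local-hardware', 'fabric', or 'healthy'.

    `results` maps each of the four test names to True (pass) or False (fail).
    Single-node tests are checked first, so a local fault is never blamed on the fabric.
    """
    if any(not results[test] for test in SINGLE_NODE_TESTS):
        return "local-hardware"
    if any(not results[test] for test in MULTI_NODE_TESTS):
        return "fabric"
    return "healthy"
```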

AMD RoCE

TEST	WHAT IT CHECKS
`fabric/amd/perf/rdma-bw`	Multi-node: GPU-Direct RDMA bandwidth across all 8 NICs per node; includes fault attribution to avoid pulling healthy nodes.
`fabric/amd/stress/collective`	Multi-node: RCCL all-reduce over RoCE—required to pass before distributed training is considered fleet-ready.
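All-reduce benchmarks are commonly reported as bus bandwidth, which the nccl-tests documentation defines as the algorithm bandwidth scaled by 2(n-1)/n for n ranks. The sketch below computes that figure. Whether Crusoe's collective tests report this particular metric is an assumption; the page does not say.

```python
def allreduce_bus_bandwidth(size_bytes: float, seconds: float, ranks: int) -> float:
    """Return all-reduce bus bandwidth in GB/s (1 GB = 1e9 bytes).

    algbw = size / time
    busbw = algbw * 2 * (ranks - 1) / ranks
    """
    if ranks < 2:
        raise ValueError("all-reduce bus bandwidth needs at least 2 ranks")
    algbw = size_bytes / seconds / 1e9
    return algbw * 2 * (ranks - 1) / ranks
```

The scaling factor makes results comparable across rank counts, which is useful when the same test runs on clusters of different sizes.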

Cluster handoff

Beyond automated burn-in, Crusoe validates cluster deployments with real distributed workloads before handoff, running multi-node PyTorch jobs to confirm the full software stack behaves correctly under customer-representative load.
