Every Crusoe node runs up to 30 validation tests before it becomes available for you to use. Tests cover three domains—node, GPU, and fabric—and follow this escalating pattern: sanity → validation → performance → stress.
If a node fails a test, it doesn't enter the fleet. Crusoe triages, repairs or replaces, and re-validates nodes before deployment.
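
The escalating pattern can be pictured as a staged gate. The sketch below is illustrative only: `run_burn_in`, the shape of the `tests` mapping, and the choice to stop at the first failing stage are assumptions, not Crusoe's actual orchestration.

```python
from typing import Callable, Dict, List, Tuple

# Stage order taken from the escalating pattern described above.
STAGES = ["sanity", "validation", "performance", "stress"]

# Hypothetical shape: stage name -> list of (test name, check function).
TestPlan = Dict[str, List[Tuple[str, Callable[[], bool]]]]


def run_burn_in(tests: TestPlan) -> Tuple[bool, List[str]]:
    """Run tests stage by stage; return (admitted, names of failed tests)."""
    for stage in STAGES:
        failed = [name for name, check in tests.get(stage, []) if not check()]
        if failed:
            # Assumption: a failing stage stops escalation to later stages,
            # and the node is held back for triage.
            return False, failed
    return True, []


plan: TestPlan = {
    "sanity": [("boot", lambda: True)],
    "validation": [("nic", lambda: False)],
    "stress": [("cpu", lambda: True)],
}
print(run_burn_in(plan))  # (False, ['nic'])
```

Running cheap checks first means an obviously broken node is caught before hours are spent on stress tests.
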
Test domains
| DOMAIN | APPLIES TO | PURPOSE |
|---|---|---|
| Node | All nodes | CPU, memory, storage, networking, and PCIe. |
| NVIDIA GPU | NVIDIA nodes | DCGM diagnostics, ECC, PCIe link, and power. |
| AMD GPU | AMD nodes | AGFHC diagnostics, HBM, compute, and performance. |
| Fabric | InfiniBand and RoCE nodes | RDMA bandwidth and collective communications. |
Node tests
The following tests are run on every node regardless of GPU configuration:
Sanity
| TEST | WHAT IT CHECKS |
|---|---|
| node/global/sanity/boot | Node boots successfully; basic system services are running. |
| node/global/sanity/pcie | Node's expected GPUs, NICs, and NVMe controllers are enumerated and visible. |
| node/global/sanity/registry | Node internet connectivity is established and validated. |
Validation
| TEST | WHAT IT CHECKS |
|---|---|
| node/global/validation/nic | Node's Ethernet interface configuration, link state, and driver settings are validated. |
Performance
| TEST | WHAT IT CHECKS |
|---|---|
| node/global/perf/cpu | CPU throughput (sysbench). |
| node/global/perf/memory | Host memory bandwidth (sysbench). |
| node/global/perf/storage-block-bw | Block storage sequential read/write throughput. |
| node/global/perf/storage-block-iops | Block storage random IOPS. |
| node/global/perf/storage-block-lat | Block storage I/O latency. |
| node/global/perf/storage-ephemeral-bw | Local NVMe sequential read/write throughput. |
| node/global/perf/storage-ephemeral-iops | Local NVMe random IOPS. |
| node/global/perf/storage-ephemeral-lat | Local NVMe I/O latency. |
Stress
| TEST | WHAT IT CHECKS |
|---|---|
| node/global/stress/cpu | Sustained CPU stability and thermal behavior under load. |
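
The test names in these tables share a four-part, slash-separated shape. The sketch below splits a name into its parts, which can be handy for grouping results by stage. The field names `domain`, `scope`, `stage`, and `name` are labels chosen here for illustration; the source does not name the segments.

```python
from typing import NamedTuple


class ParsedTestId(NamedTuple):
    domain: str  # e.g. "node", "gpu", "fabric" (label is an assumption)
    scope: str   # e.g. "global", "nv", "amd" (label is an assumption)
    stage: str   # "sanity", "validation", "perf", or "stress"
    name: str    # e.g. "boot", "storage-block-bw"


def parse_test_id(test_id: str) -> ParsedTestId:
    """Split a test name such as 'node/global/perf/cpu' into its four parts."""
    parts = test_id.split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected four non-empty segments, got {test_id!r}")
    return ParsedTestId(*parts)


parsed = parse_test_id("node/global/perf/storage-block-bw")
print(parsed.stage, parsed.name)  # perf storage-block-bw
```
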
NVIDIA GPU tests
The following tests are run on all NVIDIA GPU nodes:
| TEST | WHAT IT CHECKS |
|---|---|
| gpu/nv/sanity/health | GPU presence, driver health, ECC errors, thermal throttling events, and prior XID errors, using DCGMI level 2. |
| gpu/nv/validation/health | GPU memory, compute, and subsystem validation, using DCGMI level 4. |
| gpu/nv/validation/ecc | ECC is enabled with no uncorrectable errors. |
| gpu/nv/validation/pcie | PCIe link width and speed match rated values for the SKU type; any link below spec is flagged critical. |
| gpu/nv/validation/power | GPU reaches and sustains rated TDP under continuous compute load (DCGMI). |
| gpu/nv/stress/power | GPU stability under pulsed load (20 ramp-up/ramp-down cycles); catches marginal power delivery hardware errors that steady-state testing misses. |
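
As an illustration of the PCIe rule in the table (any link below spec is flagged critical), here is a minimal sketch. `PcieLink` and `check_pcie_link` are hypothetical names, and the rated values per SKU are assumed to come from elsewhere. On NVIDIA systems, the current values could plausibly be read from `nvidia-smi` query fields such as `pcie.link.gen.current` and `pcie.link.width.current`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcieLink:
    gen: int    # PCIe generation, e.g. 5
    width: int  # lane count, e.g. 16


def check_pcie_link(current: PcieLink, rated: PcieLink) -> str:
    """Return 'critical' if the link runs below its rated generation or width."""
    if current.gen < rated.gen or current.width < rated.width:
        return "critical"
    return "ok"


rated = PcieLink(gen=5, width=16)  # hypothetical rated values for one SKU
print(check_pcie_link(PcieLink(gen=5, width=16), rated))  # ok
print(check_pcie_link(PcieLink(gen=5, width=8), rated))   # critical
```
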
AMD GPU tests
The following tests are run on all AMD GPU nodes:
| TEST | WHAT IT CHECKS |
|---|---|
| gpu/amd/validation/agfhc | AGFHC Level 5: full-node scan of compute units, HBM controllers, and memory subsystem. |
| gpu/amd/stress/hbm | HBM sustained bandwidth; catches memory controller issues that survive compute-only stress tests. |
| gpu/amd/stress/compute | Sustained arithmetic load across AMD GPUs. |
| gpu/amd/stress/full | Combined compute, memory, and thermal load on a single GPU. |
| system/amd/stress/mixed | Mixed workload across GPU, CPU, and host memory simultaneously; matches real training load profiles. |
| gpu/amd/stress/hpl | ROCm HPL benchmark; exercises the full HIP stack, memory bandwidth, and interconnect. |
Fabric tests
The following tests are run on all InfiniBand and RoCE-enabled nodes:
Global (all fabric nodes)
| TEST | WHAT IT CHECKS |
|---|---|
| fabric/global/sanity/link | InfiniBand or RoCE links are up and not in a degraded state. |
| fabric/global/perf/ethernet | Frontend Ethernet throughput between node pairs (using iperf3); validates the NIC and switch fabric. |
NVIDIA InfiniBand
The following tests are layered to isolate local hardware failures from fabric failures:
| TEST | WHAT IT CHECKS |
|---|---|
| fabric/nv/perf/nvlink | Single node: NVLink bandwidth using local IP as RDMA target. |
| fabric/nv/perf/gpu-p2p | Single node: GPU-to-GPU bandwidth using NCCL. |
| fabric/nv/perf/rdma-bw | Multi-node: InfiniBand fabric bandwidth, error rates, and GPU-Direct RDMA. |
| fabric/nv/stress/collective | Multi-node: NCCL all-reduce over InfiniBand backend network. |
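
The layering lets a failure be read by where it first appears. The sketch below shows one plausible reading, assuming single-node failures point at local hardware and multi-node failures with clean single-node results point at the fabric. `classify` and its return labels are hypothetical, not Crusoe's triage logic.

```python
from typing import Dict

SINGLE_NODE = ("fabric/nv/perf/nvlink", "fabric/nv/perf/gpu-p2p")
MULTI_NODE = ("fabric/nv/perf/rdma-bw", "fabric/nv/stress/collective")


def classify(results: Dict[str, bool]) -> str:
    """Map pass/fail results for the four layered tests to a coarse verdict."""
    if not all(results[test] for test in SINGLE_NODE):
        # A single-node test failed, so the fault is inside the node.
        return "local-hardware"
    if not all(results[test] for test in MULTI_NODE):
        # Assumption: local paths are clean, so suspect the fabric
        # (this could still be the node's own NIC or cable).
        return "fabric-suspect"
    return "healthy"


results = {
    "fabric/nv/perf/nvlink": True,
    "fabric/nv/perf/gpu-p2p": True,
    "fabric/nv/perf/rdma-bw": False,
    "fabric/nv/stress/collective": True,
}
print(classify(results))  # fabric-suspect
```
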
AMD RoCE
| TEST | WHAT IT CHECKS |
|---|---|
| fabric/amd/perf/rdma-bw | Multi-node: GPU-Direct RDMA bandwidth across all 8 NICs per node; includes fault attribution to avoid pulling healthy nodes. |
| fabric/amd/stress/collective | Multi-node: RCCL all-reduce over RoCE; must pass before distributed training is considered fleet-ready. |
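
The source does not describe how fault attribution works. One common idea, sketched here purely as an assumption, is that a node which fails pairwise tests with several different partners is the likely culprit, while a node that failed only once was probably paired with the bad one. `suspect_nodes` and `min_partners` are hypothetical names.

```python
from typing import Dict, Iterable, List, Set, Tuple


def suspect_nodes(failed_pairs: Iterable[Tuple[str, str]], min_partners: int = 2) -> List[str]:
    """Return nodes that failed with at least `min_partners` distinct partners."""
    partners: Dict[str, Set[str]] = {}
    for a, b in failed_pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return sorted(node for node, seen in partners.items() if len(seen) >= min_partners)


# n1 fails against three different partners; the partners each fail only once.
failures = [("n1", "n2"), ("n1", "n3"), ("n4", "n1")]
print(suspect_nodes(failures))  # ['n1']
```

With this heuristic, `n2`, `n3`, and `n4` stay in service and only `n1` is pulled for triage.
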
Cluster handoff
Beyond automated burn-in, Crusoe validates cluster deployments with real distributed workloads before handoff, running multi-node PyTorch jobs to confirm the full software stack behaves correctly under customer-representative load.
Additional resources