Clusters
You can use dstack to provision multi-node clusters with InfiniBand interconnect and run distributed workloads across multiple GPU instances.
With dstack, you can:
- Provision a cluster fleet on a Crusoe VM or Crusoe Managed Kubernetes (CMK) backend.
- Run distributed tasks across the fleet.
- Validate InfiniBand performance with NCCL tests.
Prerequisites
- A dstack server with a crusoe or kubernetes backend configured (see the Quickstart)
- Quota for InfiniBand GPU instance types (for example, h100-80gb-sxm-ib.8x)
Create a cluster fleet
Setting placement: cluster on a fleet ensures all instances are interconnected.
On the crusoe backend, dstack automatically creates an InfiniBand (IB) partition and provisions the instances with IB networking, provided the selected instance type supports it. You don't need to perform any manual network setup.
Create crusoe-fleet.dstack.yml:
type: fleet
name: crusoe-fleet
nodes: 2
placement: cluster
backends: [crusoe]
resources:
  gpu: A100:80GB:8
Apply the configuration:
dstack apply -f crusoe-fleet.dstack.yml
A range, such as nodes: 0..2, provisions cluster nodes on demand instead of upfront.
On the Crusoe Managed Kubernetes (CMK) backend, the same fleet selects already-provisioned nodes from your node pool instead of creating instances.
Create cmk-fleet.dstack.yml:
type: fleet
name: crusoe-fleet
placement: cluster
nodes: 0..
backends: [kubernetes]
resources:
  # Specify requirements to filter nodes
  gpu: 8
Apply the configuration:
dstack apply -f cmk-fleet.dstack.yml
Run distributed tasks
A task with nodes set to 2 or more is a distributed task: dstack runs its commands on each of the requested nodes of a cluster fleet.
dstack sets the following environment variables on each node so distributed launchers work without additional configuration:
| Variable | Description |
|---|---|
| DSTACK_NODE_RANK | Rank of the current node (0-indexed) |
| DSTACK_NODES_NUM | Total number of nodes |
| DSTACK_MASTER_NODE_IP | IP address of the master node |
| DSTACK_GPUS_PER_NODE | GPUs per node |
| DSTACK_GPUS_NUM | Total GPUs across the run |
| DSTACK_MPI_HOSTFILE | Path to a pre-populated MPI hostfile for mpirun |
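To make the relationship between these variables concrete, here is a minimal Python sketch that derives a node's place in the run from them. The function name `cluster_layout` and the returned keys are hypothetical, and the contiguous rank numbering assumes every node has the same GPU count; treat it as an illustration rather than dstack's own logic.

```python
import os


def cluster_layout(env=os.environ):
    """Derive this node's position in the run from dstack's variables."""
    node_rank = int(env["DSTACK_NODE_RANK"])
    nodes = int(env["DSTACK_NODES_NUM"])
    gpus_per_node = int(env["DSTACK_GPUS_PER_NODE"])
    gpus_total = int(env["DSTACK_GPUS_NUM"])

    # Assumption: homogeneous nodes, so the total is nodes * GPUs per node.
    if gpus_total != nodes * gpus_per_node:
        raise ValueError("GPU counts are inconsistent across variables")

    # Assumption: global ranks are numbered contiguously, node by node.
    first = node_rank * gpus_per_node
    return {
        "is_master": node_rank == 0,
        "global_ranks": list(range(first, first + gpus_per_node)),
        "world_size": gpus_total,
    }
```

Inside a task, calling `cluster_layout()` with no arguments reads the live environment; passing a plain dictionary is convenient for testing outside a cluster.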
Example: torchrun training job
The following task launches a torchrun distributed training job across two nodes:
type: task
name: train-distrib
nodes: 2
commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      multinode.py
resources:
  gpu: A100:80GB:8
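The task above does not show the contents of multinode.py. As a rough sketch of what such a script could look like, the following reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` variables that torchrun exports to each worker process and performs a single all-reduce as a connectivity check. The helper name `read_torchrun_env` is hypothetical, and a real training script would replace the all-reduce with its own model and data loop.

```python
import os


def read_torchrun_env(env=os.environ):
    """Collect the per-process variables that torchrun exports."""
    return {
        "rank": int(env["RANK"]),              # global rank of this process
        "local_rank": int(env["LOCAL_RANK"]),  # GPU index on this node
        "world_size": int(env["WORLD_SIZE"]),  # total number of processes
    }


def main():
    # Imported lazily so the helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    info = read_torchrun_env()
    torch.cuda.set_device(info["local_rank"])
    # With no init_method given, the rendezvous address is read from the
    # environment that torchrun prepared.
    dist.init_process_group(backend="nccl")

    # Each rank contributes 1; the default all-reduce op is a sum, so the
    # result should equal the world size.
    value = torch.ones(1, device="cuda")
    dist.all_reduce(value)
    if info["rank"] == 0:
        print(f"all_reduce sum: {value.item()} (world size {info['world_size']})")

    dist.destroy_process_group()


# Only run when launched by torchrun, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```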
Validate with NCCL tests
Use a distributed task running NCCL's all_reduce_perf to validate InfiniBand bandwidth across the fleet. The steps differ by backend: Crusoe VMs are covered first, followed by CMK.
Crusoe VM images come with HPC-X and NCCL topology files pre-installed on the host. Mount them into the container using instance volumes.
Create nccl-tests.dstack.yml:
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo
commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi
backends: [crusoe]
resources:
  gpu: A100:80GB:8
  shm_size: 16GB
Apply the configuration:
dstack apply -f nccl-tests.dstack.yml
The example above uses the topology file for a100-80gb-sxm-ib. Set NCCL_TOPO_FILE to match your actual instance type. Topology files for all supported types are available under /etc/crusoe/nccl_topo/ on the host.
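To see which topology files are available for a given instance type, a small helper like the one below can list candidates. It is only a sketch: the function name `find_topo_files` and its `instance_family` parameter are hypothetical, and it assumes file names begin with the instance family, as `a100-80gb-sxm-ib-cloud-hypervisor.xml` does in the example above. Check the directory listing on your host before relying on that pattern.

```python
from pathlib import Path


def find_topo_files(instance_family, topo_dir="/etc/crusoe/nccl_topo"):
    """Return topology files whose names start with the instance family.

    Assumption: file names are prefixed with the instance family, as in
    the a100-80gb-sxm-ib example; verify against the actual directory.
    """
    pattern = f"{instance_family}*.xml"
    return sorted(str(path) for path in Path(topo_dir).glob(pattern))
```

If the helper returns exactly one path, that path is a reasonable candidate for NCCL_TOPO_FILE; if it returns none or several, inspect the directory manually.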
On CMK, HPC-X and the topology files aren't pre-installed in containers, so you need to install them inside the task. You must also set privileged: true so the container can access InfiniBand devices.
Create nccl-tests.dstack.yml:
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
commands:
  # Install NCCL topology files
  # (the most reliable source is to copy them from /etc/crusoe/nccl_topo
  # on a Crusoe-provisioned VM)
  # ...
  # Install and initialize HPC-X
  - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
  - mkdir -p /opt/hpcx
  - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  # Run NCCL tests with mpirun, as in the VM example above
  # ...
# Required for InfiniBand access
privileged: true
backends: [kubernetes]
resources:
  gpu: A100:8
  shm_size: 16GB
The snippet above omits the full mpirun invocation. See the dstack Crusoe guide for the complete CMK NCCL example, including the additional NCCL_IB_* and UCX_NET_DEVICES settings it requires.
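On either backend, all_reduce_perf ends its output with a summary line reporting the average bus bandwidth. The sketch below extracts that number from captured run logs so it can be compared between runs. The function name `avg_bus_bandwidth` is hypothetical, and the pattern assumes the `# Avg bus bandwidth : <value>` summary format printed by upstream nccl-tests; confirm it against your own logs.

```python
import re

# Assumption: the summary line looks like "# Avg bus bandwidth    : <number>",
# as printed by upstream nccl-tests.
_AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def avg_bus_bandwidth(log_text):
    """Return the reported average bus bandwidth, or None if absent."""
    match = _AVG_BUSBW.search(log_text)
    return float(match.group(1)) if match else None
```

A result of `None` usually means the run did not reach the summary, which is itself worth investigating before looking at bandwidth figures.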
Troubleshoot cluster and NCCL issues
Use the following table to resolve common cluster and NCCL issues:
| Issue | Resolution |
|---|---|
| Low NCCL bandwidth | Confirm NCCL_TOPO_FILE matches your instance type, and that /opt/hpcx and /etc/crusoe/nccl_topo are mounted. |
| InfiniBand devices not visible on CMK | Make sure privileged: true is set on the task and that the node pool uses an IB-enabled instance type (*-ib.8x). |
For provisioning issues (quota, offers, CMK connectivity), see the Troubleshooting section in the Quickstart.
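For the InfiniBand visibility issue in the table above, a quick check from inside the container is to list the devices the Linux kernel exposes under `/sys/class/infiniband`. The following sketch does that; the function name `list_ib_devices` is hypothetical, and an empty result only indicates that no devices are visible to the container, not why.

```python
from pathlib import Path


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return the names of InfiniBand devices visible to this process."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # The directory is missing when no InfiniBand devices are exposed.
        return []
    return sorted(entry.name for entry in root.iterdir())


if __name__ == "__main__":
    devices = list_ib_devices()
    print(devices if devices else "No InfiniBand devices visible")
```

If the list is empty on CMK, revisit the privileged setting and the node pool's instance type as described in the table.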
Next steps
- InfiniBand networking - Crusoe's high-speed interconnect
- Distributed tasks and Fleets - dstack's documentation on its distributed computing features