Clusters

Use dstack to provision multi-node clusters with an InfiniBand interconnect, so that distributed workloads running across multiple GPU instances get high-speed networking between nodes.

Prerequisites

  • A dstack server with a crusoe or kubernetes backend configured—see the Quickstart
  • Quota for InfiniBand GPU instance types (for example, h100-80gb-sxm-ib.8x)

Create a cluster fleet

Setting placement: cluster on a fleet ensures all instances are interconnected.

On the crusoe backend, dstack automatically creates an InfiniBand (IB) partition and provisions the instances with IB networking, provided the selected instance type supports it. You don't need to perform any manual network setup.

Create crusoe-fleet.dstack.yml:

type: fleet
name: crusoe-fleet

nodes: 2
placement: cluster

backends: [crusoe]

resources:
  gpu: A100:80GB:8

Apply the configuration:

dstack apply -f crusoe-fleet.dstack.yml

A range, such as nodes: 0..2, provisions cluster nodes on demand instead of upfront.
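
To make the range syntax concrete, the sketch below splits a nodes value into its lower and upper bounds. It is an illustration only, not dstack's own parser: the function name parse_nodes_range is hypothetical, and it assumes a value is either a single number, a min..max pair, or an open-ended min.. form, with a missing upper bound treated as unbounded.

```shell
# Illustrative only: this is NOT dstack's parser.
# Prints "<min> <max>" for a nodes value such as 2, 0..2, or 0..
# A missing upper bound is printed as "unbounded" (an assumption).
parse_nodes_range() {
  case "$1" in
    *..*)
      min=${1%%..*}   # text before the first ".."
      max=${1#*..}    # text after the first ".."
      ;;
    *)
      min=$1          # a single number is both the minimum and the maximum
      max=$1
      ;;
  esac
  echo "$min ${max:-unbounded}"
}
```

For example, parse_nodes_range 0..2 prints "0 2": the fleet may start with no instances and grow to at most two.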

On the Crusoe Managed Kubernetes (CMK) backend, the same fleet selects already-provisioned nodes from your node pool instead of creating instances.

Create cmk-fleet.dstack.yml:

type: fleet
name: crusoe-fleet

placement: cluster
nodes: 0..

backends: [kubernetes]

resources:
  # Specify requirements to filter nodes
  gpu: 8

Apply the configuration:

dstack apply -f cmk-fleet.dstack.yml

Run distributed tasks

A task with nodes set to a value greater than 1 runs across that many nodes of a cluster fleet.

dstack sets the following environment variables on each node so distributed launchers work without additional configuration:

| Variable | Description |
|---|---|
| DSTACK_NODE_RANK | Rank of the current node (0-indexed) |
| DSTACK_NODES_NUM | Total number of nodes |
| DSTACK_MASTER_NODE_IP | IP address of the master node |
| DSTACK_GPUS_PER_NODE | GPUs per node |
| DSTACK_GPUS_NUM | Total GPUs across the run |
| DSTACK_MPI_HOSTFILE | Path to a pre-populated MPI hostfile for mpirun |
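
As a rough sketch of how these variables relate, the helpers below derive the number of worker processes and the first global rank on a node, assuming one process per GPU and the same GPU count on every node. The function names are hypothetical; only the DSTACK_* variable names come from the table above.

```shell
# Hypothetical helpers; they assume one process per GPU and identical nodes.

# Total number of worker processes across the run.
dstack_world_size() {
  echo $((DSTACK_NODES_NUM * DSTACK_GPUS_PER_NODE))
}

# Global rank of the first process on the current node.
dstack_first_global_rank() {
  echo $((DSTACK_NODE_RANK * DSTACK_GPUS_PER_NODE))
}

# Succeeds only if nodes x GPUs-per-node equals the reported total GPU count.
dstack_check_gpu_counts() {
  [ "$(dstack_world_size)" -eq "$DSTACK_GPUS_NUM" ]
}
```

With two nodes of eight GPUs each, dstack_world_size prints 16, and dstack_first_global_rank prints 8 on the node with rank 1.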

Example: torchrun training job

The following task launches a torchrun distributed training job across two nodes:

type: task
name: train-distrib

nodes: 2

commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      multinode.py

resources:
  gpu: A100:80GB:8
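
To see what each node ends up running, the following sketch prints the torchrun command line with the variables expanded instead of executing it. The helper name print_torchrun_cmd is hypothetical, and any values you export when trying it locally (such as the IP address in the usage note) are placeholders, not values dstack guarantees.

```shell
# Dry run: print the torchrun invocation for the current node without executing it.
# The flags mirror the task above; $1 is the training script to launch.
print_torchrun_cmd() {
  echo "torchrun" \
    "--nproc-per-node=$DSTACK_GPUS_PER_NODE" \
    "--node-rank=$DSTACK_NODE_RANK" \
    "--nnodes=$DSTACK_NODES_NUM" \
    "--master-addr=$DSTACK_MASTER_NODE_IP" \
    "$1"
}
```

Running print_torchrun_cmd multinode.py as the first command of the task shows, per node, that only --node-rank differs between nodes. Outside a dstack run, set the four variables by hand (for example DSTACK_MASTER_NODE_IP=10.0.0.1) to preview the output.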

Validate with NCCL tests

Use a distributed task running NCCL's all_reduce_perf to validate InfiniBand bandwidth across the fleet. Use the subsection that matches your backend.

Crusoe VM images come with HPC-X and NCCL topology files pre-installed on the host. Mount them into the container using instance volumes.

Create nccl-tests.dstack.yml:

type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo

commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      sleep infinity
    fi

backends: [crusoe]

resources:
  gpu: A100:80GB:8
  shm_size: 16GB

Apply the configuration:

dstack apply -f nccl-tests.dstack.yml
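
Once the run finishes, one way to check the result is to pull the average bus bandwidth out of the all_reduce_perf log and compare it with a threshold. The sketch below assumes the log ends with a summary line of the form "# Avg bus bandwidth : <value>", as printed by recent nccl-tests builds. The helper names are hypothetical, and any threshold you pass is your own choice rather than a figure from this guide.

```shell
# Reads all_reduce_perf output on stdin and prints the average bus bandwidth (GB/s).
# Assumes a summary line like: "# Avg bus bandwidth    : 92.1234"
avg_busbw() {
  awk '/Avg bus bandwidth/ { print $NF }'
}

# Succeeds if the measured value ($1) is at least the threshold ($2), both in GB/s.
busbw_at_least() {
  awk -v got="$1" -v want="$2" 'BEGIN { exit !(got + 0 >= want + 0) }'
}
```

For example, save the run output to a file, then run avg_busbw < nccl.log and pass the result to busbw_at_least together with the bandwidth you expect for your instance type.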

Note: The example above uses the topology file for a100-80gb-sxm-ib. Set NCCL_TOPO_FILE to match your actual instance type. Topology files for all supported types are available under /etc/crusoe/nccl_topo/ on the host.
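
To locate the right file, a small helper can list the topology files whose names start with your instance type. This is a sketch under the assumption that file names begin with the instance type, as the a100-80gb-sxm-ib example does; the helper name find_topo_files is hypothetical, and nothing is assumed about the rest of the file name.

```shell
# Prints every .xml file in directory $1 whose name starts with instance type $2.
# Prints nothing if no file matches.
find_topo_files() {
  topo_dir=$1
  instance_type=$2
  for f in "$topo_dir"/"$instance_type"*.xml; do
    # Skip the unexpanded pattern when the glob matches nothing.
    [ -e "$f" ] && echo "$f"
  done
  return 0
}
```

Inside a container with the directory mounted, find_topo_files /etc/crusoe/nccl_topo a100-80gb-sxm-ib lists the candidate files; use the matching path as the value of NCCL_TOPO_FILE.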

Troubleshoot cluster and NCCL issues

Use the following table to resolve common cluster and NCCL issues:

| Issue | Resolution |
|---|---|
| Low NCCL bandwidth | Confirm NCCL_TOPO_FILE matches your instance type, and that /opt/hpcx and /etc/crusoe/nccl_topo are mounted. |
| InfiniBand devices not visible on CMK | Make sure privileged: true is set on the task and that the node pool uses an IB-enabled instance type (*-ib.8x). |
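
Both checks in the table can be scripted from inside the container. The sketch below lists the InfiniBand devices the kernel exposes under /sys/class/infiniband (the standard Linux sysfs location) and reports which expected mount paths are missing. The helper names are hypothetical, and the optional directory argument exists only to make the first helper easy to try against a test directory.

```shell
# Lists InfiniBand device names; fails if the sysfs directory does not exist.
list_ib_devices() {
  ib_dir=${1:-/sys/class/infiniband}
  [ -d "$ib_dir" ] || return 1
  ls "$ib_dir"
}

# Prints each given path that does not exist; prints nothing if all are present.
missing_paths() {
  for p in "$@"; do
    [ -e "$p" ] || echo "$p"
  done
  return 0
}
```

For example, run list_ib_devices to confirm that devices are visible, and missing_paths /opt/hpcx /etc/crusoe/nccl_topo to confirm that both instance volumes are mounted.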

For provisioning issues (quota, offers, CMK connectivity), see the Troubleshooting section in the Quickstart.

Next steps