Clusters

You can use dstack to provision multi-node clusters with InfiniBand interconnect and run distributed workloads across multiple GPU instances, with high-speed networking between nodes.

With dstack, you can:

Provision a cluster fleet on a Crusoe VM or Crusoe Managed Kubernetes (CMK) backend.
Run distributed tasks across the fleet.
Validate InfiniBand performance with NCCL tests.

Prerequisites

A dstack server with a crusoe or kubernetes backend configured (see the Quickstart)
Quota for InfiniBand GPU instance types (for example, h100-80gb-sxm-ib.8x)

Create a cluster fleet

Setting placement: cluster on a fleet ensures all instances are interconnected.

On the crusoe backend, dstack automatically creates an InfiniBand (IB) partition and provisions the instances with IB networking, provided the selected instance type supports it. You don't need to perform any manual network setup.

Create crusoe-fleet.dstack.yml:

type: fleet
name: crusoe-fleet

nodes: 2
placement: cluster

backends: [crusoe]

resources:
  gpu: A100:80GB:8

Apply the configuration:

dstack apply -f crusoe-fleet.dstack.yml

A range, such as nodes: 0..2, provisions cluster nodes on demand instead of upfront.

On the Crusoe Managed Kubernetes (CMK) backend, a cluster fleet selects already-provisioned nodes from your node pool instead of creating new instances.

Create cmk-fleet.dstack.yml:

type: fleet
name: crusoe-fleet

placement: cluster
nodes: 0..

backends: [kubernetes]

resources:
  # Specify requirements to filter nodes
  gpu: 8

Apply the configuration:

dstack apply -f cmk-fleet.dstack.yml

Run distributed tasks

A task that sets nodes to a value greater than 1 runs as a distributed task across that many nodes of a cluster fleet.

dstack sets the following environment variables on each node so distributed launchers work without additional configuration:

Variable	Description
`DSTACK_NODE_RANK`	Rank of the current node (0-indexed)
`DSTACK_NODES_NUM`	Total number of nodes
`DSTACK_MASTER_NODE_IP`	IP address of the master node
`DSTACK_GPUS_PER_NODE`	GPUs per node
`DSTACK_GPUS_NUM`	Total GPUs across the run
`DSTACK_MPI_HOSTFILE`	Path to a pre-populated MPI hostfile for `mpirun`
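
Before handing these variables to a launcher, a task command could first check that they are set and consistent. The following is a minimal sketch, not part of dstack: check_dstack_env is a hypothetical helper name, and it assumes every node has the same number of GPUs, so that DSTACK_GPUS_NUM equals DSTACK_NODES_NUM multiplied by DSTACK_GPUS_PER_NODE.

```shell
# Hypothetical helper (not provided by dstack): verify the DSTACK_* variables.
check_dstack_env() {
  for dstack_var in DSTACK_NODE_RANK DSTACK_NODES_NUM DSTACK_MASTER_NODE_IP \
                    DSTACK_GPUS_PER_NODE DSTACK_GPUS_NUM; do
    # Read the value of the variable whose name is stored in dstack_var.
    eval "dstack_val=\${$dstack_var:-}"
    if [ -z "$dstack_val" ]; then
      echo "missing variable: $dstack_var" >&2
      return 1
    fi
  done
  # Assumption: all nodes have the same GPU count.
  dstack_expected=$(( $DSTACK_NODES_NUM * $DSTACK_GPUS_PER_NODE ))
  if [ "$DSTACK_GPUS_NUM" -ne "$dstack_expected" ]; then
    echo "DSTACK_GPUS_NUM=$DSTACK_GPUS_NUM, expected $dstack_expected" >&2
    return 1
  fi
  echo "node $DSTACK_NODE_RANK of $DSTACK_NODES_NUM, master $DSTACK_MASTER_NODE_IP, $DSTACK_GPUS_NUM GPUs total"
}
```

Calling check_dstack_env as the first command of a task prints one summary line per node and fails early if a variable is missing.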

Example: torchrun training job

The following task launches a torchrun distributed training job across two nodes:

type: task
name: train-distrib

nodes: 2

commands:
  - |
    torchrun \
      --nproc-per-node=$DSTACK_GPUS_PER_NODE \
      --node-rank=$DSTACK_NODE_RANK \
      --nnodes=$DSTACK_NODES_NUM \
      --master-addr=$DSTACK_MASTER_NODE_IP \
      multinode.py

resources:
  gpu: A100:80GB:8
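
To see how the variables expand on each node before starting a long job, a command could print the launch line instead of running it. This is only a sketch under stated assumptions: print_torchrun_cmd is a hypothetical helper, and the optional port argument is an addition to the task above. It falls back to 29500, which is torchrun's default for --master-port.

```shell
# Hypothetical dry-run helper: print the torchrun command for this node.
# Usage: print_torchrun_cmd [master_port]
print_torchrun_cmd() {
  torchrun_port="${1:-29500}"  # 29500 is torchrun's default master port
  echo "torchrun" \
    "--nproc-per-node=$DSTACK_GPUS_PER_NODE" \
    "--node-rank=$DSTACK_NODE_RANK" \
    "--nnodes=$DSTACK_NODES_NUM" \
    "--master-addr=$DSTACK_MASTER_NODE_IP" \
    "--master-port=$torchrun_port" \
    "multinode.py"
}
```

On a two-node run, the printed lines should differ only in the --node-rank value.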

Validate with NCCL tests

Use a distributed task running NCCL's all_reduce_perf to validate InfiniBand bandwidth across the fleet. Use the example that matches your backend.

On the crusoe backend, VM images come with HPC-X and NCCL topology files pre-installed on the host. Mount them into the container using instance volumes.

Create nccl-tests.dstack.yml:

type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

volumes:
  - /opt/hpcx:/opt/hpcx
  - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo

commands:
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  - |
    # Only the master node (rank 0) launches mpirun, which starts the ranks on all nodes
    if [ "$DSTACK_NODE_RANK" -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_NTHREADS=4 \
        -x NCCL_NSOCKS_PERTHREAD=8 \
        -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
        -x NCCL_IB_MERGE_VFS=0 \
        -x NCCL_IB_HCA=^mlx5_0:1 \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
    else
      # Worker nodes stay alive so mpirun on the master can start ranks on them
      sleep infinity
    fi

backends: [crusoe]

resources:
  gpu: A100:80GB:8
  shm_size: 16GB

Apply the configuration:

dstack apply -f nccl-tests.dstack.yml

Note: The example above uses the topology file for a100-80gb-sxm-ib. Set NCCL_TOPO_FILE to match your actual instance type. Topology files for all supported types are available under /etc/crusoe/nccl_topo/ on the host.
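
One way to avoid a mismatched NCCL_TOPO_FILE is to resolve the path from the instance family and fail if the file is missing. The sketch below is illustrative only: nccl_topo_file is a hypothetical helper, and it assumes the files follow the <family>-cloud-hypervisor.xml naming seen in the example above, which may not hold for every instance type.

```shell
# Hypothetical helper: print the topology file path for an instance family.
# Usage: nccl_topo_file <family> [topo_dir]
# Assumption: files are named <family>-cloud-hypervisor.xml.
nccl_topo_file() {
  topo_dir="${2:-/etc/crusoe/nccl_topo}"
  topo_path="$topo_dir/$1-cloud-hypervisor.xml"
  if [ -f "$topo_path" ]; then
    echo "$topo_path"
  else
    # List what is available so the right name can be picked by hand.
    echo "no topology file for '$1' in $topo_dir; available files:" >&2
    ls "$topo_dir" >&2 || true
    return 1
  fi
}
```

In the mpirun command, the hard-coded path could then become -x NCCL_TOPO_FILE=$(nccl_topo_file a100-80gb-sxm-ib).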

On CMK, HPC-X and the topology files aren't pre-installed in containers, so you need to install them inside the task. You must also set privileged: true so the container can access InfiniBand devices.

Create nccl-tests.dstack.yml:

type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

commands:
  # Install NCCL topology files
  # (the most reliable source is to copy them from /etc/crusoe/nccl_topo
  # on a Crusoe-provisioned VM)
  # ...
  # Install and initialize HPC-X
  - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
  - mkdir -p /opt/hpcx
  - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
  - . /opt/hpcx/hpcx-init.sh
  - hpcx_load
  # Run NCCL tests with mpirun, as in the VM example above
  # ...

# Required for InfiniBand access
privileged: true

backends: [kubernetes]

resources:
  gpu: A100:8
  shm_size: 16GB

Complete CMK example

The snippet above omits the full mpirun invocation. See the dstack Crusoe guide for the complete CMK NCCL example, including the additional NCCL_IB_* and UCX_NET_DEVICES settings it requires.

Troubleshoot cluster and NCCL issues

Use the following table to resolve common cluster and NCCL issues:

Issue	Resolution
Low NCCL bandwidth	Confirm `NCCL_TOPO_FILE` matches your instance type, and that `/opt/hpcx` and `/etc/crusoe/nccl_topo` are mounted.
InfiniBand devices not visible on CMK	Make sure `privileged: true` is set on the task and that the node pool uses an IB-enabled instance type (`*-ib.8x`).
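
To confirm whether a container can see InfiniBand devices at all, a task command could check for device nodes under /dev/infiniband, the standard Linux location for them. This is a minimal sketch: check_ib_devices is a hypothetical helper, and the optional directory argument exists only to make it easy to try out.

```shell
# Hypothetical helper: report whether InfiniBand device nodes are visible.
# Usage: check_ib_devices [device_dir]
check_ib_devices() {
  ib_dir="${1:-/dev/infiniband}"
  if [ -d "$ib_dir" ] && [ -n "$(ls -A "$ib_dir" 2>/dev/null)" ]; then
    echo "InfiniBand device nodes found in $ib_dir:"
    ls "$ib_dir"
  else
    echo "no InfiniBand device nodes in $ib_dir" >&2
    return 1
  fi
}
```

If the check passes but bandwidth is still low, adding -x NCCL_DEBUG=INFO to the mpirun command makes NCCL log which network transport it selected.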

For provisioning issues (quota, offers, CMK connectivity), see the Troubleshooting section in the Quickstart.

Next steps

InfiniBand networking - Crusoe's high-speed interconnect
Distributed tasks and Fleets - dstack's documentation on its distributed computing features
