Managing InfiniBand Networking

Overview

Crusoe Cloud supports high-performance interconnects using NVIDIA Mellanox InfiniBand (IB) networking. The fabric is currently supported for the instance types in the table below:

Instance Type           Number of InfiniBand HCAs per Instance    Total InfiniBand Bandwidth (Gbps)
a100-80gb-sxm-ib.8x     8                                         1600
h100-80gb-sxm-ib.8x     8                                         3200

The general workflow, which is discussed in more detail below, encompasses selecting an IB network from the list of networks available within a location, creating an IB partition within that network, and finally launching instances into that partition. This ensures cluster tenancy within a partition, maximizing performance with cluster-wide isolation.
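
As a quick orientation, the sketch below strings these steps together using the commands covered in the sections that follow; all names and UUIDs are placeholders.

# 1. Find an IB network available in your location
crusoe networking ib-networks list

# 2. Create a partition in that network
crusoe networking ib-partitions create \
  --name my-partition \
  --ib-network-id uuid-of-network

# 3. Launch instances into the partition
crusoe compute vms create \
  --name my-instance \
  --type a100-80gb-sxm-ib.8x \
  --image ubuntu22.04-nvidia-sxm-docker:latest \
  --ib-partition-id uuid-of-partition \
  ...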

InfiniBand VM image

Crusoe Cloud provides a default VM image that comes with all the software libraries and tools necessary to take advantage of InfiniBand through supported ML and HPC frameworks. You are not required to use this image with InfiniBand networking, but we strongly recommend the ubuntu20.04-nvidia-sxm-docker:latest or ubuntu22.04-nvidia-sxm-docker:latest image for the easiest possible setup. Learn more about images.

NCCL_TOPO_FILE

For Crusoe curated images, we provide an NCCL topology file that gives NCCL a system topology map of the CPUs, GPUs, and IB HCAs relative to the PCIe bridges, which ensures optimal performance when running ML workloads. The *-nvidia-sxm-docker images also come with a service unit file that sets the NCCL_TOPO_FILE environment variable.

If you are not using this image, or want to run containers, you will need to provide this configuration yourself by setting the path to the topology file in /etc/nccl.conf or via the NCCL_TOPO_FILE environment variable on your VM or in your container. The NCCL topology files are provided below:

Instance Type           NCCL Topology File
a100-80gb-sxm-ib.8x     a100-80gb-sxm-ib.8x
h100-80gb-sxm-ib.8x     h100-80gb-sxm-ib.8x
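
If you are managing this configuration yourself, a minimal sketch is shown below. The topology file path (/etc/nccl-topo.xml) and the container image name are placeholders; substitute the path where you saved the downloaded topology file.

# Option 1: point NCCL at the topology file via /etc/nccl.conf
# (/etc/nccl-topo.xml is a placeholder path for the downloaded file)
echo "NCCL_TOPO_FILE=/etc/nccl-topo.xml" | sudo tee -a /etc/nccl.conf

# Option 2: set the environment variable directly, e.g. when launching a container
# (my-training-image:latest is a placeholder image name)
docker run --gpus all --rm \
  -e NCCL_TOPO_FILE=/etc/nccl-topo.xml \
  -v /etc/nccl-topo.xml:/etc/nccl-topo.xml:ro \
  my-training-image:latest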

InfiniBand Networks

InfiniBand Networks are a logical representation of the physical InfiniBand fabric.

InfiniBand Network limitations

You are limited to a maximum of five InfiniBand partitions on the same InfiniBand network.

Listing InfiniBand Networks and Partitions

Use the networking ib-networks list and networking ib-partitions list commands to list networks and partitions.

crusoe networking ib-networks list
crusoe networking ib-partitions list

The IDs will be used when attaching a VM to a partition.

Creating InfiniBand Partitions

Use the networking ib-partitions create command to create a new partition.

crusoe networking ib-partitions create \
--name my-new-partition \
--ib-network-id uuid-of-network

Updating an existing InfiniBand Partition

Updating InfiniBand partitions is currently unsupported in the Crusoe Cloud CLI.

Deleting an InfiniBand Partition

Warning: deleting an InfiniBand partition is a permanent action; the partition must be re-created in order to recover it.

InfiniBand partitions can be deleted in the CLI using the networking ib-partitions delete <id> command.
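
For example (the UUID is a placeholder):

crusoe networking ib-partitions delete uuid-of-partition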

Launching Instances in an InfiniBand Partition

Use the compute vms create command to create a new VM, passing in the --ib-partition-id:

crusoe compute vms create \
--name infiniband-test \
--location us-east1-a \
--type a100-80gb-sxm-ib.8x \
--image ubuntu20.04-nvidia-sxm-docker:latest \
--ib-partition-id uuid-of-partition \
...

Updating the IB partition on a VM

Use the compute vms update command on a stopped VM to update the --ib-partition-id.

crusoe compute vms update \
--ib-partition-id uuid-of-new-partition
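
For example, a minimal sketch assuming a VM named infiniband-test (the name and UUID are placeholders; check crusoe compute vms update --help for the exact arguments used to identify the VM):

# Stop the VM before changing its InfiniBand partition
crusoe compute vms stop infiniband-test

# Point the VM at the new partition
crusoe compute vms update \
  --ib-partition-id uuid-of-new-partition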