Validating InfiniBand Performance

Crusoe Cloud is currently in alpha. If you do not currently have access, please request access to continue.

Validating IB Networking Performance - NVIDIA Collective Communications Library

You can use NVIDIA's Collective Communications Library (NCCL) to validate the performance of the IB networking stack once at least two instances of an IB-supported instance type are launched within the same IB partition. Follow these steps:

  1. Set up passwordless SSH between the two instances: SSH into one of the instances and run ssh-keygen -t ed25519, accepting the defaults. Then copy the contents of ~/.ssh/id_ed25519.pub into the ~/.ssh/authorized_keys file on both instances.
  2. On each instance, git clone the NVIDIA/nccl-tests repo and compile the binaries according to the repo's instructions.
  3. On one of the instances, create a hostfile that lists the private IP of each instance followed by slots=8, which indicates that each instance supports 8 ranks (8 GPUs per instance). An example hostfile:
172.27.21.150 slots=8
172.27.22.114 slots=8
  4. Run the NCCL all-reduce command below with directives corresponding to your instance. We have chosen UCX as the transport protocol, which is part of the nccl-rdma-sharp plugins. We recommend using UCX for optimized performance.
#!/bin/bash

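# Load the NVIDIA HPC-X environment.
. /opt/hpcx/hpcx-init.sh
hpcx_load

# 16 ranks in total (-np), 8 per instance (-N), one per GPU, as listed in the hostfile.
# all_reduce_perf sweeps message sizes from 8 B (-b) to 2 GB (-e), doubling each step (-f 2),
# with 100 iterations per size (-n) and result checking enabled (-c 1).
mpirun -np 16 -N 8 -x NCCL_DEBUG=INFO -hostfile hostfile \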
--bind-to none -mca btl tcp,self -mca coll_hcoll_enable 0 \
-x NCCL_IB_AR_THRESHOLD=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
-x NCCL_IB_SPLIT_DATA_ON_QPS=0 -x NCCL_IB_QPS_PER_CONNECTION=2 -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x PATH -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
/home/ubuntu/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
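
The hostfile from step 3 and the rank count passed to mpirun are tied together: -np should equal the sum of the slots values. The following is a minimal sketch of two helper functions for that bookkeeping. The names make_hostfile and total_ranks are hypothetical and are not part of HPC-X, Open MPI, or nccl-tests.

```shell
#!/bin/bash

# make_hostfile SLOTS IP [IP...]
# Print one "IP slots=SLOTS" line per instance (hypothetical helper).
make_hostfile() {
  local slots="$1"
  shift
  local ip
  for ip in "$@"; do
    printf '%s slots=%s\n' "$ip" "$slots"
  done
}

# total_ranks [HOSTFILE]
# Sum every slots=N value in the hostfile (or stdin); this is the -np value.
total_ranks() {
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); total += $i }
  } END { print total + 0 }' "$@"
}

# Example using the two private IPs from the hostfile above.
make_hostfile 8 172.27.21.150 172.27.22.114
make_hostfile 8 172.27.21.150 172.27.22.114 | total_ranks   # prints 16
```

To write the file itself, redirect the first call into hostfile. The total printed by total_ranks matches the -np 16 used in the command above.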

With NCCL_DEBUG=INFO set, you will get verbose output of the NCCL communication layer bootstrapping the connection, as well as a report of the connection topology and the specific environment variables used. You can confirm a few additional key points to ensure optimal performance:

  1. Ensure that each rank reports that the correct NCCL topology file is being used. The file is selected by the NCCL_TOPO_FILE environment variable, which Crusoe's curated images set globally to ensure that the CPU affinity, GPUs, and IB HCAs are in the expected topology.
aragab-ib7:24224:24290 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /etc/crusoe/nccl_topo/h100-80gb-sxm-ib.xml
aragab-ib7:24223:24291 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /etc/crusoe/nccl_topo/h100-80gb-sxm-ib.xml
aragab-ib7:24227:24292 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /etc/crusoe/nccl_topo/h100-80gb-sxm-ib.xml
  2. Crusoe's curated images load the nvidia_peermem module by default, which enables GPUDirect RDMA. You can confirm this in the NCCL debug output by checking for lines such as:
...
aragab-ib7:24224:24290 [5] NCCL INFO Channel 04/0 : 13[5] -> 4[4] [send] via NET/UCX/0/GDRDMA
aragab-ib7:24224:24290 [5] NCCL INFO Channel 12/0 : 13[5] -> 4[4] [send] via NET/UCX/0/GDRDMA
...

or

...
aragab-ib10:12931:12936 [6] NCCL INFO Channel 06/0 : 166[6] -> 150[6] [send] via NET/IBext/2/GDRDMA
aragab-ib10:12931:12936 [6] NCCL INFO Channel 22/0 : 166[6] -> 150[6] [send] via NET/IBext/2/GDRDMA
...

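The transport is reported as NET/UCX when using the UCX plugin and as NET/IBext when using the IBext plugin, both of which are provided by the NVIDIA HPC-X libraries. In either case, the GDRDMA suffix indicates that GPUDirect RDMA is in use.

The results of the two-instance NCCL test should look similar to the output below: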

#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    35.11    0.00    0.00      0    34.63    0.00    0.00      0
          16             4     float     sum      -1    34.29    0.00    0.00      0    34.70    0.00    0.00      0
          32             8     float     sum      -1    34.90    0.00    0.00      0    34.66    0.00    0.00      0
          64            16     float     sum      -1    33.03    0.00    0.00      0    33.21    0.00    0.00      0
         128            32     float     sum      -1    33.58    0.00    0.01      0    33.44    0.00    0.01      0
         256            64     float     sum      -1    33.81    0.01    0.01      0    33.70    0.01    0.01      0
         512           128     float     sum      -1    36.15    0.01    0.03      0    36.11    0.01    0.03      0
        1024           256     float     sum      -1    37.18    0.03    0.05      0    37.21    0.03    0.05      0
        2048           512     float     sum      -1    38.58    0.05    0.10      0    38.08    0.05    0.10      0
        4096          1024     float     sum      -1    39.94    0.10    0.19      0    39.50    0.10    0.19      0
        8192          2048     float     sum      -1    42.07    0.19    0.37      0    41.52    0.20    0.37      0
       16384          4096     float     sum      -1    44.84    0.37    0.69      0    42.15    0.39    0.73      0
       32768          8192     float     sum      -1    49.97    0.66    1.23      0    42.10    0.78    1.46      0
       65536         16384     float     sum      -1    50.16    1.31    2.45      0    52.09    1.26    2.36      0
      131072         32768     float     sum      -1    64.25    2.04    3.83      0    62.40    2.10    3.94      0
      262144         65536     float     sum      -1    68.07    3.85    7.22      0    66.84    3.92    7.35      0
      524288        131072     float     sum      -1    74.99    6.99   13.11      0    72.28    7.25   13.60      0
     1048576        262144     float     sum      -1    84.25   12.45   23.34      0    83.38   12.58   23.58      0
     2097152        524288     float     sum      -1    109.4   19.18   35.95      0    125.8   16.67   31.25      0
     4194304       1048576     float     sum      -1    141.9   29.56   55.43      0    156.0   26.89   50.42      0
     8388608       2097152     float     sum      -1    203.3   41.27   77.37      0    201.2   41.68   78.16      0
    16777216       4194304     float     sum      -1    419.8   39.97   74.94      0    270.1   62.11  116.45      0
    33554432       8388608     float     sum      -1    493.4   68.01  127.51      0    493.6   67.98  127.46      0
    67108864      16777216     float     sum      -1    910.9   73.67  138.14      0    911.2   73.65  138.09      0
   134217728      33554432     float     sum      -1   1374.7   97.63  183.06      0   1371.5   97.86  183.48      0
   268435456      67108864     float     sum      -1   2725.5   98.49  184.67      0   2672.7  100.44  188.32      0
   536870912     134217728     float     sum      -1   5310.2  101.10  189.57      0   5297.2  101.35  190.03      0
  1073741824     268435456     float     sum      -1    10475  102.50  192.19      0    10455  102.70  192.57      0
  2147483648     536870912     float     sum      -1    20656  103.96  194.93      0    20685  103.82  194.66      0
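
The algbw and busbw columns can be reproduced from the size and time columns. The sketch below assumes the bandwidth definitions documented by nccl-tests for all-reduce: algbw is the message size divided by the elapsed time, and busbw is algbw scaled by 2(n-1)/n for n ranks (n = 16 in this run). The function name all_reduce_bw is illustrative only.

```shell
#!/bin/bash

# all_reduce_bw SIZE_BYTES TIME_US RANKS
# Print "algbw busbw" in GB/s for one all-reduce result row (illustrative helper).
all_reduce_bw() {
  awk -v size="$1" -v t="$2" -v n="$3" 'BEGIN {
    algbw = size / t / 1000            # bytes per microsecond -> GB/s (1 GB = 1e9 B)
    busbw = algbw * 2 * (n - 1) / n    # all-reduce correction factor for n ranks
    printf "%.2f %.2f\n", algbw, busbw
  }'
}

# Last out-of-place row of the table: 2147483648 B in 20656 us across 16 ranks.
all_reduce_bw 2147483648 20656 16      # prints 103.96 194.93
```

Because busbw already accounts for the number of ranks, it is the column to compare across runs with different instance counts.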

If you are unable to reach your desired level of performance, please contact support.
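
Before opening a support request, it can help to re-run the two log checks described earlier against a saved copy of the NCCL_DEBUG=INFO output. The following is a minimal sketch that only counts matching lines. The function name check_nccl_log and the log file name nccl.log are hypothetical, and the patterns are taken from the sample log lines on this page.

```shell
#!/bin/bash

# check_nccl_log LOGFILE
# Summarize the topology-file and GPUDirect RDMA checks (hypothetical helper).
# Returns non-zero if either kind of line is missing from the log.
check_nccl_log() {
  local log="$1"
  local topo gdr
  # Lines where a rank reports the topology file picked up from the environment.
  topo=$(grep -c 'NCCL_TOPO_FILE set by environment' "$log")
  # Channel lines that go over the network with GPUDirect RDMA (UCX or IBext).
  gdr=$(grep -c 'via NET/.*/GDRDMA' "$log")
  echo "topo_file_lines=$topo gdrdma_channels=$gdr"
  [ "$topo" -gt 0 ] && [ "$gdr" -gt 0 ]
}

# Usage, assuming the mpirun output was saved with: ... 2>&1 | tee nccl.log
# check_nccl_log nccl.log
```

If no GDRDMA channels are reported, lsmod | grep nvidia_peermem on each instance shows whether the nvidia_peermem module is loaded.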