Validating InfiniBand Performance
Validating IB Networking Performance - NVIDIA Collective Communications Library
You can use NVIDIA's Collective Communications Library (NCCL) to validate the performance of the IB networking stack once at least two IB-supported instances are launched within an IB partition. Follow these steps:
- Set up passwordless SSH between the two instances: SSH into one of the instances and run
ssh-keygen -t ed25519
Accept the defaults, then copy the contents of ~/.ssh/id_ed25519.pub to ~/.ssh/authorized_keys on both instances.
- On each instance, git clone the NVIDIA/nccl-tests repo and compile the binaries according to the repo's instructions.
- On one of the instances, create a hostfile that lists the private IP of each instance followed by slots=8, which indicates that each instance supports 8 ranks (one per GPU, 8 GPUs per instance). An example hostfile:
172.27.21.150 slots=8
172.27.22.114 slots=8
- Run the NCCL all-reduce command below, adjusting the directives to match your instances. We have chosen UCX as the transport protocol, which is provided by the nccl-rdma-sharp plugins. We recommend UCX for optimized performance.
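#!/bin/bash
# Load the HPC-X environment (provides mpirun and the NCCL network plugins).
. /opt/hpcx/hpcx-init.sh
hpcx_load
# Launch 16 ranks in total (-np 16), 8 per instance (-N 8), one GPU per rank (-g 1).
# all_reduce_perf sweeps message sizes from 8 bytes (-b 8) up to 2G (-e 2G), doubling
# each step (-f 2), with 100 iterations per size (-n 100) and result checking on (-c 1).
mpirun -np 16 -N 8 -x NCCL_DEBUG=INFO -hostfile hostfile \
--bind-to none -mca btl tcp,self -mca coll_hcoll_enable 0 \
-x NCCL_IB_AR_THRESHOLD=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
-x NCCL_IB_SPLIT_DATA_ON_QPS=0 -x NCCL_IB_QPS_PER_CONNECTION=2 -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x PATH -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
/home/ubuntu/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100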
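The hostfile and the rank count passed to mpirun in the steps above can be derived rather than typed by hand. The following is a minimal sketch, not part of Crusoe's or NVIDIA's tooling: the function names make_hostfile, total_ranks, and check_passwordless_ssh are hypothetical, and it assumes every instance exposes the same number of GPUs.

```bash
#!/bin/bash
# Hypothetical helpers (illustrative names, not part of any Crusoe or NVIDIA tool).

# Print one "IP slots=N" line per instance.
# Usage: make_hostfile SLOTS IP [IP...]
make_hostfile() {
    local slots="$1"
    shift
    local ip
    for ip in "$@"; do
        printf '%s slots=%s\n' "$ip" "$slots"
    done
}

# Sum the slots= values of a hostfile read from stdin.
# The result is the total rank count to pass to mpirun -np.
total_ranks() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=[0-9]+$/) { split($i, kv, "="); n += kv[2] }
    }
    END { print n + 0 }'
}

# Check that every host in a hostfile (stdin) accepts key-based SSH.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# -n keeps ssh from consuming the hostfile being read by the loop.
check_passwordless_ssh() {
    local host rest status=0
    while read -r host rest; do
        [ -z "$host" ] && continue
        if ! ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "passwordless SSH to $host failed" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `make_hostfile 8 172.27.21.150 172.27.22.114 > hostfile` reproduces the example hostfile, `total_ranks < hostfile` prints the 16 used for -np, and `check_passwordless_ssh < hostfile` can be run before launching mpirun.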
With NCCL_DEBUG=INFO set, you will get verbose output from the NCCL communication layer as it bootstraps the connection, including the connection topology and the specific environment variables used. You can confirm a few additional key points to ensure optimum performance:
- Ensure that each rank reports that the correct NCCL topology file is being used. NCCL_TOPO_FILE is a global environment variable set by Crusoe's curated images to ensure that the CPU affinity, GPUs, and IB HCAs are in the expected topology.
aragab-ib7:24224:24290 [5] NCCL INFO NCCL_TOPO_FILE set by environment to /etc/crusoe/nccl_topo/h100-80gb-sxm-ib.xml
aragab-ib7:24223:24291 [4] NCCL INFO NCCL_TOPO_FILE set by environment to /etc/crusoe/nccl_topo/h100-80gb-sxm-ib.xml
aragab-ib7:24227:24292 [7] NCCL INFO NCCL_TOPO_FILE set by environment to /etc/crusoe/nccl_topo/h100-80gb-sxm-ib.xml
- Crusoe's curated images load the nvidia_peermem module by default, which enables GPUDirect RDMA. You can confirm this in the NCCL debug output by checking for lines like:
...
aragab-ib7:24224:24290 [5] NCCL INFO Channel 04/0 : 13[5] -> 4[4] [send] via NET/UCX/0/GDRDMA
aragab-ib7:24224:24290 [5] NCCL INFO Channel 12/0 : 13[5] -> 4[4] [send] via NET/UCX/0/GDRDMA
...
or
...
aragab-ib10:12931:12936 [6] NCCL INFO Channel 06/0 : 166[6] -> 150[6] [send] via NET/IBext/2/GDRDMA
aragab-ib10:12931:12936 [6] NCCL INFO Channel 22/0 : 166[6] -> 150[6] [send] via NET/IBext/2/GDRDMA
...
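Both checks above can be scripted against a saved copy of the NCCL_DEBUG=INFO output. This is a minimal sketch under stated assumptions: the function names are hypothetical, the log is read from stdin, and network channels without GPUDirect RDMA are assumed to appear as "via NET/..." lines that lack the GDRDMA suffix.

```bash
#!/bin/bash
# Hypothetical helpers; each reads NCCL_DEBUG=INFO output on stdin.

# Print the distinct topology files reported by the ranks (ideally exactly one).
topo_files() {
    sed -n 's/.*NCCL_TOPO_FILE set by environment to //p' | sort -u
}

# Count network channels that report GPUDirect RDMA.
# "|| true" keeps the exit status at 0 when grep finds no match and prints 0.
count_gdrdma_channels() {
    grep -c 'via NET/[^ ]*/GDRDMA' || true
}

# Count network channels that do NOT carry the GDRDMA suffix (expected: 0).
count_non_gdrdma_channels() {
    grep 'via NET/' | grep -vc 'GDRDMA' || true
}
```

Assuming the mpirun output was saved to a file such as nccl.log (a hypothetical name), `topo_files < nccl.log` should print a single path and `count_non_gdrdma_channels < nccl.log` should print 0.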
Whether the channels report NET/UCX or NET/IBext depends on which plugin from the NVIDIA HPC-X libraries you are using. The results of the 2-instance NCCL test should look similar to the output below for each instance type:
- a100-80gb-sxm-ib.8x (first table below)
- h100-80gb-sxm-ib.8x (second table below)
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 35.11 0.00 0.00 0 34.63 0.00 0.00 0
16 4 float sum -1 34.29 0.00 0.00 0 34.70 0.00 0.00 0
32 8 float sum -1 34.90 0.00 0.00 0 34.66 0.00 0.00 0
64 16 float sum -1 33.03 0.00 0.00 0 33.21 0.00 0.00 0
128 32 float sum -1 33.58 0.00 0.01 0 33.44 0.00 0.01 0
256 64 float sum -1 33.81 0.01 0.01 0 33.70 0.01 0.01 0
512 128 float sum -1 36.15 0.01 0.03 0 36.11 0.01 0.03 0
1024 256 float sum -1 37.18 0.03 0.05 0 37.21 0.03 0.05 0
2048 512 float sum -1 38.58 0.05 0.10 0 38.08 0.05 0.10 0
4096 1024 float sum -1 39.94 0.10 0.19 0 39.50 0.10 0.19 0
8192 2048 float sum -1 42.07 0.19 0.37 0 41.52 0.20 0.37 0
16384 4096 float sum -1 44.84 0.37 0.69 0 42.15 0.39 0.73 0
32768 8192 float sum -1 49.97 0.66 1.23 0 42.10 0.78 1.46 0
65536 16384 float sum -1 50.16 1.31 2.45 0 52.09 1.26 2.36 0
131072 32768 float sum -1 64.25 2.04 3.83 0 62.40 2.10 3.94 0
262144 65536 float sum -1 68.07 3.85 7.22 0 66.84 3.92 7.35 0
524288 131072 float sum -1 74.99 6.99 13.11 0 72.28 7.25 13.60 0
1048576 262144 float sum -1 84.25 12.45 23.34 0 83.38 12.58 23.58 0
2097152 524288 float sum -1 109.4 19.18 35.95 0 125.8 16.67 31.25 0
4194304 1048576 float sum -1 141.9 29.56 55.43 0 156.0 26.89 50.42 0
8388608 2097152 float sum -1 203.3 41.27 77.37 0 201.2 41.68 78.16 0
16777216 4194304 float sum -1 419.8 39.97 74.94 0 270.1 62.11 116.45 0
33554432 8388608 float sum -1 493.4 68.01 127.51 0 493.6 67.98 127.46 0
67108864 16777216 float sum -1 910.9 73.67 138.14 0 911.2 73.65 138.09 0
134217728 33554432 float sum -1 1374.7 97.63 183.06 0 1371.5 97.86 183.48 0
268435456 67108864 float sum -1 2725.5 98.49 184.67 0 2672.7 100.44 188.32 0
536870912 134217728 float sum -1 5310.2 101.10 189.57 0 5297.2 101.35 190.03 0
1073741824 268435456 float sum -1 10475 102.50 192.19 0 10455 102.70 192.57 0
2147483648 536870912 float sum -1 20656 103.96 194.93 0 20685 103.82 194.66 0
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 31.73 0.00 0.00 0 31.70 0.00 0.00 0
16 4 float sum -1 31.58 0.00 0.00 0 31.57 0.00 0.00 0
32 8 float sum -1 31.74 0.00 0.00 0 31.66 0.00 0.00 0
64 16 float sum -1 30.27 0.00 0.00 0 30.20 0.00 0.00 0
128 32 float sum -1 30.53 0.00 0.01 0 30.45 0.00 0.01 0
256 64 float sum -1 40.39 0.01 0.01 0 30.70 0.01 0.02 0
512 128 float sum -1 32.01 0.02 0.03 0 31.41 0.02 0.03 0
1024 256 float sum -1 32.66 0.03 0.06 0 32.40 0.03 0.06 0
2048 512 float sum -1 33.82 0.06 0.11 0 33.48 0.06 0.11 0
4096 1024 float sum -1 35.14 0.12 0.22 0 34.90 0.12 0.22 0
8192 2048 float sum -1 36.06 0.23 0.43 0 35.70 0.23 0.43 0
16384 4096 float sum -1 38.19 0.43 0.80 0 37.51 0.44 0.82 0
32768 8192 float sum -1 39.62 0.83 1.55 0 38.80 0.84 1.58 0
65536 16384 float sum -1 41.42 1.58 2.97 0 40.81 1.61 3.01 0
131072 32768 float sum -1 45.61 2.87 5.39 0 45.04 2.91 5.46 0
262144 65536 float sum -1 53.40 4.91 9.20 0 53.37 4.91 9.21 0
524288 131072 float sum -1 60.98 8.60 16.12 0 60.53 8.66 16.24 0
1048576 262144 float sum -1 73.81 14.21 26.64 0 72.83 14.40 27.00 0
2097152 524288 float sum -1 103.1 20.34 38.13 0 102.7 20.42 38.28 0
4194304 1048576 float sum -1 123.0 34.11 63.96 0 119.5 35.09 65.80 0
8388608 2097152 float sum -1 137.3 61.10 114.56 0 123.4 67.98 127.45 0
16777216 4194304 float sum -1 195.6 85.78 160.83 0 192.1 87.35 163.78 0
33554432 8388608 float sum -1 544.4 61.64 115.58 0 419.7 79.95 149.90 0
67108864 16777216 float sum -1 468.6 143.22 268.53 0 469.1 143.06 268.23 0
134217728 33554432 float sum -1 779.9 172.09 322.67 0 776.6 172.83 324.05 0
268435456 67108864 float sum -1 1448.4 185.33 347.49 0 1447.9 185.40 347.63 0
536870912 134217728 float sum -1 2881.9 186.29 349.29 0 2879.8 186.43 349.56 0
1073741824 268435456 float sum -1 5674.6 189.22 354.78 0 5676.4 189.16 354.67 0
2147483648 536870912 float sum -1 11152 192.56 361.06 0 11127 193.00 361.87 0
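When comparing your own run against these tables, it helps to know how the columns relate. nccl-tests reports algbw as message size divided by time, and for all-reduce it derives busbw as algbw × 2(n−1)/n, where n is the number of ranks (16 here, giving a factor of 1.875). The sketch below recomputes busbw and extracts the peak value from saved output; the function names are hypothetical and the column positions assume the all_reduce_perf layout shown above.

```bash
#!/bin/bash
# Hypothetical helpers for reading all_reduce_perf results.

# Recompute all-reduce bus bandwidth (GB/s) from message size, time, and rank count.
# algbw = bytes / time; busbw = algbw * 2 * (n - 1) / n (nccl-tests' all-reduce definition).
# Usage: allreduce_busbw BYTES TIME_US NRANKS
allreduce_busbw() {
    awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
        algbw = bytes / us / 1000        # bytes per microsecond -> GB/s
        printf "%.2f\n", algbw * 2 * (n - 1) / n
    }'
}

# Print the peak out-of-place busbw (column 8) from all_reduce_perf output on stdin.
# Header lines start with "#", so only rows whose first field is a byte count are read.
peak_busbw() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 13 { if ($8 + 0 > max) max = $8 + 0 }
         END { printf "%.2f\n", max }'
}
```

For the last row of the first table, `allreduce_busbw 2147483648 20656 16` prints 194.93, matching the reported out-of-place busbw.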
If you are unable to reach your desired level of performance, please contact support.