Setting up and Validating IMEX
NVIDIA's IMEX service supports GPU memory export and import (NVLink P2P) and shared memory operations across OS domains in an NVLink multi-node deployment. On Crusoe Cloud, this is relevant for rack-scale solutions such as the GB200.
IMEX is facilitated by the nvidia-imex.service daemon, which:
- Manages the GPU sharing lifecycle and runs on the compute nodes.
- Operates below the application level and outside the CUDA/NCCL stack.
- Communicates over TCP/IP or gRPC connections (see the sketch after this list).
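Because the daemon communicates over standard TCP sockets, you can confirm that it is listening once the service is running. This is a minimal sketch using standard Linux tooling; the listening port is set in the IMEX config file and may differ between deployments.
# Show listening sockets owned by the nvidia-imex daemon (requires root to see process names)
sudo ss -ltnp | grep -i nvidia-imex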
This page provides a step-by-step guide to setting up IMEX and validating that it is working correctly.
Step 1: Check IMEX service status to ensure it is enabled
The nvidia-imex-ctl and nvidia-imex binaries are already available on the GB200 VM image. To check the status of the service, run the following command.
sudo systemctl status nvidia-imex
If the service is not enabled, you can enable it using the following command.
sudo systemctl enable nvidia-imex
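As an optional sanity check, the following commands (a minimal sketch using standard systemd tooling) confirm that the service is both enabled at boot and currently running:
# Confirm the service is enabled to start at boot and is currently active
systemctl is-enabled nvidia-imex
systemctl is-active nvidia-imex
# If it is enabled but not yet running, start it now
sudo systemctl start nvidia-imex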
Step 2: Create nodes_config.cfg with compute tray IP addresses
Create the file /etc/nvidia-imex/nodes_config.cfg and add the private IPs of all the GB200 VMs in the NVLink domain. An example format for the file is shown below.
172.27.58.70
172.27.58.71
172.27.58.72
172.27.58.73
After updating the config file with all VM IPs, restart the IMEX service
sudo systemctl stop nvidia-imex
sudo systemctl start nvidia-imex
Note: The IMEX config file is read as part of the IMEX service startup process. If you change any config file options, you must restart the IMEX service for the new settings to take effect.
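The same nodes_config.cfg must exist on every VM in the NVLink domain. The loop below is a minimal sketch for distributing the file and restarting the service on each node; it assumes passwordless SSH access from the current VM to the others.
# Push the node list to every VM it names and restart IMEX there.
# Assumes passwordless SSH/SCP from this VM to each peer.
for ip in $(cat /etc/nvidia-imex/nodes_config.cfg); do
  scp /etc/nvidia-imex/nodes_config.cfg "${ip}":/tmp/nodes_config.cfg
  ssh "${ip}" "sudo mv /tmp/nodes_config.cfg /etc/nvidia-imex/nodes_config.cfg && sudo systemctl restart nvidia-imex"
done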
Step 3: Create IMEX Channels
IMEX channels are a GPU driver feature that allows for user-based memory isolation in a multi-user environment within an IMEX domain.
Create a channel that allows user-based memory isolation in the multi-node setup.
# Get the major number for nvidia-caps-imex-channels. In the following example output, 234 is the major number.
$ grep nvidia-caps-imex-channels /proc/devices
234 nvidia-caps-imex-channels

# Create the channel. This must be re-created after every reboot because /dev is reset.
sudo mkdir /dev/nvidia-caps-imex-channels/
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major number> 0

# To make the channel persistent across reboots, add the following option to /etc/modprobe.d/nvidia.conf and save:
$ sudo vi /etc/modprobe.d/nvidia.conf
options nvidia NVreg_CreateImexChannel0=1

# Regenerate the initramfs and reboot
sudo update-initramfs -u
sudo reboot
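After the reboot, you can verify that the channel device was recreated automatically; a minimal sketch:
# The character device should exist with the major number reported by /proc/devices
grep nvidia-caps-imex-channels /proc/devices
ls -l /dev/nvidia-caps-imex-channels/channel0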
Step 4: Check whole rack connectivity
Check the connectivity of the whole rack using
nvidia-imex-ctl -N
If 'C' (Connected) appears for all node pairs, the rack connectivity check has passed. 'I' (Invalid) indicates an error; run sudo journalctl -u nvidia-imex to view the IMEX logs.
If all nodes are connected, the following table is returned.
Nodes From\To 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 C C C C C C C C C C C C C C C C C C
1 C C C C C C C C C C C C C C C C C C
2 C C C C C C C C C C C C C C C C C C
3 C C C C C C C C C C C C C C C C C C
4 C C C C C C C C C C C C C C C C C C
5 C C C C C C C C C C C C C C C C C C
6 C C C C C C C C C C C C C C C C C C
7 C C C C C C C C C C C C C C C C C C
8 C C C C C C C C C C C C C C C C C C
9 C C C C C C C C C C C C C C C C C C
10 C C C C C C C C C C C C C C C C C C
11 C C C C C C C C C C C C C C C C C C
12 C C C C C C C C C C C C C C C C C C
13 C C C C C C C C C C C C C C C C C C
14 C C C C C C C C C C C C C C C C C C
15 C C C C C C C C C C C C C C C C C C
16 C C C C C C C C C C C C C C C C C C
17 C C C C C C C C C C C C C C C C C C
For any issues, run sudo journalctl -u nvidia-imex to view the IMEX logs.
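To automate this check, the snippet below is a minimal sketch that scans the nvidia-imex-ctl -N output for any entry other than 'C', assuming the matrix format shown above (a header row followed by one row per node).
# Flag any node pair that is not reported as Connected ('C')
nvidia-imex-ctl -N | tail -n +2 | awk '
  { for (i = 2; i <= NF; i++) if ($i != "C") { print "Node " $1 " has a non-connected entry: " $i; bad = 1 } }
  END { exit bad }
' && echo "All nodes connected" || echo "Connectivity issue detected; run: sudo journalctl -u nvidia-imex"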