Setting up and Validating IMEX
NVIDIA's IMEX service enables GPU memory export and import (NVLink P2P) and shared memory operations across OS domains in an NVLink multi-node deployment. On Crusoe Cloud this is relevant for rack-scale systems such as the GB200.
IMEX is provided by the nvidia-imex daemon (nvidia-imex.service), which:
- Manages the GPU memory sharing lifecycle and runs on the compute nodes.
- Operates below the application level, outside the CUDA/NCCL stack.
- Communicates over TCP/IP or gRPC connections.
This page provides a step-by-step guide to setting up IMEX and validating that it is working correctly.
Step 1: Check IMEX status and service
The nvidia-imex-ctl utility and the nvidia-imex service are already available on the VM image. To check the service status, run the following commands.
sudo systemctl status nvidia-imex
sudo systemctl stop nvidia-imex
sudo systemctl start nvidia-imex
Ensure the service starts on boot.
sudo systemctl enable nvidia-imex
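The status checks above can be scripted for a quick sanity check; this is a minimal sketch assuming a systemd host (on a machine without systemctl it reports "unknown"):

```shell
# Report the IMEX service state; falls back to "unknown" where
# systemctl is unavailable or the unit does not exist.
state=$(systemctl is-active nvidia-imex 2>/dev/null)
state=${state:-unknown}
echo "nvidia-imex service state: $state"
```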
Step 2: Check whole rack connectivity
Check the connectivity of the whole rack using
nvidia-imex-ctl -N
If 'C' (Connected) appears for all nodes, the rack connectivity check has passed.
'I' (Invalid) indicates an error. Run sudo journalctl -u nvidia-imex
to view the IMEX logs.
Once the connectivity check has passed:
- Ensure the /etc/nvidia-imex directory exists.
- Collect the IPs of all 18 compute trays on the rack. These will be used in step 3.
- Ensure passwordless SSH authentication works to all of them.
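The passwordless-authentication requirement can be checked non-interactively; a sketch using placeholder IPs (substitute the 18 tray addresses you collected):

```shell
# Check passwordless SSH to each tray. BatchMode=yes makes ssh fail
# instead of prompting for a password. The IPs below are placeholders.
checked=0
for ip in 10.0.0.1 10.0.0.2; do
  if ssh -o BatchMode=yes -o ConnectTimeout=3 "$ip" true 2>/dev/null; then
    echo "$ip: ok"
  else
    echo "$ip: passwordless auth failed"
  fi
  checked=$((checked + 1))
done
echo "checked $checked trays"
```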
Step 3: Create nodes_config.cfg with compute tray IP addresses
Create the file /etc/nvidia-imex/nodes_config.cfg
and add all the tray IPs from step 2 to it, one per line.
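For example, the file can be generated from the collected IP list; the addresses below are placeholders for the 18 tray IPs (write to a local file first, then copy it into /etc/nvidia-imex/ with the needed privileges):

```shell
# Write one tray IP per line (placeholder addresses; a real GB200 rack
# has 18 entries). Copy the result to /etc/nvidia-imex/nodes_config.cfg.
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.3 > nodes_config.cfg
cat nodes_config.cfg
```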
After updating the config file with all tray IPs, restart the IMEX service:
sudo systemctl stop nvidia-imex
sudo systemctl start nvidia-imex
Note: The IMEX config file is read during IMEX service startup, so any change to the config file requires a service restart to take effect.
Step 4: Create IMEX Channels
IMEX channels are a GPU driver feature that provides user-based memory isolation in a multi-user environment within an IMEX domain. Create the default channel for the multi-node setup as follows.
# Get the major number assigned to nvidia-caps-imex-channels
grep nvidia-caps-imex-channels /proc/devices
# Create the channel. This must be recreated after every reboot,
# because /dev is reset.
sudo mkdir -p /dev/nvidia-caps-imex-channels/
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major number> 0
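The major-number lookup and the mknod invocation can be combined; a sketch using a hypothetical /proc/devices line (510 here; the actual number varies per system and must be read from /proc/devices on the node):

```shell
# Extract the dynamic major number from a /proc/devices-style line and
# build the mknod command. On a real node, replace the sample line with:
#   line=$(grep nvidia-caps-imex-channels /proc/devices)
line="510 nvidia-caps-imex-channels"   # hypothetical sample
major=$(echo "$line" | awk '{print $1}')
# Print the command that creates the default channel (channel0, minor 0)
echo "sudo mknod /dev/nvidia-caps-imex-channels/channel0 c $major 0"
```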
# To make the channel persist across reboots instead, add a module
# option to the NVIDIA driver configuration:
vi /etc/modprobe.d/nvidia.conf
options nvidia NVreg_CreateImexChannel0=1 # <== add this option
sudo update-initramfs -u
sudo reboot
Step 5: Check logs and node connectivity
To verify that all nodes are connected, run nvidia-imex-ctl -N.
If all nodes are connected, the following table is returned:
Nodes From\To 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 C C C C C C C C C C C C C C C C C C
1 C C C C C C C C C C C C C C C C C C
2 C C C C C C C C C C C C C C C C C C
3 C C C C C C C C C C C C C C C C C C
4 C C C C C C C C C C C C C C C C C C
5 C C C C C C C C C C C C C C C C C C
6 C C C C C C C C C C C C C C C C C C
7 C C C C C C C C C C C C C C C C C C
8 C C C C C C C C C C C C C C C C C C
9 C C C C C C C C C C C C C C C C C C
10 C C C C C C C C C C C C C C C C C C
11 C C C C C C C C C C C C C C C C C C
12 C C C C C C C C C C C C C C C C C C
13 C C C C C C C C C C C C C C C C C C
14 C C C C C C C C C C C C C C C C C C
15 C C C C C C C C C C C C C C C C C C
16 C C C C C C C C C C C C C C C C C C
17 C C C C C C C C C C C C C C C C C C
For any issues, run sudo journalctl -u nvidia-imex
to view the IMEX logs.
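The connectivity matrix can also be checked programmatically; a sketch against a hypothetical two-node sample of the nvidia-imex-ctl -N output (on a real rack, pipe the live output in instead):

```shell
# Flag any matrix cell that is not 'C' (Connected). The sample below is
# a hypothetical two-node case containing one invalid link.
sample='Nodes From\To 0 1
0 C C
1 C I'
bad=$(printf '%s\n' "$sample" | tail -n +2 | \
  awk '{for (i = 2; i <= NF; i++) if ($i != "C") print "node " $1 " -> node " (i - 2) ": " $i}')
echo "${bad:-all links connected}"
```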