
Setting up and Validating IMEX

NVIDIA's IMEX service supports GPU memory export and import (NVLink P2P) and shared memory operations across OS domains in an NVLink multi-node deployment. On Crusoe Cloud, this is relevant for rack-scale solutions such as the GB200.

IMEX is facilitated by a daemon, nvidia-imex.service, which:

  • Manages the GPU sharing lifecycle and runs on the compute nodes.
  • Operates below the application level and outside the CUDA/NCCL stack.
  • Communicates via TCP/IP or gRPC connections.

This page provides a step-by-step guide to setting up IMEX and validating that it is working correctly.

Step 1: Check IMEX service status to ensure it is enabled

The nvidia-imex-ctl and nvidia-imex binaries are already available on the GB200 VM image. To check the status of the service, run the following command.

sudo systemctl status nvidia-imex

If the service is not enabled, you can enable it using the following command.

sudo systemctl enable nvidia-imex
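
If the service is installed but neither enabled nor running, the standard systemctl shortcut below enables and starts it in one step (equivalent to running enable followed by start):

sudo systemctl enable --now nvidia-imex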

Step 2: Create nodes_config.cfg with compute tray IP addresses

Create the file /etc/nvidia-imex/nodes_config.cfg and add all the private IPs of the GB200 VMs in the NVLink domain. An example format for the file is shown below:

172.27.58.70
172.27.58.71
172.27.58.72
172.27.58.73

After updating the config file with all VM IPs, restart the IMEX service:

sudo systemctl stop nvidia-imex
sudo systemctl start nvidia-imex

Note: The IMEX config file is read as part of the IMEX service startup process. If you change the config file options, you will need to restart the IMEX service for the new settings to take effect.
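
Every VM in the NVLink domain needs the same nodes_config.cfg, so it can help to push the file from one node to the rest and restart the service everywhere. The loop below is only a sketch: it assumes passwordless SSH/scp access from the current VM to each peer IP listed in the file.

# Sketch: copy nodes_config.cfg to every node in the domain and restart IMEX there.
# Assumes passwordless SSH/scp to each peer IP listed in the file.
while read -r ip; do
  scp /etc/nvidia-imex/nodes_config.cfg "${ip}:/tmp/nodes_config.cfg"
  ssh -n "${ip}" 'sudo mv /tmp/nodes_config.cfg /etc/nvidia-imex/nodes_config.cfg && sudo systemctl restart nvidia-imex'
done < /etc/nvidia-imex/nodes_config.cfg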

Step 3: Create IMEX Channels

IMEX channels are a GPU driver feature that allows for user-based memory isolation in a multi-user environment within an IMEX domain.

Create a channel that allows user-based memory isolation in the multi-node setup.

# Get the major number. In the following example output, 234 is the major number
$ grep nvidia-caps-imex-channels /proc/devices
234 nvidia-caps-imex-channels

# Create the channel. THIS WILL HAVE TO BE RECREATED EVERY TIME AFTER REBOOT because /dev is reset.
sudo mkdir /dev/nvidia-caps-imex-channels/
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c <major number> 0

# FOR CREATING A PERSISTENT CHANNEL POST REBOOT
$ vi /etc/modprobe.d/nvidia.conf
options nvidia NVreg_CreateImexChannel0=1 # <== ADD THIS OPTION AND SAVE

# REGENERATE initramfs
sudo update-initramfs -u
sudo reboot
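
If you would rather not use the modprobe option, the manual channel creation above can be scripted so the major number is looked up automatically. This is only a sketch and would need to be re-run after every reboot (for example from a startup script of your own):

# Sketch: recreate the default IMEX channel, looking up the major number automatically.
major=$(awk '/nvidia-caps-imex-channels/ {print $1}' /proc/devices)
sudo mkdir -p /dev/nvidia-caps-imex-channels
# Only create channel0 if it does not already exist
[ -e /dev/nvidia-caps-imex-channels/channel0 ] || sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "${major}" 0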

Step 4: Check whole rack connectivity

Check the connectivity of the whole rack using:

nvidia-imex-ctl -N

If 'C' (Connected) appears for all nodes, then the rack connectivity check has passed.

An 'I' (Invalid) indicates an error. Run sudo journalctl -u nvidia-imex to view the IMEX logs.

If all nodes are connected, the following table will be returned:

Nodes From\To  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
0              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
1              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
2              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
3              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
4              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
5              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
6              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
7              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
8              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
9              C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
10             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
11             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
12             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
13             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
14             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
15             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
16             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
17             C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C

For any issues, run sudo journalctl -u nvidia-imex to view the IMEX logs.
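
To confirm the whole setup from a single VM, the sketch below loops over the nodes file and checks that the IMEX service is active and the channel device exists on each node. It assumes passwordless SSH to each peer.

# Sketch: verify the IMEX service state and channel device on every node in the domain.
# Assumes passwordless SSH to each peer IP listed in nodes_config.cfg.
while read -r ip; do
  echo "== ${ip} =="
  ssh -n "${ip}" 'systemctl is-active nvidia-imex; ls /dev/nvidia-caps-imex-channels/'
done < /etc/nvidia-imex/nodes_config.cfg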