Troubleshooting

info

Crusoe Cloud is currently in private beta. If you do not currently have access, please request access to continue.

Issues with NVIDIA drivers

If you run nvidia-smi and get the error "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.", the installed NVIDIA driver was most likely built for a different kernel version than the one the VM is running, or the driver has been uninstalled.

You can verify this by running the following commands:

# Check the running kernel version
root@ubuntu:~$ uname -r
5.4.0-1061-kvm

# Check which kernel version the installed NVIDIA driver module was built for
root@ubuntu:~$ find /usr/lib/modules -name nvidia.ko
/usr/lib/modules/5.4.0-1055-kvm/kernel/drivers/video/nvidia.ko

In the above case, you can see that the VM is running 5.4.0-1061-kvm while the drivers are configured for 5.4.0-1055-kvm. You will need to update the NVIDIA drivers.

If you check the driver version and the system can't find an installed driver, you will need to install the drivers.
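The comparison above can also be scripted. The following is a minimal sketch, assuming the driver module is named nvidia.ko and lives under /usr/lib/modules/<kernel release>/ as in the example. The function name nvidia_module_status and its optional arguments are illustrative and not part of any NVIDIA or Crusoe tooling.

```shell
#!/usr/bin/env bash
# Sketch: report whether an nvidia.ko exists for a given kernel release.
# nvidia_module_status is a hypothetical helper name.
# Usage: nvidia_module_status [kernel_release] [modules_root]
# Prints "match", "mismatch", or "missing" and returns 0, 1, or 2.
nvidia_module_status() {
  local release="${1:-$(uname -r)}"
  local root="${2:-/usr/lib/modules}"

  # No nvidia.ko anywhere under the modules root: no driver is installed.
  if [ -z "$(find "$root" -name nvidia.ko 2>/dev/null | head -n 1)" ]; then
    echo "missing"
    return 2
  fi

  # An nvidia.ko under the given kernel release: driver and kernel agree.
  if [ -n "$(find "$root/$release" -name nvidia.ko 2>/dev/null | head -n 1)" ]; then
    echo "match"
    return 0
  fi

  # A module exists, but only for some other kernel release.
  echo "mismatch"
  return 1
}
```

Called with no arguments, it checks the running kernel. A result of "mismatch" corresponds to the update case, and "missing" to the install case.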

Installing or updating NVIDIA drivers

You can install or update the NVIDIA drivers by running the following commands:

sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers --gpgpu install nvidia

If the above doesn't resolve your issue, please contact support and provide additional details on what's not working, including the output of lshw.

GPU Troubleshooting

Crusoe Cloud provides full hardware passthrough of GPUs directly into the virtual machines it creates. Because of this, any logs needed to troubleshoot GPU errors must be collected from inside the affected VM. The steps for capturing these logs are described below.

If you believe a hardware error has occurred, please capture these logs and submit them in a ticket by contacting support.

How to capture NVIDIA logs

NVIDIA provides an ample set of debugging and error-handling tools across its driver and software stack. The logs these tools produce help Crusoe Support teams reach a resolution quickly. Whenever a GPU error has occurred, it is important to retrieve an NVIDIA bug report, the output of nvidia-smi -q, and any Xid errors from the dmesg logs. You can get the logs by running:

sudo nvidia-bug-report.sh
nvidia-smi -q
dmesg | grep Xid
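When the kernel log contains many Xid lines, a per-code count can make the ticket easier to read. The following is a minimal sketch, assuming the driver's usual log shape of "NVRM: Xid (PCI:<address>): <code>, ...". The function name xid_summary is illustrative and not part of the NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch: count Xid events per error code from kernel log text on stdin.
# xid_summary is a hypothetical helper name.
# Assumed log shape: "NVRM: Xid (PCI:<address>): <code>, ..."
# Usage: dmesg | xid_summary
xid_summary() {
  # Extract the numeric code that follows "Xid (...): " on each matching line,
  # then count how often each code occurs.
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' \
    | sort -n \
    | uniq -c \
    | awk '{ print "Xid " $2 ": " $1 " event(s)" }'
}
```

Lines that do not match the assumed shape are ignored, so the raw dmesg | grep Xid output should still be attached to the ticket alongside any summary.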

If the bug report hangs, there may be a communication error in the NVIDIA driver itself, in which case the client tools cannot communicate with the nvidia.ko kernel module. If this is the case, run the command with --safe-mode:

sudo nvidia-bug-report.sh --safe-mode

The NVLink and NVSwitch layer has its own SXid error code stack. NVIDIA's Fabric Manager documentation describes these codes in full.

Known Failure Modes

There are a few known failure modes which automatically qualify the GPU to be degraded and replaced, which can be determined from the nvidia-bug-report.log.

Any uncorrectable SRAM ECC errors, that is, an SRAM Uncorrectable count greater than 0 under either Volatile or Aggregate:

ECC Errors
    Volatile
        SRAM Correctable   : 0
        SRAM Uncorrectable : 1 <-- known failure mode
        DRAM Correctable   : 0
        DRAM Uncorrectable : 0
    Aggregate
        SRAM Correctable   : 0
        SRAM Uncorrectable : 2 <-- known failure mode
        DRAM Correctable   : 0
        DRAM Uncorrectable : 0

A row remapping failure has occurred and no row remapping is pending:

Remapped Rows
    Correctable Error          : 0
    Uncorrectable Error        : 0
    Pending                    : No
    Remapping Failure Occurred : Yes <-- known failure mode
    Bank Remap Availability Histogram
        Max     : 639 bank(s)
        High    : 0 bank(s)
        Partial : 0 bank(s)
        Low     : 0 bank(s)
        None    : 1 bank(s)

If you find either of these errors, please submit the logs in a support ticket so that Crusoe teams can investigate and replace the GPU.
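Both checks can be automated against text in the format shown above. The following is a minimal sketch, assuming the field names appear exactly as in these excerpts; other driver versions may label the fields differently. The function name check_known_failures is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: flag the two known failure modes in nvidia-smi -q style text on stdin.
# check_known_failures is a hypothetical helper name. Field names are matched
# exactly as they appear in the excerpts above.
# Usage: nvidia-smi -q | check_known_failures
# Returns 1 if a known failure mode is found, 0 otherwise.
check_known_failures() {
  awk -F':' '
    # Remove leading and trailing whitespace from a string.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    {
      key = trim($1)
      val = trim($2)
      sub(/[ \t].*$/, "", val)   # keep only the first token of the value

      # Remember whether we are inside the Volatile or Aggregate counters.
      if (key == "Volatile" || key == "Aggregate") section = key

      if (key == "SRAM Uncorrectable" && (val + 0) > 0) {
        print "FAIL: " section " SRAM Uncorrectable = " val
        bad = 1
      }
      if (key == "Remapping Failure Occurred" && val == "Yes") {
        print "FAIL: Remapping Failure Occurred = Yes"
        bad = 1
      }
    }
    END {
      if (!bad) print "OK: no known failure modes found"
      exit (bad ? 1 : 0)
    }
  '
}
```

An "OK" result only means these two specific patterns were absent. It does not rule out other hardware faults, so the full logs are still worth attaching to a ticket.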

Row Remapping is Pending

This is a special case in which an error has occurred and the GPU is waiting to perform a row remapping. The relevant section of the nvidia-bug-report output may look like:

Remapped Rows
    Correctable Error          : 0
    Uncorrectable Error        : 1
    Pending                    : Yes <-- row remapping is pending
    Remapping Failure Occurred : No
    Bank Remap Availability Histogram
        Max     : 639 bank(s)
        High    : 0 bank(s)
        Partial : 1 bank(s)
        Low     : 0 bank(s)
        None    : 0 bank(s)

To fix this, reset the GPU by running:

sudo nvidia-smi -r

Once the GPU is reset, reboot the instance and confirm that the row remapping was successful. You should see Pending return to “No”, and Remapping Failure Occurred should also read “No”.
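The post-reboot check can be sketched as a small filter over the Remapped Rows section. This assumes the field names shown above and input limited to row-remapping output, for example from nvidia-smi -q -d ROW_REMAPPER. The function name remap_state is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: classify the row-remapping state from Remapped Rows text on stdin.
# remap_state is a hypothetical helper name.
# Usage: nvidia-smi -q -d ROW_REMAPPER | remap_state
# Prints "failed", "pending", or "clean".
remap_state() {
  awk -F':' '
    # Remove leading and trailing whitespace from a string.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    {
      key = trim($1)
      val = trim($2)
      sub(/[ \t].*$/, "", val)   # keep only the first token of the value
      if (key == "Pending" && val == "Yes") pending = 1
      if (key == "Remapping Failure Occurred" && val == "Yes") failed = 1
    }
    END {
      # A remapping failure takes priority over a pending remap.
      if (failed) print "failed"
      else if (pending) print "pending"
      else print "clean"
    }
  '
}
```

A result of "clean" after the reboot matches the expected outcome described above. On a VM with several GPUs, the sketch reports "pending" or "failed" if any one GPU is in that state.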