Troubleshooting
Issues with NVIDIA drivers
If you run nvidia-smi and get the error
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
then the installed NVIDIA drivers are likely incompatible with the running kernel, or the drivers have been uninstalled.
You can verify this by running the following commands:
# Check kernel version
root@ubuntu:~$ uname -r
5.4.0-1061-kvm
# Check which kernel version the installed NVIDIA driver module was built for
root@ubuntu:~$ find /usr/lib/modules -name nvidia.ko
/usr/lib/modules/5.4.0-1055-kvm/kernel/drivers/video/nvidia.ko
In the above case, the VM is running kernel 5.4.0-1061-kvm while the driver module is installed for 5.4.0-1055-kvm. You will need to update the NVIDIA drivers.
If you check the driver version and the system can't find an installed driver, you will need to install the drivers.
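The comparison above can be scripted. The following is a minimal sketch, assuming the driver module is named nvidia.ko and lives under /usr/lib/modules as in the example output; the function names nvidia_module_kernels and driver_matches_kernel are hypothetical helpers, not part of any NVIDIA or Crusoe tooling.

```shell
# List the kernel versions under a modules root that contain an nvidia.ko module.
# The root defaults to /usr/lib/modules, the path used in the find command above.
nvidia_module_kernels() {
    local modules_root="${1:-/usr/lib/modules}"
    local path rel
    find "$modules_root" -name nvidia.ko 2>/dev/null | while IFS= read -r path; do
        # Paths look like <root>/<kernel-version>/kernel/drivers/video/nvidia.ko,
        # so the kernel version is the first path component below the root.
        rel="${path#"${modules_root%/}"/}"
        printf '%s\n' "${rel%%/*}"
    done | sort -u
}

# Succeed (exit 0) when a driver module exists for the given kernel version.
# The kernel version defaults to the running kernel reported by uname -r.
driver_matches_kernel() {
    local kernel="${1:-$(uname -r)}"
    local modules_root="${2:-/usr/lib/modules}"
    nvidia_module_kernels "$modules_root" | grep -Fxq -- "$kernel"
}

# Example usage:
#   driver_matches_kernel || echo "No NVIDIA module for kernel $(uname -r)"
```

If driver_matches_kernel fails and nvidia_module_kernels prints a different version, the drivers need updating; if it prints nothing, no driver module was found.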
Installing or updating NVIDIA drivers
You can install or update the NVIDIA drivers by running the following commands:
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers --gpgpu install nvidia
If the above doesn't resolve your issue, please contact support and provide additional details (including the output of lshw) on what's not working.
GPU Troubleshooting
Crusoe Cloud provides full hardware passthrough of GPU instances directly into the Virtual Machines created. Because of this, any relevant logs to help troubleshoot GPU errors must be taken from inside the affected VM. The steps for capturing the specific logs are described below.
If you believe a hardware error has occurred, please capture these logs and submit them in a ticket by contacting support.
How to capture NVIDIA logs
NVIDIA provides an ample set of debugging and error-handling tools across its driver and software stack. These logs help Crusoe Support teams reach a resolution quickly. Whenever a GPU error has occurred, it is important to retrieve an NVIDIA bug report, the output of nvidia-smi -q, and any Xid errors from the dmesg logs. You can get the logs by running:
sudo nvidia-bug-report.sh
nvidia-smi -q
dmesg | grep Xid
If the bug report hangs, there may be a communication error in the NVIDIA driver itself, in which case the client tools cannot communicate with the nvidia.ko kernel driver. If this is the case, run the command with --safe-mode:
sudo nvidia-bug-report.sh --safe-mode
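The capture steps above can be wrapped in one helper that gathers everything into a single directory for a support ticket. This is a sketch under stated assumptions: collect_gpu_logs and the BUG_REPORT_TIMEOUT variable (600 seconds by default) are hypothetical names chosen for illustration, the output file names are arbitrary, and the fallback treats any failure or timeout of the normal bug report as a reason to retry with --safe-mode.

```shell
# Collect the NVIDIA bug report, nvidia-smi -q output, and Xid lines from dmesg
# into one directory. Tools that are not installed are skipped with a warning.
collect_gpu_logs() {
    local outdir="${1:-gpu-logs-$(date +%Y%m%d-%H%M%S)}"
    mkdir -p "$outdir" || return 1

    if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
        (
            # By default the report is written into the current directory.
            cd "$outdir" || exit 1
            # Assumption: give up after BUG_REPORT_TIMEOUT seconds and retry in safe mode.
            sudo timeout "${BUG_REPORT_TIMEOUT:-600}" nvidia-bug-report.sh \
                || sudo nvidia-bug-report.sh --safe-mode
        )
    else
        echo "nvidia-bug-report.sh not found; skipping" >&2
    fi

    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -q > "$outdir/nvidia-smi-q.txt" 2>&1
    else
        echo "nvidia-smi not found; skipping" >&2
    fi

    # grep exits non-zero when there are no Xid lines; that is not an error here.
    dmesg 2>/dev/null | grep Xid > "$outdir/dmesg-xid.txt" || true

    # Print the directory so the caller knows what to attach to the ticket.
    echo "$outdir"
}

# Example usage:
#   collect_gpu_logs /tmp/gpu-logs
```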
The NVLink and NVSwitch layers have their own SXid error codes. NVIDIA documents these in full in its Fabric Manager documentation.
Known Failure Modes
There are a few known failure modes that automatically qualify the GPU to be treated as degraded and replaced. They can be identified from nvidia-bug-report.log.
Any uncorrectable ECC errors in SRAM, with a count greater than 0 in either Volatile or Aggregate:
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 1 <-- known failure mode
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 2 <-- known failure mode
DRAM Correctable : 0
DRAM Uncorrectable : 0
A row remapping failure occurred with no Pending remapped rows:
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : Yes <-- known failure mode
Bank Remap Availability Histogram
Max : 639 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 1 bank(s)
If you find these errors, please submit the logs in a support ticket so that Crusoe teams can investigate and replace the GPU.
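Scanning a long report for these two conditions by eye is error-prone, so a small filter can help. The sketch below assumes the field names appear exactly as in the excerpts above ("SRAM Uncorrectable" and "Remapping Failure Occurred"); other driver versions may label these fields differently. The function name find_known_failure_modes is a hypothetical helper.

```shell
# Read nvidia-smi -q style text on stdin and report the known failure modes.
# Exit status: 0 if at least one failure mode was found, 1 otherwise.
find_known_failure_modes() {
    awk -F':' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        { key = trim($1); value = trim($2) }
        # Any non-zero SRAM uncorrectable count (Volatile or Aggregate) qualifies.
        key == "SRAM Uncorrectable" && value + 0 > 0 {
            print "uncorrectable SRAM ECC errors: " value + 0
            found = 1
        }
        # A remapping failure qualifies regardless of the other counters.
        key == "Remapping Failure Occurred" && value ~ /^Yes/ {
            print "row remapping failure occurred"
            found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Example usage:
#   nvidia-smi -q | find_known_failure_modes && echo "Contact support"
```

The same filter can be pointed at an uncompressed nvidia-bug-report.log, since the text to match is the same.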
Row Remapping is Pending
This is a special case in which an error has occurred and the GPU is waiting to perform a row remapping. The output in nvidia-bug-report.log may look like:
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 1
Pending : Yes <-- Notable
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 639 bank(s)
High : 0 bank(s)
Partial : 1 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
To fix this, reset the GPU by running:
nvidia-smi -r
Once reset, reboot the instance and confirm that the row remapping succeeded. Pending should return to “No” and Remapping Failure Occurred should still read “No”.
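The pending state can be checked before and after the reset with a short filter. This sketch assumes the "Remapped Rows" section is laid out as in the excerpt above, with a "Pending" field following the section header; row_remap_pending is a hypothetical helper name.

```shell
# Read nvidia-smi -q style text on stdin.
# Exit status: 0 if any GPU reports a pending row remapping, 1 otherwise.
row_remap_pending() {
    awk -F':' '
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        # Only look at the Pending field that follows a Remapped Rows header,
        # because other sections of the report also use the word Pending.
        key == "Remapped Rows" { in_section = 1; next }
        in_section && key == "Pending" {
            if ($2 ~ /^[ \t]*Yes/) pending = 1
            in_section = 0
        }
        END { exit(pending ? 0 : 1) }
    '
}

# Example usage:
#   if nvidia-smi -q | row_remap_pending; then
#       echo "Row remapping pending; reset the GPU and reboot the instance"
#   fi
```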
Validating Network Performance
This section provides guidance on how you can validate the networking performance between two Crusoe VMs using open source tools.
Ensure that there are no firewalls or other network restrictions blocking traffic between the virtual machines, as this can affect the results. Additionally, if the virtual machines are on different networks, you may need to configure routing or VPNs to allow communication between them.
Validating Network Performance Quick Start
- Install iperf3
Make sure iperf3 is installed on both virtual machines. To install iperf3, run the command below:
apt-get install iperf3 -y
To confirm that iperf3 is installed, run:
iperf3 --version
- Determine IP addresses
Locate the Private IP addresses of both virtual machines. You will need them in the later steps. You can find them in the Instances tab of the Crusoe Console or through the Crusoe CLI.
- Start the iperf3 server
Choose one of the virtual machines to act as the server.
SSH into the virtual machine and run the following command to start the iperf3 server:
iperf3 -sD
This command starts iperf3 in server mode (-s) and runs it as a detached daemon process (-D).
- Run the iperf3 client
On the other virtual machine, SSH in and run the following command to start the iperf3 client and connect to the server:
iperf3 -c <private_server_ip> -t 60
Replace <private_server_ip> with the Private IP address of the virtual machine running the iperf3 server. The -t 60 option sets the duration of the test to 60 seconds.
- View the results
Once the test is complete, you'll see the results on the client terminal.
Connecting to host 172.27.46.238, port 5201
[ 5] local 172.27.43.3 port 56388 connected to 172.27.46.238 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.56 GBytes 39.2 Gbits/sec 1341 895 KBytes
[ 5] 1.00-2.00 sec 4.84 GBytes 41.6 Gbits/sec 946 907 KBytes
[ 5] 2.00-3.00 sec 4.65 GBytes 39.9 Gbits/sec 242 6.95 MBytes
[ 5] 3.00-4.00 sec 4.16 GBytes 35.8 Gbits/sec 3437 7.55 MBytes
[ 5] 4.00-5.00 sec 4.17 GBytes 35.9 Gbits/sec 2772 7.43 MBytes
[ 5] 5.00-6.00 sec 4.16 GBytes 35.8 Gbits/sec 4885 4.82 MBytes
[ 5] 6.00-7.00 sec 3.98 GBytes 34.2 Gbits/sec 2574 4.85 MBytes
[ 5] 7.00-8.00 sec 3.91 GBytes 33.5 Gbits/sec 983 4.97 MBytes
[ 5] 8.00-9.00 sec 4.09 GBytes 35.2 Gbits/sec 0 7.42 MBytes
[ 5] 9.00-10.00 sec 4.21 GBytes 36.2 Gbits/sec 6009 7.46 MBytes
[ 5] 10.00-11.00 sec 4.35 GBytes 37.4 Gbits/sec 0 7.49 MBytes
[ 5] 11.00-12.00 sec 4.12 GBytes 35.4 Gbits/sec 0 7.65 MBytes
[ 5] 12.00-13.00 sec 3.65 GBytes 31.3 Gbits/sec 0 5.81 MBytes
[ 5] 13.00-14.00 sec 4.24 GBytes 36.4 Gbits/sec 0 7.54 MBytes
[ 5] 14.00-15.00 sec 4.30 GBytes 37.0 Gbits/sec 627 7.32 MBytes
[ 5] 15.00-16.00 sec 4.13 GBytes 35.4 Gbits/sec 4547 4.09 MBytes
[ 5] 16.00-17.00 sec 4.26 GBytes 36.6 Gbits/sec 0 7.45 MBytes
[ 5] 17.00-18.00 sec 4.18 GBytes 35.9 Gbits/sec 3035 7.43 MBytes
[ 5] 18.00-19.00 sec 4.13 GBytes 35.5 Gbits/sec 1392 7.40 MBytes
[ 5] 19.00-20.00 sec 4.22 GBytes 36.2 Gbits/sec 0 7.53 MBytes
[ 5] 20.00-21.00 sec 4.28 GBytes 36.8 Gbits/sec 0 7.32 MBytes
[ 5] 21.00-22.00 sec 4.18 GBytes 35.9 Gbits/sec 0 7.45 MBytes
[ 5] 22.00-23.00 sec 3.61 GBytes 31.1 Gbits/sec 0 810 KBytes
[ 5] 23.00-24.00 sec 4.12 GBytes 35.4 Gbits/sec 2291 6.03 MBytes
[ 5] 24.00-25.00 sec 4.25 GBytes 36.5 Gbits/sec 0 7.57 MBytes
[ 5] 25.00-26.00 sec 4.31 GBytes 37.0 Gbits/sec 0 7.45 MBytes
[ 5] 26.00-27.00 sec 4.27 GBytes 36.7 Gbits/sec 3651 3.70 MBytes
[ 5] 27.00-28.00 sec 4.11 GBytes 35.3 Gbits/sec 0 7.46 MBytes
[ 5] 28.00-29.00 sec 4.18 GBytes 35.9 Gbits/sec 2230 7.46 MBytes
[ 5] 29.00-30.00 sec 4.06 GBytes 34.8 Gbits/sec 3250 7.60 MBytes
[ 5] 30.00-31.00 sec 4.30 GBytes 36.9 Gbits/sec 0 7.48 MBytes
[ 5] 31.00-32.00 sec 4.42 GBytes 38.0 Gbits/sec 0 7.50 MBytes
[ 5] 32.00-33.00 sec 3.34 GBytes 28.7 Gbits/sec 0 930 KBytes
[ 5] 33.00-34.00 sec 4.13 GBytes 35.5 Gbits/sec 0 7.36 MBytes
[ 5] 34.00-35.00 sec 4.25 GBytes 36.5 Gbits/sec 0 7.24 MBytes
[ 5] 35.00-36.00 sec 4.13 GBytes 35.5 Gbits/sec 6558 7.39 MBytes
[ 5] 36.00-37.00 sec 4.15 GBytes 35.6 Gbits/sec 0 7.14 MBytes
[ 5] 37.00-38.00 sec 4.02 GBytes 34.5 Gbits/sec 2551 6.82 MBytes
[ 5] 38.00-39.00 sec 4.13 GBytes 35.5 Gbits/sec 1770 5.73 MBytes
[ 5] 39.00-40.00 sec 3.96 GBytes 34.0 Gbits/sec 0 7.41 MBytes
[ 5] 40.00-41.00 sec 3.97 GBytes 34.1 Gbits/sec 468 5.06 MBytes
[ 5] 41.00-42.00 sec 3.84 GBytes 33.0 Gbits/sec 1541 5.02 MBytes
[ 5] 42.00-43.00 sec 3.24 GBytes 27.8 Gbits/sec 0 898 KBytes
[ 5] 43.00-44.00 sec 4.79 GBytes 41.2 Gbits/sec 0 727 KBytes
[ 5] 44.00-45.00 sec 4.72 GBytes 40.6 Gbits/sec 0 816 KBytes
[ 5] 45.00-46.00 sec 4.40 GBytes 37.8 Gbits/sec 0 7.45 MBytes
[ 5] 46.00-47.00 sec 4.15 GBytes 35.7 Gbits/sec 6635 7.43 MBytes
[ 5] 47.00-48.00 sec 4.27 GBytes 36.7 Gbits/sec 0 7.47 MBytes
[ 5] 48.00-49.00 sec 4.20 GBytes 36.1 Gbits/sec 1667 7.40 MBytes
[ 5] 49.00-50.00 sec 4.22 GBytes 36.2 Gbits/sec 0 7.43 MBytes
[ 5] 50.00-51.00 sec 4.18 GBytes 35.9 Gbits/sec 769 7.54 MBytes
[ 5] 51.00-52.00 sec 4.18 GBytes 35.9 Gbits/sec 314 7.60 MBytes
[ 5] 52.00-53.00 sec 4.11 GBytes 35.3 Gbits/sec 2802 5.70 KBytes
[ 5] 53.00-54.00 sec 3.46 GBytes 29.7 Gbits/sec 0 7.35 MBytes
[ 5] 54.00-55.00 sec 4.07 GBytes 35.0 Gbits/sec 785 7.40 MBytes
[ 5] 55.00-56.00 sec 4.22 GBytes 36.2 Gbits/sec 2956 7.58 MBytes
[ 5] 56.00-57.00 sec 4.12 GBytes 35.4 Gbits/sec 5550 7.49 MBytes
[ 5] 57.00-58.00 sec 4.20 GBytes 36.1 Gbits/sec 0 7.46 MBytes
[ 5] 58.00-59.00 sec 4.16 GBytes 35.8 Gbits/sec 1221 7.49 MBytes
[ 5] 59.00-60.00 sec 4.09 GBytes 35.1 Gbits/sec 2433 7.74 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 249 GBytes 35.7 Gbits/sec 82232 sender
[ 5] 0.00-60.04 sec 249 GBytes 35.6 Gbits/sec receiver
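To compare a run against a target without reading the table by hand, the summary line can be parsed. This is a minimal sketch, assuming the text output format shown above, where the final "sender" line carries the overall bitrate. The function names and the threshold are hypothetical: the minimum is a value you choose, not a Crusoe-published figure.

```shell
# Print the bitrate in Gbits/sec from the last "sender" summary line on stdin.
# Exit status: 0 if a sender line was found, 1 otherwise.
iperf3_sender_gbits() {
    awk '
        $NF == "sender" {
            for (i = 2; i <= NF; i++) {
                if ($i == "Gbits/sec") { rate = $(i - 1); found = 1 }
                if ($i == "Mbits/sec") { rate = $(i - 1) / 1000; found = 1 }
            }
        }
        END { if (found) print rate; exit(found ? 0 : 1) }
    '
}

# Succeed when the sender bitrate on stdin is at least the given minimum (Gbits/sec).
# Exit status: 0 if met, 1 if below the minimum, 2 if no summary line was found.
iperf3_meets_minimum() {
    local min_gbits="$1"
    local rate
    rate="$(iperf3_sender_gbits)" || return 2
    awk -v rate="$rate" -v min="$min_gbits" 'BEGIN { exit(rate + 0 >= min + 0 ? 0 : 1) }'
}

# Example usage (30 is an arbitrary example threshold):
#   iperf3 -c <private_server_ip> -t 60 | tee iperf3.txt | iperf3_meets_minimum 30 \
#       || echo "Throughput below target"
```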
Results will vary depending on the instance type that the server and client are running on. More information can be found on the VM Specifications page.
If you are unable to reach your desired level of performance, please contact support.