InfiniBand Metrics
Crusoe Cloud provides out-of-the-box InfiniBand (IB) metrics to help you monitor network performance, identify failures, and optimize utilization. These metrics are collected and published at 5-minute intervals and retained for 30 days. You can view IB metrics directly in the Crusoe Console under the VM Metrics view, or access them through our Prometheus-compatible query API. For detailed instructions on connecting through the API or Grafana, refer to the VM metrics page.
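Because the query API is Prometheus-compatible, it can be called like any Prometheus server. The sketch below assumes the standard Prometheus HTTP API instant-query endpoint (`GET /api/v1/query`) and a bearer-token `Authorization` header. `BASE_URL`, `API_TOKEN`, and the helper function names are hypothetical placeholders; substitute the endpoint and authentication described on the VM metrics page.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholders: replace with the endpoint and credentials
# documented on the VM metrics page.
BASE_URL = "https://metrics.example.invalid/prometheus"
API_TOKEN = "YOUR_API_TOKEN"


def build_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL (GET /api/v1/query)."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response.

    Prometheus encodes each sample as [unix_timestamp, "value_as_string"].
    """
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [
        (series["metric"], float(series["value"][1]))
        for series in payload["data"]["result"]
    ]


def instant_query(promql: str) -> list[tuple[dict, float]]:
    """Run one instant query. The bearer-token auth scheme is an assumption."""
    request = urllib.request.Request(
        build_query_url(BASE_URL, promql),
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_vector(json.load(response))
```

For example, `instant_query("rate(crusoe_ib_port_throughput_tx[5m])")` would return one `(labels, value)` pair per series, with `value` in bytes per second. Only the Python standard library is used, so there is nothing extra to install.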
| Metric | Definition | Suggested Query |
|---|---|---|
| Line Rate | The maximum theoretical data transfer speed of the InfiniBand link, measured in gigabits per second (Gb/s). | crusoe_ib_port_line_rate |
| InfiniBand Throughput Tx (bytes per second) | The rate of data transmitted from the port, measured in bytes per second. | rate(crusoe_ib_port_throughput_tx[5m]) |
| InfiniBand Throughput Rx (bytes per second) | The rate of data received by the port, measured in bytes per second. | rate(crusoe_ib_port_throughput_rx[5m]) |
| InfiniBand Throughput Tx (packets per second) | The rate of packets transmitted from the port, measured in packets per second. | rate(crusoe_ib_port_packets_tx[5m]) |
| InfiniBand Throughput Rx (packets per second) | The rate of packets received by the port, measured in packets per second. | rate(crusoe_ib_port_packets_rx[5m]) |
| InfiniBand Transmit Wait | The rate at which packets had to wait before being transmitted from the port, indicating congestion or a scheduling issue. | rate(crusoe_ib_port_tx_wait[5m]) |
| InfiniBand Link Downed | The rate at which the InfiniBand link transitioned from an active state to a link-down state. | rate(crusoe_ib_port_link_downed[5m]) |
| InfiniBand Link Error Recovery | The rate at which the link underwent an error recovery process in an attempt to restore a healthy link state. | rate(crusoe_ib_port_link_error_recovery[5m]) |
| InfiniBand Port Constraint Errors Tx | The rate of transmit errors caused by link protocol or connectivity constraints, measured in packets per second. | rate(crusoe_ib_port_constraint_error_tx[5m]) |
| InfiniBand Port Constraint Errors Rx | The rate of receive errors caused by link protocol or connectivity constraints, measured in packets per second. | rate(crusoe_ib_port_constraint_error_rx[5m]) |
| InfiniBand Port Errors Rx | The rate of received packets with port-level errors, typically indicating issues such as Cyclic Redundancy Check (CRC) errors. | rate(crusoe_ib_port_error_rx[5m]) |
| InfiniBand Symbol Errors | The rate of low-level physical-layer errors in which the receiver detected an invalid data symbol, measured in bits per second. | rate(crusoe_ib_port_bits_error[5m]) |
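Line rate is reported in gigabits per second while throughput is reported in bytes per second, so comparing them requires a unit conversion: utilization = (bytes/s × 8) / (Gb/s × 10⁹). The sketch below applies that conversion and also lists which error metrics from the table have a positive rate. The function names and the "any nonzero error rate is worth a look" rule are illustrative assumptions, not a Crusoe-defined health check.

```python
# Error-related metric names, taken from the metrics table.
ERROR_METRICS = (
    "crusoe_ib_port_link_downed",
    "crusoe_ib_port_link_error_recovery",
    "crusoe_ib_port_constraint_error_tx",
    "crusoe_ib_port_constraint_error_rx",
    "crusoe_ib_port_error_rx",
    "crusoe_ib_port_bits_error",
)


def link_utilization(throughput_bytes_per_s: float, line_rate_gbps: float) -> float:
    """Fraction of the line rate in use.

    Converts bytes/s to bits/s (x 8) and Gb/s to bits/s (x 1e9).
    """
    if line_rate_gbps <= 0:
        raise ValueError("line rate must be positive")
    return (throughput_bytes_per_s * 8) / (line_rate_gbps * 1e9)


def nonzero_error_metrics(rates: dict[str, float]) -> list[str]:
    """Return the error metrics whose rate is above zero, in table order.

    Treating any positive rate as noteworthy is an assumed heuristic.
    """
    return [name for name in ERROR_METRICS if rates.get(name, 0.0) > 0]
```

For example, a port transmitting 25 × 10⁹ bytes/s on a hypothetical 400 Gb/s link gives `link_utilization(25e9, 400) == 0.5`, that is, 50% of the line rate.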