Topology

The Topology view arranges the nodes in your Crusoe Managed Kubernetes (CMK) cluster by network topology. Each node appears as a tile within its InfiniBand (IB) pod grouping, with color-coded overlays for health status, GPU utilization, and CPU utilization. Topology is currently available for CMK clusters on InfiniBand networks. Support for Crusoe Virtual Machines (VMs) and for RoCE networks is planned for a future release.

Accessing the Topology View

Navigate to Orchestration, select your cluster, and open the Topology sub-tab.

Topology Layout

Nodes are organized by network connectivity:

  • InfiniBand (IB) pod grouping — Nodes on the same InfiniBand network are grouped by InfiniBand Pod ID and then by node pool name. Each IB pod supports up to 32 VMs (256 GPUs).
  • Non-IB grouping — CPU-only nodes and GPU nodes without InfiniBand (L40S, A40S) are grouped by node pool ID.
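
The grouping rules above can be sketched in a few lines of Python. This is an illustration only, not Crusoe's implementation: the node records and their keys (`name`, `ib_pod_id`, `node_pool_name`, `node_pool_id`) are hypothetical, and the 8 GPUs per VM figure is simply the documented pod limit of 256 GPUs divided by 32 VMs.

```python
from collections import defaultdict

MAX_VMS_PER_IB_POD = 32  # documented per-pod limit
GPUS_PER_VM = 8          # implied by the documented limit: 256 GPUs / 32 VMs


def group_nodes(nodes):
    """Group nodes the way the Topology layout describes.

    Each node is a dict with hypothetical keys: "name", "ib_pod_id"
    (None when the node has no InfiniBand), "node_pool_name", "node_pool_id".
    """
    ib_groups = defaultdict(lambda: defaultdict(list))  # pod ID -> pool name -> node names
    non_ib_groups = defaultdict(list)                   # pool ID -> node names
    for node in nodes:
        if node.get("ib_pod_id") is not None:
            ib_groups[node["ib_pod_id"]][node["node_pool_name"]].append(node["name"])
        else:
            non_ib_groups[node["node_pool_id"]].append(node["name"])
    # Convert to plain dicts so the result is easy to print and compare.
    return {pod: dict(pools) for pod, pools in ib_groups.items()}, dict(non_ib_groups)


nodes = [
    {"name": "gpu-a-0", "ib_pod_id": "pod-1", "node_pool_name": "train", "node_pool_id": "np-1"},
    {"name": "gpu-a-1", "ib_pod_id": "pod-1", "node_pool_name": "train", "node_pool_id": "np-1"},
    {"name": "gpu-b-0", "ib_pod_id": "pod-2", "node_pool_name": "train", "node_pool_id": "np-1"},
    {"name": "cpu-0", "ib_pod_id": None, "node_pool_name": "system", "node_pool_id": "np-2"},
]
ib_groups, non_ib_groups = group_nodes(nodes)
```

In this sample, one node pool spans two IB pods, so its nodes appear under two separate pod groupings, while the CPU-only node is grouped by its node pool ID alone.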

Overlay Modes

You can switch between three overlay modes using the controls at the top of the Topology view.

Health Status

Note: Health Status is currently in preview. Contact Crusoe Cloud Support if you are interested in enabling this feature.

Each node displays a color-coded health status:

| Status | Color | Description |
|---|---|---|
| Healthy | Green | Node is operational and ready for workloads |
| Degraded | Yellow | Node has a non-critical issue that may affect performance |
| Unhealthy | Red | Node has a critical hardware or software failure |
| Unknown | Blue | Health status cannot be determined due to missing metrics |

An aggregated cluster count by health status appears at the top of the view.

Note: Health status combines Kubernetes node state with GPU metrics (XID codes). A node may report as Ready in Kubernetes and still be marked Degraded or Unhealthy in Topology because of GPU errors or resource pressure. Health status is also distinct from AutoClusters remediation actions: a Degraded node may be in detect-only mode, in which case no replacement is triggered.

Health Status Error Codes

Topology surfaces the following error conditions:

Degraded conditions:

| Error Code | Description |
|---|---|
| MemoryPressure | Host memory exhaustion detected |
| DiskPressure | Low disk space detected |
| PIDPressure | Process table nearing exhaustion |
| XID 119 | GPU System Processor (GSP) not responding to driver RPC requests |
| XID 120 | Driver failed to recover from GSP communication timeout |
| XID 140 | Unrecovered ECC error |
| XID 143 | GPU initialization failure |
| GPU mismatch | Fewer GPUs are allocatable than detected, indicating a GPU failure or unavailability |

Unhealthy conditions:

| Error Code | Description |
|---|---|
| Node Not Ready | Kubernetes node Ready condition is not True |
| XID 32 | Invalid or corrupted push buffer stream |
| XID 48 | Uncorrectable double-bit ECC memory error |
| XID 64 | GPU failed to record a memory error recovery action |
| XID 74 | NVLink interconnect error |
| XID 79 | GPU has fallen off the PCIe bus |
| XID 95 | Uncontained ECC error |
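
To make the mapping from error conditions to a status concrete, here is a minimal Python sketch. It is not the logic Topology runs: the `classify` function, the input shape (a set of active codes per node, or `None` when metrics are missing), and the rule that an Unhealthy condition outranks a Degraded one are all assumptions made for illustration. Only the code names come from the tables in this section.

```python
from collections import Counter

# Condition codes taken from the Degraded and Unhealthy tables in this section.
DEGRADED = {"MemoryPressure", "DiskPressure", "PIDPressure",
            "XID 119", "XID 120", "XID 140", "XID 143", "GPU mismatch"}
UNHEALTHY = {"Node Not Ready", "XID 32", "XID 48", "XID 64",
             "XID 74", "XID 79", "XID 95"}


def classify(conditions):
    """Return a health status string for one node.

    `conditions` is a set of active codes, or None when metrics are missing.
    Assumption: an Unhealthy condition outranks a Degraded one.
    """
    if conditions is None:
        return "Unknown"
    if conditions & UNHEALTHY:
        return "Unhealthy"
    if conditions & DEGRADED:
        return "Degraded"
    return "Healthy"


# Hypothetical observations for four nodes.
observed = {
    "node-a": set(),
    "node-b": {"XID 119"},
    "node-c": {"XID 79", "MemoryPressure"},
    "node-d": None,
}
statuses = {name: classify(codes) for name, codes in observed.items()}
summary = Counter(statuses.values())  # per-status totals, like the aggregated cluster count
```

Here `node-c` carries both a Degraded and an Unhealthy condition and is reported as Unhealthy under the assumed precedence, while `node-d` has no metrics and falls into Unknown.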

GPU Utilization

The GPU utilization overlay displays a heatmap across all nodes:

| Utilization Range | Color | Interpretation |
|---|---|---|
| 0–40% | Yellow | Idle or starved — node may not be receiving work |
| 40–80% | Blue | Underutilized — potential bottleneck or inefficiency |
| 80–100% | Green | Healthy utilization — node is actively processing |

Low GPU utilization on specific nodes often indicates stragglers slowing collective operations. Use this overlay to locate affected nodes, then investigate further in Metrics.
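
If you export per-node GPU utilization yourself, the same ranges can be applied offline to shortlist likely stragglers. The sketch below is hypothetical: `gpu_bucket` and `find_stragglers` are illustrative names, the sample values are made up, and because the documented ranges share their endpoints, the code assumes that 40% and 80% fall into the higher range.

```python
def gpu_bucket(utilization_pct):
    """Map a GPU utilization percentage to the overlay color and label.

    Assumption: boundary values (40 and 80) fall in the higher range.
    """
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if utilization_pct < 40:
        return ("Yellow", "Idle or starved")
    if utilization_pct < 80:
        return ("Blue", "Underutilized")
    return ("Green", "Healthy utilization")


def find_stragglers(utilization_by_node, threshold_pct=40):
    """Return names of nodes below the threshold, lowest utilization first."""
    low = [(pct, name) for name, pct in utilization_by_node.items() if pct < threshold_pct]
    return [name for pct, name in sorted(low)]


# Made-up utilization figures for four nodes in one training job.
sample = {"node-a": 96.0, "node-b": 12.5, "node-c": 91.0, "node-d": 35.0}
stragglers = find_stragglers(sample)
```

With this sample, `node-b` and `node-d` are flagged, in that order, as the nodes to inspect first in Metrics.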

CPU Utilization

The CPU utilization overlay highlights nodes with high CPU load:

| Utilization Range | Color | Interpretation |
|---|---|---|
| 0–70% | Green | Normal operating range |
| 70–90% | Blue | High load — monitor for potential contention |
| 90–100% | Yellow | Saturated — workloads may be CPU-bound |
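
The CPU ranges can be expressed as a small lookup, shown here as a hedged Python sketch using the standard-library `bisect` module. The function name `cpu_bucket` is hypothetical, and the choice to place the shared endpoints (70% and 90%) in the higher range is an assumption, since the table does not say which range owns them.

```python
from bisect import bisect_right

CPU_BOUNDARIES = [70, 90]  # range boundaries from the table
CPU_BUCKETS = [
    ("Green", "Normal operating range"),
    ("Blue", "High load"),
    ("Yellow", "Saturated"),
]


def cpu_bucket(utilization_pct):
    """Map a CPU utilization percentage to the overlay color and label.

    Assumption: boundary values (70 and 90) fall in the higher range,
    which is what bisect_right gives for a value equal to a boundary.
    """
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    return CPU_BUCKETS[bisect_right(CPU_BOUNDARIES, utilization_pct)]


color, label = cpu_bucket(93.5)
```

Keeping the boundaries in a list makes it easy to adjust them in one place if you prefer different thresholds for your own dashboards.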

Node Details

Click any node tile to open a detail panel with the following information:

  • VM name
  • Current health status, GPU utilization, or CPU utilization
  • VM state

You can perform the following actions from the node detail panel:

  • View historical metrics — Click Instance Details to navigate to node-level Metrics.
  • Generate NVIDIA bug report — Create an NVIDIA bug report (available only for NVIDIA GPUs). Download it or attach it to a support ticket.
  • Report an issue — Open a pre-filled support ticket with node information and the latest NVIDIA bug report.

AutoClusters Integration

With AutoClusters enabled, Topology reflects remediation events in real time. Node pools undergoing replacement are marked as update in progress. New healthy nodes appear once replacement completes and metrics are collected. Remediation events also appear in Notification.

What's Next

  • Metrics — Drill into node-level performance data
  • Logs — Investigate system and application logs for specific nodes
  • Notification — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks