Topology

Topology arranges your Crusoe Managed Kubernetes (CMK) cluster by network topology. Each node appears as a tile within its InfiniBand (IB) pod grouping, with color-coded overlays for health status, GPU utilization, and CPU utilization. Topology is currently available for CMK clusters with InfiniBand networks. Support for Crusoe Virtual Machines (VMs) and for RoCE networks will be available in a future release.

Accessing the Topology View

Navigate to Orchestration, select your cluster, and then open the Topology sub-tab.

Topology Layout

Nodes are organized by network connectivity:

InfiniBand (IB) pod grouping — Nodes on the same InfiniBand network are grouped by InfiniBand Pod ID and then by node pool name. Each IB pod supports up to 32 VMs (256 GPUs).
Non-IB grouping — CPU-only nodes and GPU nodes without InfiniBand (L40S, A40) are grouped by node pool ID.
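The grouping rules above can be sketched in a few lines of Python. This is only an illustration of the layout logic, not how Topology is implemented: the `Node` record, its field names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    node_pool_id: str
    node_pool_name: str
    ib_pod_id: Optional[str] = None  # None for CPU-only or non-IB GPU nodes


def group_nodes(nodes):
    """Group IB nodes by pod ID, then pool name; group other nodes by pool ID."""
    ib_groups = defaultdict(lambda: defaultdict(list))
    non_ib_groups = defaultdict(list)
    for node in nodes:
        if node.ib_pod_id is not None:
            ib_groups[node.ib_pod_id][node.node_pool_name].append(node.name)
        else:
            non_ib_groups[node.node_pool_id].append(node.name)
    # Convert to plain dicts so the result is easy to print and compare.
    return (
        {pod: dict(pools) for pod, pools in ib_groups.items()},
        dict(non_ib_groups),
    )


nodes = [
    Node("gpu-a", "np-1", "train", "pod-1"),
    Node("gpu-b", "np-1", "train", "pod-1"),
    Node("gpu-c", "np-2", "eval", "pod-2"),
    Node("cpu-a", "np-3", "system"),
]
ib_groups, non_ib_groups = group_nodes(nodes)
```

Here `gpu-a` and `gpu-b` land in the same tile group because they share both an IB pod and a node pool, while `cpu-a` falls into the non-IB grouping keyed by its node pool ID.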

Overlay Modes

You can switch between three overlay modes using the controls at the top of the Topology view.

Health Status

Note: Health Status is currently in preview. Contact Crusoe Cloud Support if you are interested in enabling this feature.

Each node displays a color-coded health status:

Status	Color	Description
Healthy	Green	Node is operational and ready for workloads
Degraded	Yellow	Node has a non-critical issue that may affect performance
Unhealthy	Red	Node has a critical hardware or software failure
Unknown	Blue	Health status cannot be determined due to missing metrics

An aggregated cluster count by health status appears at the top of the view.

Note: Health status combines Kubernetes node state with GPU metrics (XID codes). A node can report as Running in Kubernetes yet be marked Degraded or Unhealthy in Topology because of GPU errors or resource pressure. Health status is also distinct from AutoClusters remediation actions: a Degraded node may be in detect-only mode and therefore not trigger replacement.

Health Status Error Codes

Topology surfaces the following error conditions:

Degraded conditions:

Error Code	Description
MemoryPressure	Host memory exhaustion detected
DiskPressure	Low disk space detected
PIDPressure	Process table nearing exhaustion
XID 119	GPU System Processor (GSP) not responding to driver RPC requests
XID 120	Driver failed to recover from GSP communication timeout
XID 140	Unrecovered ECC error
XID 143	GPU initialization failure
GPU mismatch	GPU failure or unavailability detected (allocatable GPU count is lower than detected GPU count)

Unhealthy conditions:

Error Code	Description
Node Not Ready	Kubernetes node Ready condition is not True
XID 32	Invalid or corrupted push buffer stream
XID 48	Uncorrectable double-bit ECC memory error
XID 64	GPU failed to record a memory error recovery action
XID 74	NVLink interconnect error
XID 79	GPU has fallen off the PCIe bus
XID 95	Uncontained ECC error
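As a rough sketch of how the two tables above could map to a single status per node, the following Python function classifies a node from its Ready state, reported conditions, and observed XID codes. The precedence used here (Unknown when metrics are missing, then Unhealthy over Degraded) and the `GPUMismatch` label are assumptions made for illustration; the actual evaluation order used by Topology is not documented here.

```python
# Condition names and XID codes taken from the tables above.
# "GPUMismatch" is a hypothetical label for the allocatable < detected case.
DEGRADED_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure", "GPUMismatch"}
DEGRADED_XIDS = {119, 120, 140, 143}
UNHEALTHY_XIDS = {32, 48, 64, 74, 79, 95}


def classify_node(ready, conditions=(), xids=(), has_metrics=True):
    """Return Healthy, Degraded, Unhealthy, or Unknown for one node.

    Assumed precedence: missing metrics -> Unknown, then any unhealthy
    signal -> Unhealthy, then any degraded signal -> Degraded.
    """
    if not has_metrics:
        return "Unknown"
    if not ready or UNHEALTHY_XIDS & set(xids):
        return "Unhealthy"
    if DEGRADED_CONDITIONS & set(conditions) or DEGRADED_XIDS & set(xids):
        return "Degraded"
    return "Healthy"


# A Ready node with XID 119 is Degraded; adding XID 79 makes it Unhealthy.
status_degraded = classify_node(True, xids=[119])
status_unhealthy = classify_node(True, conditions=["DiskPressure"], xids=[119, 79])
```

This also illustrates why a node that Kubernetes considers ready can still show as Degraded or Unhealthy: the XID codes are evaluated alongside the node state.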

GPU Utilization

The GPU utilization overlay displays a heatmap across all nodes:

Utilization Range	Color	Interpretation
0–40%	Yellow	Idle or starved — node may not be receiving work
40–80%	Blue	Underutilized — potential bottleneck or inefficiency
80–100%	Green	Healthy utilization — node is actively processing

Low GPU utilization on specific nodes often indicates stragglers slowing collective operations. Use this overlay to locate affected nodes, then investigate further in Metrics.
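If you export per-node GPU utilization (for example, from Metrics), the same bands can be applied programmatically to list likely stragglers. The sketch below is illustrative only: the function names and sample values are hypothetical, and because the table's ranges share their boundary values, treating exactly 40% as Blue and exactly 80% as Green is an assumption.

```python
def gpu_overlay_color(utilization_pct: float) -> str:
    """Map a GPU utilization percentage to the overlay color in the table.

    Assumption: boundary values belong to the higher band
    (40 -> Blue, 80 -> Green).
    """
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if utilization_pct < 40:
        return "Yellow"
    if utilization_pct < 80:
        return "Blue"
    return "Green"


def find_stragglers(gpu_util_by_node, threshold_pct=40.0):
    """Return node names whose utilization falls below the threshold, sorted."""
    return sorted(
        name for name, pct in gpu_util_by_node.items() if pct < threshold_pct
    )


# Hypothetical sample: one node is far below its peers.
sample = {"node-a": 96.0, "node-b": 12.5, "node-c": 91.0, "node-d": 55.0}
stragglers = find_stragglers(sample)
```

With this sample, only `node-b` falls in the Yellow band, which makes it the first candidate to inspect in Metrics.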

CPU Utilization

The CPU utilization overlay highlights nodes with high CPU load:

Utilization Range	Color	Interpretation
0–70%	Green	Normal operating range
70–90%	Blue	High load — monitor for potential contention
90–100%	Yellow	Saturated — workloads may be CPU-bound
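The CPU bands can be expressed as a small threshold lookup. This is a minimal sketch, assuming that the boundary values 70% and 90% belong to the higher band. The table does not specify this, and the names below are hypothetical.

```python
from bisect import bisect_right

# Upper edges of the first two bands, from the table above.
CPU_THRESHOLDS = [70, 90]
CPU_BANDS = [
    ("Green", "Normal operating range"),
    ("Blue", "High load"),
    ("Yellow", "Saturated"),
]


def cpu_overlay_band(utilization_pct: float):
    """Return (color, interpretation) for a CPU utilization percentage."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    # bisect_right counts how many thresholds are <= the value,
    # which is the index of the band the value falls into.
    return CPU_BANDS[bisect_right(CPU_THRESHOLDS, utilization_pct)]


color, meaning = cpu_overlay_band(93.0)
```

Keeping the thresholds in a list means the bands can be adjusted in one place if the ranges change.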

Node Details

Click any node tile to open a detail panel with the following information:

VM name
Current health status, GPU utilization, or CPU utilization, depending on the selected overlay
VM state

You can perform the following actions from the node detail panel:

View historical metrics — Click Instance Details to navigate to node-level Metrics.
Generate NVIDIA bug report — Create an NVIDIA bug report (available only for NVIDIA GPUs). Download it or attach it to a support ticket.
Report an issue — Open a pre-filled support ticket that includes node information and the latest NVIDIA bug report.

AutoClusters Integration

With AutoClusters enabled, Topology reflects remediation events in real time. Node pools undergoing replacement are marked as update in progress. New healthy nodes appear once replacement completes and metrics are collected. Remediation events also appear in Notifications.

What's Next

Metrics — Drill into node-level performance data
Logs — Investigate system and application logs for specific nodes
Notifications — Get notified about resource health by email and in the console, and set up alert routing to Slack or webhooks
