Topology
Topology arranges your Crusoe Managed Kubernetes (CMK) cluster by network topology. Each node appears as a tile within its InfiniBand (IB) pod grouping, with color-coded overlays for health status, GPU utilization, and CPU utilization. Topology is currently available for CMK clusters on InfiniBand networks; support for Crusoe Virtual Machines (VMs) and RoCE networks is planned for a future release.
Accessing the Topology View
Navigate to Orchestration, select your cluster, and open the Topology sub-tab.
Topology Layout
Nodes are organized by network connectivity:
- InfiniBand (IB) pod grouping — Nodes on the same InfiniBand network are grouped by InfiniBand Pod ID and then by node pool name. Each IB pod supports up to 32 VMs (256 GPUs).
- Non-IB grouping — CPU-only nodes and GPU nodes without InfiniBand (L40S, A40) are grouped by node pool ID.
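The grouping rules above can be sketched in a few lines of Python. This is only an illustration of the layout logic, not Crusoe's implementation: the `Node` class, its field names, and the `group_nodes` helper are hypothetical, and the figure of 8 GPUs per VM is inferred from the stated limit of 32 VMs (256 GPUs) per IB pod.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

MAX_VMS_PER_IB_POD = 32  # stated in the text above
GPUS_PER_VM = 8          # assumption: implied by 32 VMs = 256 GPUs


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; field names are illustrative only."""
    name: str
    node_pool_id: str
    node_pool_name: str
    ib_pod_id: Optional[str] = None  # None for CPU-only or non-IB GPU nodes


def group_nodes(nodes):
    """Return (ib_groups, non_ib_groups).

    ib_groups:     {ib_pod_id: {node_pool_name: [node names]}}
    non_ib_groups: {node_pool_id: [node names]}
    """
    ib_groups = defaultdict(lambda: defaultdict(list))
    non_ib_groups = defaultdict(list)
    for node in nodes:
        if node.ib_pod_id is not None:
            # IB nodes: group by IB pod ID, then by node pool name.
            ib_groups[node.ib_pod_id][node.node_pool_name].append(node.name)
        else:
            # Non-IB nodes: group by node pool ID.
            non_ib_groups[node.node_pool_id].append(node.name)
    # Convert to plain dicts so the result prints and compares cleanly.
    return (
        {pod: dict(pools) for pod, pools in ib_groups.items()},
        dict(non_ib_groups),
    )


nodes = [
    Node("gpu-a", "np-1", "train", ib_pod_id="pod-0"),
    Node("gpu-b", "np-1", "train", ib_pod_id="pod-0"),
    Node("gpu-c", "np-2", "eval", ib_pod_id="pod-1"),
    Node("cpu-a", "np-3", "system"),
]
ib_groups, non_ib_groups = group_nodes(nodes)
print(ib_groups)
print(non_ib_groups)
print(MAX_VMS_PER_IB_POD * GPUS_PER_VM)  # GPUs in a full IB pod
```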
Overlay Modes
You can switch between three overlay modes (Health Status, GPU Utilization, and CPU Utilization) using the controls at the top of the Topology view.
Health Status
Health Status is currently in preview. Contact Crusoe Cloud Support if you are interested in enabling this feature.
Each node displays a color-coded health status:
| Status | Color | Description |
|---|---|---|
| Healthy | Green | Node is operational and ready for workloads |
| Degraded | Yellow | Node has a non-critical issue that may affect performance |
| Unhealthy | Red | Node has a critical hardware or software failure |
| Unknown | Blue | Health status cannot be determined due to missing metrics |
An aggregated cluster count by health status appears at the top of the view.
Health status combines Kubernetes node state with metrics (XID codes). A node can report as Running in Kubernetes yet be marked Degraded or Unhealthy in Topology because of GPU errors or resource pressure. Health status is also distinct from AutoClusters remediation actions: a Degraded node may be in detect-only mode and not trigger replacement.
Health Status Error Codes
Topology surfaces the following error conditions:
Degraded conditions:
| Error Code | Description |
|---|---|
| MemoryPressure | Host memory exhaustion detected |
| DiskPressure | Low disk space detected |
| PIDPressure | Process table nearing exhaustion |
| XID 119 | GPU System Processor (GSP) not responding to driver RPC requests |
| XID 120 | Driver failed to recover from GSP communication timeout |
| XID 140 | Unrecovered ECC error |
| XID 143 | GPU initialization failure |
| GPU mismatch | Fewer GPUs are allocatable than detected (allocatable < detected), indicating a GPU failure or unavailability |
Unhealthy conditions:
| Error Code | Description |
|---|---|
| Node Not Ready | Kubernetes node Ready condition is not True |
| XID 32 | Invalid or corrupted push buffer stream |
| XID 48 | Uncorrectable double-bit ECC memory error |
| XID 64 | GPU failed to record a memory error recovery action |
| XID 74 | NVLink interconnect error |
| XID 79 | GPU has fallen off the PCIe bus |
| XID 95 | Uncontained ECC error |
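To make the two tables easier to reason about, here is a minimal Python sketch of how the listed conditions might map to a health status. It is not Crusoe's implementation: the `classify_node` function and its parameters are hypothetical, and two behaviors are assumptions, namely that an Unhealthy condition takes precedence over a Degraded one and that missing metrics are modeled as `ready=None`.

```python
from collections import Counter

# Condition names and XID codes copied from the tables above.
DEGRADED_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure"}
DEGRADED_XIDS = {119, 120, 140, 143}
UNHEALTHY_XIDS = {32, 48, 64, 74, 79, 95}


def classify_node(ready, conditions=(), xids=(),
                  allocatable_gpus=None, detected_gpus=None):
    """Return 'Healthy', 'Degraded', 'Unhealthy', or 'Unknown'.

    ready: True/False for the node Ready state, or None when metrics
    are missing (assumption: this is how 'Unknown' is modeled here).
    """
    if ready is None:
        return "Unknown"
    # Assumption: any Unhealthy condition outranks Degraded ones.
    if not ready or UNHEALTHY_XIDS & set(xids):
        return "Unhealthy"
    gpu_mismatch = (
        allocatable_gpus is not None
        and detected_gpus is not None
        and allocatable_gpus < detected_gpus
    )
    if (DEGRADED_CONDITIONS & set(conditions)
            or DEGRADED_XIDS & set(xids)
            or gpu_mismatch):
        return "Degraded"
    return "Healthy"


# Hypothetical observations for a four-node cluster.
fleet = {
    "node-a": {"ready": True},
    "node-b": {"ready": True, "xids": [119]},
    "node-c": {"ready": True, "xids": [119, 79]},
    "node-d": {"ready": None},
}
statuses = {name: classify_node(**obs) for name, obs in fleet.items()}
print(statuses)
# Aggregated count by health status, like the summary at the top of the view.
print(Counter(statuses.values()))
```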
GPU Utilization
The GPU utilization overlay displays a heatmap across all nodes:
| Utilization Range | Color | Interpretation |
|---|---|---|
| 0–40% | Yellow | Idle or starved — node may not be receiving work |
| 40–80% | Blue | Underutilized — potential bottleneck or inefficiency |
| 80–100% | Green | Healthy utilization — node is actively processing |
Low GPU utilization on specific nodes often indicates stragglers slowing collective operations. Use this overlay to locate affected nodes, then investigate further in Metrics.
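As a rough illustration of the heatmap bands and the straggler check described above, the sketch below maps a utilization percentage to its band and lists the nodes that fall in the lowest one. The function names and sample data are hypothetical. Because the table's ranges share their endpoints, the sketch assumes that a boundary value (40 or 80) belongs to the higher band; the actual behavior may differ.

```python
def gpu_band(utilization_pct):
    """Map a GPU utilization percentage to (color, interpretation)."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    # Assumption: boundary values (40, 80) fall into the higher band.
    if utilization_pct < 40:
        return ("Yellow", "Idle or starved")
    if utilization_pct < 80:
        return ("Blue", "Underutilized")
    return ("Green", "Healthy utilization")


def low_utilization_nodes(utilization_by_node):
    """Return node names in the Yellow band, lowest utilization first."""
    yellow = [
        (pct, name)
        for name, pct in utilization_by_node.items()
        if gpu_band(pct)[0] == "Yellow"
    ]
    return [name for pct, name in sorted(yellow)]


# Hypothetical per-node GPU utilization during a training job.
sample = {"node-a": 92.0, "node-b": 35.5, "node-c": 88.1, "node-d": 12.0}
print(low_utilization_nodes(sample))
```

Nodes returned by a check like this are candidates for a closer look in Metrics, not confirmed stragglers.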
CPU Utilization
The CPU utilization overlay highlights nodes with high CPU load:
| Utilization Range | Color | Interpretation |
|---|---|---|
| 0–70% | Green | Normal operating range |
| 70–90% | Blue | High load — monitor for potential contention |
| 90–100% | Yellow | Saturated — workloads may be CPU-bound |
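The CPU bands can be expressed as a threshold lookup. The following Python sketch uses the standard-library `bisect` module. The `cpu_band` name is hypothetical, and as with any table whose ranges share endpoints, the handling of the exact values 70 and 90 is an assumption (here they fall into the higher band).

```python
from bisect import bisect_right

# Upper edges of the first two bands, taken from the table above.
CPU_THRESHOLDS = [70, 90]
CPU_BANDS = [
    ("Green", "Normal operating range"),
    ("Blue", "High load"),
    ("Yellow", "Saturated"),
]


def cpu_band(utilization_pct):
    """Map a CPU utilization percentage to (color, interpretation)."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    # bisect_right counts thresholds <= value, so 70 and 90 land in the
    # higher band (an assumption about the boundary behavior).
    return CPU_BANDS[bisect_right(CPU_THRESHOLDS, utilization_pct)]


for pct in (12.5, 70, 95):
    print(pct, cpu_band(pct))
```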
Node Details
Click any node tile to open a detail panel with the following information:
- VM name
- Current health status, GPU utilization, or CPU utilization, depending on the selected overlay
- VM state
You can perform the following actions from the node detail panel:
- View historical metrics — Click Instance Details to navigate to node-level Metrics.
- Generate NVIDIA bug report — Create an NVIDIA bug report (available only for NVIDIA GPUs). Download it or attach it to a support ticket.
- Report an issue — Open a pre-filled support ticket with node information and the latest NVIDIA bug report.
AutoClusters Integration
With AutoClusters enabled, Topology reflects remediation events in real time. Node pools undergoing replacement are marked as update in progress. New healthy nodes appear once replacement completes and metrics are collected. Remediation events also appear in Notification.
What's Next
- Metrics — Drill into node-level performance data
- Logs — Investigate system and application logs for specific nodes
- Notification — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks