Infrastructure and Topology Overview
Command Center provides two complementary views for monitoring your infrastructure at a glance: Infrastructure Overview and Topology. Both surfaces show GPU utilization and instance health, organized at different levels of granularity.
Infrastructure Overview
When you open Command Center, you land on the Infrastructure Overview page, a project-level summary of all compute resources grouped by instance type, such as H100, GB200, and MI355X. Standalone VMs and Crusoe Managed Kubernetes (CMK) cluster nodes are organized together by hardware type. For example, instances on the same network rail (e.g., NVL72) appear in a single tile, regardless of whether they run as VMs or CMK cluster nodes.
Each tile shows:
- Node count
- Average GPU utilization
- P95 GPU power draw
- P95 GPU temperature
- InfiniBand throughput
- Instance counts by health status
Click a tile to drill into cluster metrics or VM metrics for that instance type. Infrastructure Overview requires the Crusoe Watch Agent and is available by default in the Crusoe Console.
To access it, select Command Center in the left navigation, then select Infra Overview.
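To make the tile fields concrete, the following is a minimal sketch of how per-node samples could be rolled up into one summary per instance type. It is not the Command Center implementation: the `NodeSample` fields, the health labels, and the nearest-rank percentile method are all assumptions made for illustration, and InfiniBand throughput is left out for brevity.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeSample:
    # All field names are hypothetical, chosen for this sketch.
    instance_type: str   # e.g. "H100"
    gpu_util_pct: float  # GPU utilization, 0-100
    gpu_power_w: float   # GPU power draw in watts
    gpu_temp_c: float    # GPU temperature in degrees Celsius
    health: str          # hypothetical label, e.g. "healthy"


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list (assumed method)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize_tiles(samples):
    """Group samples by instance type and compute one summary per tile."""
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample.instance_type].append(sample)

    tiles = {}
    for instance_type, group in by_type.items():
        tiles[instance_type] = {
            "node_count": len(group),
            "avg_gpu_util_pct": mean([s.gpu_util_pct for s in group]),
            "p95_gpu_power_w": p95([s.gpu_power_w for s in group]),
            "p95_gpu_temp_c": p95([s.gpu_temp_c for s in group]),
            "health_counts": dict(Counter(s.health for s in group)),
        }
    return tiles
```

The sketch groups by instance type only, mirroring the point above that VMs and CMK cluster nodes of the same hardware type share a tile.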
Topology
Topology arranges your CMK cluster by network topology. Each node appears as a tile within its InfiniBand (IB) pod grouping, with color-coded overlays for health status, GPU utilization, and CPU utilization. Topology is available for Crusoe Managed Kubernetes (CMK) with InfiniBand networks and will be available for Crusoe Virtual Machines (VMs) and for RoCE networks in a future release.
Accessing the Topology View
Navigate to Orchestration, select your cluster, then open the Topology sub-tab.
Topology Layout
Nodes are organized by network connectivity:
- InfiniBand (IB) pod grouping: Nodes on the same InfiniBand network are grouped by InfiniBand Pod ID and then by node pool name. Each IB pod supports up to 32 VMs (256 GPUs).
- Non-IB grouping: CPU-only nodes and GPU nodes without InfiniBand (L40S, A40) are grouped by node pool ID.
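The two grouping rules can be sketched as follows. This is an illustrative model only, not Crusoe's API: the dictionary keys (`name`, `ib_pod_id`, `node_pool_name`, `node_pool_id`) are hypothetical, and the 8 GPUs per VM figure is simply what the stated limit of 32 VMs and 256 GPUs per pod implies.

```python
from collections import defaultdict

MAX_VMS_PER_IB_POD = 32
MAX_GPUS_PER_IB_POD = 256
# Implied by the limits above: 256 GPUs / 32 VMs = 8 GPUs per VM.
GPUS_PER_VM = MAX_GPUS_PER_IB_POD // MAX_VMS_PER_IB_POD


def group_nodes(nodes):
    """Split nodes into IB pod groups and non-IB node pool groups.

    Each node is a dict with hypothetical keys: "name", "ib_pod_id"
    (None when the node has no InfiniBand), "node_pool_name", and
    "node_pool_id".
    """
    ib_groups = defaultdict(lambda: defaultdict(list))
    non_ib_groups = defaultdict(list)

    for node in nodes:
        pod_id = node.get("ib_pod_id")
        if pod_id is not None:
            # IB nodes: first by pod ID, then by node pool name.
            ib_groups[pod_id][node["node_pool_name"]].append(node["name"])
        else:
            # CPU-only and non-IB GPU nodes: by node pool ID.
            non_ib_groups[node["node_pool_id"]].append(node["name"])

    # Convert to plain dicts so callers get ordinary nested mappings.
    ib_plain = {pod: dict(pools) for pod, pools in ib_groups.items()}
    return ib_plain, dict(non_ib_groups)
```

For example, two H100 nodes in pod `pod-a` under pool `train` and one CPU-only node in pool ID `np-3` would produce `{"pod-a": {"train": [...]}}` and `{"np-3": [...]}` respectively.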
Overlay Modes
You can switch between three overlay modes using the controls at the top of the Topology view.
Health Status
Each node displays a color-coded health status. An aggregated cluster count by health status appears at the top of the view. For status definitions and the full list of error codes, see Instance Health.
GPU Utilization
The GPU utilization overlay displays a heatmap across all nodes:
| Utilization Range | Color | Interpretation |
|---|---|---|
| 0–40% | Light Blue | Idle or starved — node may not be receiving work |
| 40–80% | Blue | Underutilized — potential bottleneck or inefficiency |
| 80–100% | Green | Healthy utilization — node is actively processing |
Low GPU utilization on specific nodes often indicates stragglers slowing collective operations. Use this overlay to locate affected nodes, then investigate further in Metrics.
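As a rough illustration of the straggler check described above, the sketch below maps a utilization value to its heatmap color and lists the nodes that fall below the healthy band. The function names are hypothetical. Because the table's ranges meet at 40% and 80%, the sketch assumes each boundary value belongs to the higher band; the console may treat boundaries differently.

```python
def gpu_band(util_pct):
    """Return the heatmap color for a GPU utilization percentage.

    Assumption: 40 and 80 fall into the higher band.
    """
    if not 0 <= util_pct <= 100:
        raise ValueError(f"utilization out of range: {util_pct}")
    if util_pct < 40:
        return "Light Blue"  # idle or starved
    if util_pct < 80:
        return "Blue"        # underutilized
    return "Green"           # healthy utilization


def low_utilization_nodes(node_utils):
    """List (node, utilization) pairs below the healthy band, lowest first.

    node_utils maps a node name to its GPU utilization percentage.
    """
    low = [(name, util) for name, util in node_utils.items()
           if gpu_band(util) != "Green"]
    return sorted(low, key=lambda item: item[1])
```

Nodes returned by `low_utilization_nodes` are candidates for further investigation in Metrics, not confirmed stragglers; low utilization can also mean a node simply has no work scheduled.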
CPU Utilization
The CPU utilization overlay highlights nodes with high CPU load:
| Utilization Range | Color | Interpretation |
|---|---|---|
| 0–70% | Green | Normal operating range |
| 70–90% | Blue | High load — monitor for potential contention |
| 90–100% | Light Blue | Saturated — workloads may be CPU-bound |
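The CPU bands can be applied across a cluster in the same spirit. The sketch below counts nodes per band using the table's thresholds. The band labels and function names are hypothetical, and, as the table's ranges meet at 70% and 90%, each boundary value is assumed to belong to the higher band.

```python
from bisect import bisect_right
from collections import Counter

# Lower bounds of the second and third bands, taken from the table.
CPU_BAND_STARTS = [70, 90]
# Hypothetical short labels for the three interpretations.
CPU_BAND_LABELS = ["Normal", "High load", "Saturated"]


def cpu_band(util_pct):
    """Return the band label for a CPU utilization percentage (0-100)."""
    if not 0 <= util_pct <= 100:
        raise ValueError(f"utilization out of range: {util_pct}")
    # bisect_right counts how many band starts are <= util_pct,
    # which is the index of the band the value falls into.
    return CPU_BAND_LABELS[bisect_right(CPU_BAND_STARTS, util_pct)]


def count_cpu_bands(node_utils):
    """Count nodes per CPU band; node_utils maps node name to percent."""
    return Counter(cpu_band(util) for util in node_utils.values())
```

A cluster where most nodes land in the saturated band is more likely to be CPU-bound overall, whereas a single saturated node points to a local problem.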
Node Details
Click any node tile to open a detail panel with the following information:
- VM name
- Current health status, GPU utilization, or CPU utilization, depending on the selected overlay
- VM state
You can perform the following actions from the node detail panel:
- View historical metrics: Click Instance Details to navigate to node-level Metrics.
- Generate NVIDIA bug report: Create an NVIDIA bug report (available only for NVIDIA GPUs), then download it or attach it to a support ticket. For possible error messages during generation, see Bug Report Generation Errors.
- Report an issue: Open a pre-filled support ticket that includes node information and the latest NVIDIA bug report.
AutoClusters Integration
With AutoClusters enabled, Topology reflects remediation events in real time. Node pools undergoing replacement are marked as update in progress. New healthy nodes appear once replacement completes and metrics are collected. Remediation events also appear in Notifications.
What's Next
- Instance Health: Understand health status categories and error codes
- Metrics: Drill into node-level performance data
- Logs: Investigate system and application logs for specific nodes
- Notifications: Get notified about resource health by email and in the console, and set up alert routing to Slack or webhooks