Infrastructure and Topology Overview

Command Center provides two complementary views for monitoring your infrastructure at a glance: Infrastructure Overview and Topology. Both surfaces show GPU utilization and instance health, organized at different levels of granularity.

Infrastructure Overview

When you open Command Center, you land on the Infrastructure Overview page, a project-level summary of all compute resources grouped by instance type, such as H100, GB200, and MI355X. Standalone VMs and Crusoe Managed Kubernetes (CMK) cluster nodes are organized together by hardware type. For example, instances running on the same network rail (e.g., NVL72) appear in a single tile, regardless of whether they run as VMs or CMK cluster nodes.

Each tile shows:

  • Node count
  • Average GPU utilization
  • P95 GPU power draw
  • P95 GPU temperature
  • InfiniBand throughput
  • Instance counts by health status
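
As an illustration of how per-tile figures like these could be derived, the sketch below groups per-node samples by instance type and computes the node count, average GPU utilization, P95 power draw, P95 temperature, and health counts. It is not Command Center's implementation: the `NodeSample` fields, the health status strings, and the nearest-rank percentile method are all assumptions made for the example, and InfiniBand throughput is left out because the source does not say how it is aggregated.

```python
import math
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeSample:
    # Hypothetical record shape; field names are assumptions for this sketch.
    instance_type: str
    gpu_util_pct: float
    gpu_power_w: float
    gpu_temp_c: float
    health: str


def p95(values):
    """Nearest-rank 95th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize_tiles(samples):
    """Group samples by instance type and compute one summary per tile."""
    by_type = {}
    for sample in samples:
        by_type.setdefault(sample.instance_type, []).append(sample)

    tiles = {}
    for instance_type, group in by_type.items():
        tiles[instance_type] = {
            "node_count": len(group),
            "avg_gpu_util_pct": mean(s.gpu_util_pct for s in group),
            "p95_gpu_power_w": p95([s.gpu_power_w for s in group]),
            "p95_gpu_temp_c": p95([s.gpu_temp_c for s in group]),
            "health_counts": dict(Counter(s.health for s in group)),
        }
    return tiles
```

Grouping by `instance_type` alone mirrors the behavior described above, where VMs and CMK cluster nodes of the same hardware type share a tile.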

Click a tile to drill into cluster metrics or VM metrics for that instance type. Infrastructure Overview requires the Crusoe Watch Agent and is available by default in the Crusoe Console.

To access it, select Command Center in the left navigation, then select Infra Overview.

Topology

Topology arranges your Crusoe Managed Kubernetes (CMK) cluster by network topology. Each node appears as a tile within its InfiniBand (IB) pod grouping, with color-coded overlays for health status, GPU utilization, and CPU utilization. Topology is currently available for CMK clusters with InfiniBand networks. Support for Crusoe Virtual Machines (VMs) and for RoCE networks will be available in a future release.

Accessing the Topology View

Navigate to Orchestration, select your cluster, then open the Topology sub-tab.

Topology Layout

Nodes are organized by network connectivity:

  • InfiniBand (IB) pod grouping — Nodes on the same InfiniBand network are grouped by InfiniBand Pod ID and then by node pool name. Each IB pod supports up to 32 VMs (256 GPUs).
  • Non-IB grouping — CPU-only nodes and GPU nodes without InfiniBand (L40S, A40S) are grouped by node pool ID.
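
The two grouping rules above can be sketched as follows. This is a minimal illustration, not the console's code: the node dictionary keys (`name`, `ib_pod_id`, `node_pool_name`, `node_pool_id`) are hypothetical, and the figure of 8 GPUs per VM is inferred from the stated limit of 32 VMs (256 GPUs) per IB pod.

```python
from collections import defaultdict

MAX_VMS_PER_IB_POD = 32  # stated limit per IB pod
GPUS_PER_VM = 8          # inferred: 256 GPUs / 32 VMs
MAX_GPUS_PER_IB_POD = MAX_VMS_PER_IB_POD * GPUS_PER_VM


def group_nodes(nodes):
    """Split nodes into IB groups (pod ID, then pool name) and non-IB groups (pool ID)."""
    ib_groups = defaultdict(lambda: defaultdict(list))
    non_ib_groups = defaultdict(list)

    for node in nodes:
        pod_id = node.get("ib_pod_id")  # assumed to be absent or None for non-IB nodes
        if pod_id:
            ib_groups[pod_id][node["node_pool_name"]].append(node["name"])
        else:
            non_ib_groups[node["node_pool_id"]].append(node["name"])

    # Convert to plain dicts so callers get ordinary, comparable structures.
    ib_plain = {pod: dict(pools) for pod, pools in ib_groups.items()}
    return ib_plain, dict(non_ib_groups)
```

Nesting pool names under pod IDs keeps the layout order described above: first the IB pod, then the node pool inside it.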

Overlay Modes

You can switch between three overlay modes using the controls at the top of the Topology view.

Health Status

Each node displays a color-coded health status. An aggregated cluster count by health status appears at the top of the view. For status definitions and the full list of error codes, see Instance Health.

GPU Utilization

The GPU utilization overlay displays a heatmap across all nodes:

Utilization Range | Color      | Interpretation
0–40%             | Light Blue | Idle or starved: node may not be receiving work
40–80%            | Blue       | Underutilized: potential bottleneck or inefficiency
80–100%           | Green      | Healthy utilization: node is actively processing

Low GPU utilization on specific nodes often indicates stragglers slowing collective operations. Use this overlay to locate affected nodes, then investigate further in Metrics.
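
If you export per-node utilization values, the same banding can be reproduced to shortlist candidate stragglers. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and because the documented ranges share their boundary values (40% and 80%), the code assumes a boundary value belongs to the higher band.

```python
def gpu_band(util_pct):
    """Map a GPU utilization percentage to the overlay color for its band."""
    if not 0 <= util_pct <= 100:
        raise ValueError(f"utilization out of range: {util_pct}")
    # Assumption: boundary values (40, 80) fall into the higher band.
    if util_pct < 40:
        return "Light Blue"  # idle or starved
    if util_pct < 80:
        return "Blue"        # underutilized
    return "Green"           # healthy utilization


def nodes_below_green(node_utils):
    """Return node names outside the Green band, lowest utilization first."""
    below = [(util, name) for name, util in node_utils.items()
             if gpu_band(util) != "Green"]
    return [name for util, name in sorted(below)]
```

Sorting by utilization puts the most idle nodes first, which are the likeliest stragglers to inspect in Metrics.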

CPU Utilization

The CPU utilization overlay highlights nodes with high CPU load:

Utilization Range | Color      | Interpretation
0–70%             | Green      | Normal operating range
70–90%            | Blue       | High load: monitor for potential contention
90–100%           | Light Blue | Saturated: workloads may be CPU-bound

Node Details

Click any node tile to open a detail panel with the following information:

  • VM name
  • Current health status, GPU utilization, or CPU utilization, depending on the selected overlay
  • VM state

You can perform the following actions from the node detail panel:

  • View historical metrics: Click Instance Details to navigate to node-level Metrics.
  • Generate NVIDIA bug report: Create an NVIDIA bug report (only available for NVIDIA GPUs), then download it or attach it to a support ticket. For possible error messages during generation, see Bug Report Generation Errors.
  • Report an issue: Open a pre-filled support ticket that includes node information and the latest NVIDIA bug report.

AutoClusters Integration

With AutoClusters enabled, Topology reflects remediation events in real time. Node pools undergoing replacement are marked as "update in progress". New healthy nodes appear once replacement completes and their metrics are collected. Remediation events also appear in Notifications.

What's Next

  • Instance Health: Understand health status categories and error codes
  • Metrics: Drill into node-level performance data
  • Logs: Investigate system and application logs for specific nodes
  • Notifications: Get notified about resource health by email and in the console, and set up alert routing to Slack or webhooks