Instance Health
Instance Health is in preview for both CMK clusters and VMs. It is not available for multi-tenant instances (L40s, A100).
Command Center displays a health status for each CMK node and standalone VM, giving you immediate visibility into infrastructure issues without manual log inspection. Health status is computed from telemetry signals collected by the Crusoe Watch Agent and refreshes every 60 seconds.
Health status is visible on the Infrastructure Overview pages, Topology View for CMK clusters, and VM detail pages.
Health Status Categories
There are four health status categories: Healthy, Degraded, Unhealthy, and Not Evaluated. For the conditions that determine each status, see Error Codes by Status below.
Error Codes by Status
The conditions are listed by resource type: Virtual Machines first, then CMK Clusters.
Healthy
A VM is Healthy when it is in a Running state with no active error conditions and agent telemetry is flowing normally.
| Condition |
|---|
| VM is in Running state |
| No active XID errors, no disruptive lifecycle events, agent telemetry is flowing |
Degraded Conditions
The following conditions result in a Degraded status:
| Error Code | Description |
|---|---|
| XID 119 | GPU System Processor (GSP) not responding to driver RPC requests |
| XID 120 | Driver failed to recover from GSP communication timeout |
| XID 140 | Unrecovered ECC error |
| XID 143 | GPU initialization failure |
Unhealthy Conditions
The following conditions result in an Unhealthy status:
| Error Code | Description |
|---|---|
| VM not in Running state | VM intended state is Running but current state is not |
| XID 32 | Invalid or corrupted push buffer stream |
| XID 48 | Uncorrectable double-bit ECC memory error |
| XID 64 | GPU failed to record a memory error recovery action |
| XID 74 | NVLink interconnect error |
| XID 79 | GPU has fallen off the PCIe bus |
| XID 95 | Uncontained ECC error |
| GPUFellOffTheBus | GPU lost from PCIe bus (lifecycle event) |
| HCAFellOffTheBus | Host Channel Adapter (InfiniBand) lost (lifecycle event) |
| Loss of agent telemetry | Crusoe Watch Agent stopped reporting |
Not Evaluated
A VM is Not Evaluated when it is in a transitional state (being provisioned or deleted), or when the Crusoe Watch Agent is not installed.
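The VM rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the Command Center implementation. The `VMSnapshot` record and its field names are hypothetical, and the evaluation order (Not Evaluated, then Unhealthy, then Degraded) is an assumption, since the tables do not say which status wins when several conditions are active at once.

```python
from dataclasses import dataclass, field

# XID codes and lifecycle events copied from the Degraded and Unhealthy tables above.
DEGRADED_XIDS = {119, 120, 140, 143}
UNHEALTHY_XIDS = {32, 48, 64, 74, 79, 95}
UNHEALTHY_EVENTS = {"GPUFellOffTheBus", "HCAFellOffTheBus"}


@dataclass
class VMSnapshot:
    """Hypothetical input record; the field names are illustrative only."""
    intended_state: str = "Running"
    current_state: str = "Running"
    transitional: bool = False       # being provisioned or deleted
    agent_installed: bool = True     # Crusoe Watch Agent present
    telemetry_flowing: bool = True   # agent is still reporting
    active_xids: set = field(default_factory=set)
    lifecycle_events: set = field(default_factory=set)


def vm_health(vm: VMSnapshot) -> str:
    """Classify a VM, assuming the most severe matching status wins."""
    if vm.transitional or not vm.agent_installed:
        return "Not Evaluated"
    if vm.intended_state == "Running" and vm.current_state != "Running":
        return "Unhealthy"
    if not vm.telemetry_flowing:
        return "Unhealthy"
    if vm.active_xids & UNHEALTHY_XIDS or vm.lifecycle_events & UNHEALTHY_EVENTS:
        return "Unhealthy"
    if vm.active_xids & DEGRADED_XIDS:
        return "Degraded"
    # Simplification: cases the tables do not describe (an XID absent from
    # both tables, an intentionally stopped VM) fall through to Healthy here.
    return "Healthy"


if __name__ == "__main__":
    print(vm_health(VMSnapshot(active_xids={119})))      # a listed Degraded XID
    print(vm_health(VMSnapshot(active_xids={119, 79})))  # Degraded and Unhealthy XIDs together
```

Under the stated precedence assumption, a VM reporting both XID 119 and XID 79 is classified as Unhealthy rather than Degraded.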
Healthy
A CMK node is Healthy when it is in a Ready state with no active error conditions and agent telemetry is flowing normally.
| Condition |
|---|
| CMK node is in Ready state |
| No active XID errors, no disruptive lifecycle events, agent telemetry is flowing |
Degraded Conditions
The following conditions result in a Degraded status:
| Error Code | Description |
|---|---|
| MemoryPressure | Host memory exhaustion detected |
| DiskPressure | Low disk space detected |
| PIDPressure | Process table nearing exhaustion |
| XID 119 | GPU System Processor (GSP) not responding to driver RPC requests |
| XID 120 | Driver failed to recover from GSP communication timeout |
| XID 140 | Unrecovered ECC error |
| XID 143 | GPU initialization failure |
| GPU mis-match | GPU capacity exceeds allocatable count, indicating partial GPU failure |
Unhealthy Conditions
The following conditions result in an Unhealthy status:
| Error Code | Description |
|---|---|
| Node Not Ready | Kubernetes node Ready condition is not True |
| XID 32 | Invalid or corrupted push buffer stream |
| XID 48 | Uncorrectable double-bit ECC memory error |
| XID 64 | GPU failed to record a memory error recovery action |
| XID 74 | NVLink interconnect error |
| XID 79 | GPU has fallen off the PCIe bus |
| XID 95 | Uncontained ECC error |
| GPU unavailable | GPU capacity is zero — all GPUs on the node are unavailable |
| GPUFellOffTheBus | GPU lost from PCIe bus (lifecycle event) |
| HCAFellOffTheBus | Host Channel Adapter (InfiniBand) lost (lifecycle event) |
| Loss of agent telemetry | Crusoe Watch Agent stopped reporting |
Not Evaluated
A CMK node is Not Evaluated when it is in a transitional state (being provisioned or deleted), or when the Crusoe Watch Agent is not installed.
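The node-specific rows in the CMK tables (the Ready condition, the three pressure conditions, and the GPU capacity checks) can be sketched as follows. This is an illustration under assumptions, not the product's logic: the function names are hypothetical, the node is assumed to be a GPU node, and the inputs are assumed to mirror the condition statuses (`"True"`, `"False"`, `"Unknown"`) and the GPU capacity and allocatable counts that Kubernetes reports for a node.

```python
# Pressure condition types from the Degraded table above.
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")

# Assumed ordering: the most severe finding determines the node status.
SEVERITY = {"Healthy": 0, "Degraded": 1, "Unhealthy": 2}


def node_findings(conditions, gpu_capacity, gpu_allocatable):
    """Return (status, code) pairs for the node-specific conditions.

    conditions: dict mapping a node condition type to its status string.
    gpu_capacity / gpu_allocatable: GPU counts reported for the node.
    """
    findings = []
    if conditions.get("Ready") != "True":
        findings.append(("Unhealthy", "Node Not Ready"))
    for name in PRESSURE_CONDITIONS:
        if conditions.get(name) == "True":
            findings.append(("Degraded", name))
    if gpu_capacity == 0:
        # All GPUs on the node are unavailable.
        findings.append(("Unhealthy", "GPU unavailable"))
    elif gpu_capacity > gpu_allocatable:
        # Some, but not all, GPUs are unavailable.
        findings.append(("Degraded", "GPU mis-match"))
    return findings


def worst_status(findings):
    """Pick the most severe status, or Healthy when nothing was found."""
    return max((status for status, _ in findings),
               key=SEVERITY.__getitem__, default="Healthy")


if __name__ == "__main__":
    found = node_findings({"Ready": "True", "DiskPressure": "True"}, 8, 7)
    print(found, worst_status(found))
```

XID codes, lifecycle events, and loss of agent telemetry would be evaluated in addition to these checks; they are left out here to keep the sketch focused on the rows that apply only to CMK nodes.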
Bug Report Generation Error Messages
When generating an NVIDIA bug report from the node detail panel, the following error messages may appear:
| Error Message | Condition |
|---|---|
| Bug report script unavailable | Script not found on the node |
| Bug report script execution timed out | Script subprocess timed out |
| Bug report script failed with return code: {code} | Script exited with non-zero return code |
| NVIDIA driver pod not found | NVIDIA driver pod not found on node |
| Error executing bug report script | Kubernetes API error during script execution |
| Bug report script returned no output | Script produced no expected output |
| Unexpected error downloading bug report | Failed to download log file from driver pod |
| Bug report generation timed out | Overall collection timed out |
| Bug report upload failed | Upload succeeded but result reporting failed |
| Internal Server Error | Unknown error during collection |
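If you handle these messages in tooling of your own, the table can be kept as a lookup. The sketch below is a hypothetical helper, not part of any Crusoe API. It assumes the messages arrive exactly as written above and that `{code}` is replaced by an integer return code.

```python
import re

# Message-to-condition pairs copied from the table above.
BUG_REPORT_ERRORS = {
    "Bug report script unavailable": "Script not found on the node",
    "Bug report script execution timed out": "Script subprocess timed out",
    "NVIDIA driver pod not found": "NVIDIA driver pod not found on node",
    "Error executing bug report script": "Kubernetes API error during script execution",
    "Bug report script returned no output": "Script produced no expected output",
    "Unexpected error downloading bug report": "Failed to download log file from driver pod",
    "Bug report generation timed out": "Overall collection timed out",
    "Bug report upload failed": "Upload succeeded but result reporting failed",
    "Internal Server Error": "Unknown error during collection",
}

# Assumption: {code} is rendered as a (possibly negative) integer.
RETURN_CODE_RE = re.compile(r"^Bug report script failed with return code: (-?\d+)$")


def explain_bug_report_error(message: str) -> str:
    """Map an error message to the condition listed in the table."""
    text = message.strip()
    match = RETURN_CODE_RE.match(text)
    if match:
        return f"Script exited with non-zero return code ({match.group(1)})"
    return BUG_REPORT_ERRORS.get(text, "Unrecognized message")


if __name__ == "__main__":
    print(explain_bug_report_error("Bug report script failed with return code: 1"))
```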
Relationship to AutoClusters
Health status combines resource lifecycle state with GPU telemetry (XID codes) and, for CMK nodes, resource pressure signals. A node may report as Ready in Kubernetes but still be marked Degraded or Unhealthy because of GPU errors or resource pressure.
Health status is distinct from AutoClusters remediation actions. A Degraded node may be in detect-only mode, in which case it does not trigger automatic replacement. When AutoClusters is enabled, remediation events appear in Notifications.
What's Next
- Infrastructure and Topology Overview — See health overlays on the fleet overview and cluster topology view
- Notifications — Get notified about GPU XID errors and hardware failures
- AutoClusters — Automated hardware failure detection and node replacement