Instance Health

Note: Instance Health is in preview for both CMK clusters and VMs. It is not available for multi-tenant instances (L40s, A100).

Command Center displays a health status for each CMK node and standalone VM, giving you immediate visibility into infrastructure issues without manual log inspection. Health status is computed from telemetry signals collected by the Crusoe Watch Agent and refreshes every 60 seconds.

Health status is visible on the Infrastructure Overview pages, Topology View for CMK clusters, and VM detail pages.

Health Status Categories

There are four health status categories: Healthy, Degraded, Unhealthy, and Not Evaluated. For the conditions that determine each status, see Error Codes by Status below.

Error Codes by Status

Conditions are listed separately for virtual machines and CMK clusters.

Virtual Machines

Healthy

A VM is Healthy when it is in a Running state with no active error conditions and agent telemetry is flowing normally.

Condition
VM is in Running state
No active XID errors, no disruptive lifecycle events, agent telemetry is flowing

Degraded Conditions

The following conditions result in a Degraded status:

Error Code	Description
XID 119	GPU System Processor (GSP) not responding to driver RPC requests
XID 120	Driver failed to recover from GSP communication timeout
XID 140	Unrecovered ECC error
XID 143	GPU initialization failure

Unhealthy Conditions

The following conditions result in an Unhealthy status:

Error Code or Condition	Description
VM not in Running state	VM intended state is Running but current state is not
XID 32	Invalid or corrupted push buffer stream
XID 48	Uncorrectable double-bit ECC memory error
XID 64	GPU failed to record a memory error recovery action
XID 74	NVLink interconnect error
XID 79	GPU has fallen off the PCIe bus
XID 95	Uncontained ECC error
GPUFellOffTheBus	GPU lost from PCIe bus (lifecycle event)
HCAFellOffTheBus	Host Channel Adapter (InfiniBand) lost (lifecycle event)
Loss of agent telemetry	Crusoe Watch Agent stopped reporting

Not Evaluated

A VM is Not Evaluated when it is in a transitional state (being provisioned or deleted), or when the Crusoe Watch Agent is not installed.
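
The VM rules above can be sketched as a small classifier. This is only an illustration of the documented conditions, not Command Center's actual implementation: the `VMSnapshot` record and its field names are hypothetical, the VM's intended state is assumed to be Running, and the precedence (Unhealthy outranks Degraded when both kinds of condition are active) is an assumption the source does not state.

```python
from dataclasses import dataclass, field

# XID codes and lifecycle events taken from the tables above.
DEGRADED_XIDS = {119, 120, 140, 143}
UNHEALTHY_XIDS = {32, 48, 64, 74, 79, 95}
UNHEALTHY_EVENTS = {"GPUFellOffTheBus", "HCAFellOffTheBus"}


@dataclass
class VMSnapshot:
    """Hypothetical input record; field names are illustrative only."""
    current_state: str                 # e.g. "Running"; intended state assumed Running
    agent_installed: bool = True
    agent_reporting: bool = True
    in_transition: bool = False        # being provisioned or deleted
    active_xids: set = field(default_factory=set)
    lifecycle_events: set = field(default_factory=set)


def classify_vm(vm: VMSnapshot) -> str:
    # Not Evaluated: transitional state, or no Crusoe Watch Agent installed.
    if vm.in_transition or not vm.agent_installed:
        return "Not Evaluated"
    # Unhealthy: VM should be Running but is not, or telemetry was lost.
    if vm.current_state != "Running" or not vm.agent_reporting:
        return "Unhealthy"
    # Unhealthy: any listed XID or disruptive lifecycle event.
    if vm.active_xids & UNHEALTHY_XIDS or vm.lifecycle_events & UNHEALTHY_EVENTS:
        return "Unhealthy"
    # Degraded: only the less severe XIDs are active (assumed precedence).
    if vm.active_xids & DEGRADED_XIDS:
        return "Degraded"
    return "Healthy"
```

For example, under these assumptions a running VM with only XID 119 active classifies as Degraded, while the same VM with XID 79 also active classifies as Unhealthy.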

CMK Clusters

Healthy

A CMK node is Healthy when it is in a Ready state with no active error conditions and agent telemetry is flowing normally.

Condition
CMK node is in Ready state
No active XID errors, no disruptive lifecycle events, agent telemetry is flowing

Degraded Conditions

The following conditions result in a Degraded status:

Error Code or Condition	Description
MemoryPressure	Host memory exhaustion detected
DiskPressure	Low disk space detected
PIDPressure	Process table nearing exhaustion
XID 119	GPU System Processor (GSP) not responding to driver RPC requests
XID 120	Driver failed to recover from GSP communication timeout
XID 140	Unrecovered ECC error
XID 143	GPU initialization failure
GPU mismatch	GPU capacity exceeds allocatable count, indicating partial GPU failure

Unhealthy Conditions

The following conditions result in an Unhealthy status:

Error Code or Condition	Description
Node Not Ready	Kubernetes node Ready condition is not True
XID 32	Invalid or corrupted push buffer stream
XID 48	Uncorrectable double-bit ECC memory error
XID 64	GPU failed to record a memory error recovery action
XID 74	NVLink interconnect error
XID 79	GPU has fallen off the PCIe bus
XID 95	Uncontained ECC error
GPU unavailable	GPU capacity is zero — all GPUs on the node are unavailable
GPUFellOffTheBus	GPU lost from PCIe bus (lifecycle event)
HCAFellOffTheBus	Host Channel Adapter (InfiniBand) lost (lifecycle event)
Loss of agent telemetry	Crusoe Watch Agent stopped reporting

Not Evaluated

A CMK node is Not Evaluated when it is in a transitional state (being provisioned or deleted), or when the Crusoe Watch Agent is not installed.
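
The node-level checks that are specific to CMK (the Ready condition, the pressure conditions, and the GPU capacity comparison) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and its inputs are hypothetical, condition statuses use the Kubernetes string values "True", "False", and "Unknown", the function is assumed to be called only for GPU node types, and XID, lifecycle-event, and telemetry checks are left out for brevity. Treating Unhealthy as outranking Degraded is also an assumption.

```python
# Kubernetes node pressure conditions listed in the Degraded table above.
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")


def classify_node_resources(conditions, gpu_capacity, gpu_allocatable):
    """Return (status, reasons) from node conditions and GPU counts.

    conditions: dict mapping condition name to its status string,
                e.g. {"Ready": "True", "DiskPressure": "False"}.
    gpu_capacity / gpu_allocatable: integer GPU counts for the node
                (hypothetical inputs; how they are obtained is not shown).
    """
    unhealthy, degraded = [], []

    # Unhealthy: the node's Ready condition is anything other than "True".
    if conditions.get("Ready") != "True":
        unhealthy.append("Node Not Ready")

    # Unhealthy if no GPUs remain; Degraded if only some are missing.
    if gpu_capacity == 0:
        unhealthy.append("GPU unavailable")
    elif gpu_capacity > gpu_allocatable:
        degraded.append("GPU mismatch")

    # Degraded: any active host pressure condition.
    for name in PRESSURE_CONDITIONS:
        if conditions.get(name) == "True":
            degraded.append(name)

    if unhealthy:
        return "Unhealthy", unhealthy
    if degraded:
        return "Degraded", degraded
    return "Healthy", []
```

Returning the list of reasons alongside the status keeps the sketch useful for debugging, since a single node can match several conditions at once.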

Bug Report Generation Error Messages

When generating an NVIDIA or AMD bug report from the node detail panel, the following error messages may appear:

Error Message	Condition
Bug report script unavailable	Script not found on the node
Bug report script execution timed out	Script subprocess timed out
Bug report script failed with return code: `{code}`	Script exited with non-zero return code
NVIDIA driver pod not found	NVIDIA driver pod not found on node
Error executing bug report script	Kubernetes API error during script execution
Bug report script returned no output	Script produced no expected output
Unexpected error downloading bug report	Failed to download log file from driver pod
Bug report generation timed out	Overall collection timed out
Bug report upload failed	Upload succeeded but result reporting failed
Internal Server Error	Unknown error during collection
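
If these messages need to be handled in tooling, for example to annotate a support ticket, a lookup like the one below is one way to map a message back to its documented condition. The helper name is hypothetical, and the sketch assumes the return-code message carries a plain integer in place of `{code}`, optionally wrapped in backticks. The exact wire format of these messages is not specified here.

```python
import re

# Message-to-condition pairs copied from the table above.
BUG_REPORT_CONDITIONS = {
    "Bug report script unavailable": "Script not found on the node",
    "Bug report script execution timed out": "Script subprocess timed out",
    "NVIDIA driver pod not found": "NVIDIA driver pod not found on node",
    "Error executing bug report script": "Kubernetes API error during script execution",
    "Bug report script returned no output": "Script produced no expected output",
    "Unexpected error downloading bug report": "Failed to download log file from driver pod",
    "Bug report generation timed out": "Overall collection timed out",
    "Bug report upload failed": "Upload succeeded but result reporting failed",
    "Internal Server Error": "Unknown error during collection",
}

# Assumed shape of the templated message: a (possibly negative) integer code.
_RETURN_CODE = re.compile(r"^Bug report script failed with return code: `?(-?\d+)`?$")


def explain_bug_report_error(message: str) -> str:
    """Return the documented condition for a bug report error message."""
    message = message.strip()
    match = _RETURN_CODE.match(message)
    if match:
        return f"Script exited with non-zero return code ({match.group(1)})"
    return BUG_REPORT_CONDITIONS.get(message, "Unrecognized message")
```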

Relationship to AutoClusters

Health status combines resource lifecycle state and GPU telemetry (XID codes). A node may report as Ready in Kubernetes but still be marked Degraded or Unhealthy because of GPU errors or resource pressure.

Health status is distinct from AutoClusters remediation actions. A Degraded node may be in detect-only mode without triggering automatic replacement. With AutoClusters enabled, remediation events appear in Notifications.

What's Next

Infrastructure and Topology Overview — See health overlays on the fleet overview and cluster topology view
Notifications — Get notified about GPU XID errors and hardware failures
AutoClusters — Automated hardware failure detection and node replacement
