
Instance Health

Note

Instance Health is in preview for both CMK clusters and VMs. It is not available for multi-tenant instances (L40s, A100).

Command Center displays a health status for each CMK node and standalone VM, giving you immediate visibility into infrastructure issues without manual log inspection. Health status is computed from telemetry signals collected by the Crusoe Watch Agent and refreshes every 60 seconds.

Health status is visible on the Infrastructure Overview pages, Topology View for CMK clusters, and VM detail pages.

Health Status Categories

There are four health status categories: Healthy, Degraded, Unhealthy, and Not Evaluated. For the conditions that determine each status, see Error Codes by Status below.

Error Codes by Status

Healthy

A VM is Healthy when it is in a Running state with no active error conditions and agent telemetry is flowing normally.

Condition
VM is in Running state
No active XID errors, no disruptive lifecycle events, agent telemetry is flowing

Degraded Conditions

The following conditions result in a Degraded status:

| Error Code | Description |
| --- | --- |
| XID 119 | GPU System Processor (GSP) not responding to driver RPC requests |
| XID 120 | Driver failed to recover from GSP communication timeout |
| XID 140 | Unrecovered ECC error |
| XID 143 | GPU initialization failure |

Unhealthy Conditions

The following conditions result in an Unhealthy status:

| Error Code | Description |
| --- | --- |
| VM not in Running state | VM intended state is Running but current state is not |
| XID 32 | Invalid or corrupted push buffer stream |
| XID 48 | Uncorrectable double-bit ECC memory error |
| XID 64 | GPU failed to record a memory error recovery action |
| XID 74 | NVLink interconnect error |
| XID 79 | GPU has fallen off the PCIe bus |
| XID 95 | Uncontained ECC error |
| GPUFellOffTheBus | GPU lost from PCIe bus (lifecycle event) |
| HCAFellOffTheBus | Host Channel Adapter (InfiniBand) lost (lifecycle event) |
| Loss of agent telemetry | Crusoe Watch Agent stopped reporting |

Not Evaluated

A VM is Not Evaluated when it is in a transitional state (being provisioned or deleted), or when the Crusoe Watch Agent is not installed.
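The rules in this section can be read as a small decision procedure. The sketch below is illustrative only and is not Crusoe's implementation: the `VMSnapshot` record, its field names, and the transitional state names `"Provisioning"` and `"Deleting"` are hypothetical. It also makes three assumptions the page does not state: Unhealthy conditions take precedence over Degraded ones, a VM whose intended state is not Running is treated as Not Evaluated, and XID codes that appear in neither table are ignored.

```python
from dataclasses import dataclass, field

# XID codes and lifecycle events copied from the Degraded and Unhealthy tables.
DEGRADED_XIDS = frozenset({119, 120, 140, 143})
UNHEALTHY_XIDS = frozenset({32, 48, 64, 74, 79, 95})
UNHEALTHY_EVENTS = frozenset({"GPUFellOffTheBus", "HCAFellOffTheBus"})

# Hypothetical names for the transitional states (being provisioned or deleted).
TRANSITIONAL_STATES = frozenset({"Provisioning", "Deleting"})


@dataclass
class VMSnapshot:
    """Hypothetical input record; field names are illustrative only."""
    intended_state: str
    current_state: str
    agent_installed: bool = True
    telemetry_flowing: bool = True
    active_xids: set = field(default_factory=set)
    lifecycle_events: set = field(default_factory=set)


def classify(vm: VMSnapshot) -> str:
    """Return one of the four health status categories for a VM snapshot."""
    # Not Evaluated: transitional state, or the agent is not installed.
    if vm.current_state in TRANSITIONAL_STATES or not vm.agent_installed:
        return "Not Evaluated"

    if vm.current_state != "Running":
        # Unhealthy only when the VM is supposed to be running.
        if vm.intended_state == "Running":
            return "Unhealthy"
        # Assumption: an intentionally non-running VM is not evaluated.
        return "Not Evaluated"

    # Unhealthy: lost telemetry, or any Unhealthy XID or lifecycle event.
    if not vm.telemetry_flowing:
        return "Unhealthy"
    if vm.active_xids & UNHEALTHY_XIDS or vm.lifecycle_events & UNHEALTHY_EVENTS:
        return "Unhealthy"

    # Degraded: checked after Unhealthy (assumed precedence).
    if vm.active_xids & DEGRADED_XIDS:
        return "Degraded"

    return "Healthy"
```

For example, a running VM reporting both XID 119 and XID 79 is classified as Unhealthy under the assumed precedence, because XID 79 appears in the Unhealthy table.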

Bug Report Generation Error Messages

When generating an NVIDIA bug report from the node detail panel, the following error messages may appear:

| Error Message | Condition |
| --- | --- |
| Bug report script unavailable | Script not found on the node |
| Bug report script execution timed out | Script subprocess timed out |
| Bug report script failed with return code: {code} | Script exited with non-zero return code |
| NVIDIA driver pod not found | NVIDIA driver pod not found on node |
| Error executing bug report script | Kubernetes API error during script execution |
| Bug report script returned no output | Script produced no expected output |
| Unexpected error downloading bug report | Failed to download log file from driver pod |
| Bug report generation timed out | Overall collection timed out |
| Bug report upload failed | Upload succeeded but result reporting failed |
| Internal Server Error | Unknown error during collection |
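If you collect these messages in your own tooling, a lookup keyed on the message text may help when triaging failures. The following is a minimal sketch, not part of any Crusoe API: the `explain_bug_report_error` helper is hypothetical, and it assumes that `{code}` is rendered as an integer and that messages otherwise match the table text exactly.

```python
import re

# Fixed messages and their conditions, copied from the table above.
BUG_REPORT_ERRORS = {
    "Bug report script unavailable": "Script not found on the node",
    "Bug report script execution timed out": "Script subprocess timed out",
    "NVIDIA driver pod not found": "NVIDIA driver pod not found on node",
    "Error executing bug report script": "Kubernetes API error during script execution",
    "Bug report script returned no output": "Script produced no expected output",
    "Unexpected error downloading bug report": "Failed to download log file from driver pod",
    "Bug report generation timed out": "Overall collection timed out",
    "Bug report upload failed": "Upload succeeded but result reporting failed",
    "Internal Server Error": "Unknown error during collection",
}

# The one templated message; assumes {code} is an (optionally negative) integer.
RETURN_CODE_RE = re.compile(r"^Bug report script failed with return code: (-?\d+)$")


def explain_bug_report_error(message: str) -> str:
    """Map an error message to its documented condition (hypothetical helper)."""
    message = message.strip()
    match = RETURN_CODE_RE.match(message)
    if match:
        return f"Script exited with non-zero return code ({match.group(1)})"
    return BUG_REPORT_ERRORS.get(message, "Unrecognized message")
```

Calling `explain_bug_report_error("Bug report script failed with return code: 2")` returns the documented condition with the parsed code appended. Any message not listed in the table falls through to `"Unrecognized message"`.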

Relationship to AutoClusters

Health status combines resource lifecycle state and GPU telemetry (XID codes). A node may report as Running in Kubernetes but be marked Degraded or Unhealthy based on GPU errors or resource pressure.

Health status is distinct from AutoClusters remediation actions. A Degraded node may be in detect-only mode without triggering automatic replacement. With AutoClusters enabled, remediation events appear in Notifications.
