Active Health Checks

Active Health Checks runs automated GPU hardware checks on idle nodes in your CMK cluster. By periodically exercising GPU hardware while nodes are not serving workloads, Active Health Checks detects issues, such as memory errors and interconnect degradation, before they impact your jobs. This helps maintain the health and reliability of your GPU infrastructure at scale.

Enabling Active Health Checks

Active Health Checks is an opt-in add-on that can be selected during cluster creation.

Limited Availability

Active Health Checks is currently in limited availability. To enable it for your account, contact Crusoe Cloud Support.

  • New clusters: Select the Active Health Checks add-on during cluster creation through the UI, or via the --add-ons flag in the CLI.
  • Existing clusters: Contact Crusoe Cloud Support to have Active Health Checks enabled on your cluster.

How It Works

Active Health Checks is supported on all GPU instance types except B200, which is excluded because of a known bug that has not yet been resolved. When Active Health Checks is enabled, automated test workloads are scheduled periodically on idle nodes in your cluster. These tests exercise key aspects of GPU hardware, including:

  • GPU memory and compute — verifies that GPU memory and core compute operations are functioning correctly.
  • Interconnect bandwidth — on NVLink-equipped instances, measures GPU-to-GPU communication bandwidth to ensure interconnect performance meets expected baselines.

Test workloads run with the lowest Kubernetes priority. This means they are always the first to be evicted when a real workload needs resources — the Kubernetes scheduler will immediately preempt any running test to make room for your jobs.

Impact on Your Workloads

Active Health Checks is designed to have zero impact on your workloads:

  • Tests only run on idle nodes and do not consume GPU resources alongside your workloads.
  • If you schedule a workload on a node where a test is running, the test is immediately preempted by the Kubernetes scheduler. Your workload takes priority.
  • No action is required from you. Tests run automatically in the background.

Identifying Test Workloads

If you inspect your cluster and notice unfamiliar workloads, you can identify Active Health Checks workloads by the following attributes:

  • They run in the crusoe-system namespace.
  • They run at the lowest Kubernetes priority.
  • They are labeled with app: autoclusters.
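As a minimal sketch, the namespace and label checks above can be applied programmatically. The function below assumes pod data in the shape produced by `kubectl get pods --all-namespaces -o json`; the function name and the sample pods are hypothetical and are not part of any Crusoe tooling. The priority check is left out because the exact priority value used by the tests is not stated here.

```python
def find_health_check_pods(pod_list):
    """Return (namespace, name) pairs for pods matching the identifiers above."""
    matches = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}  # labels may be absent or null
        in_namespace = meta.get("namespace") == "crusoe-system"
        has_label = labels.get("app") == "autoclusters"
        if in_namespace and has_label:
            matches.append((meta["namespace"], meta.get("name")))
    return matches


# Hypothetical sample data; the pod names are invented for illustration.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"namespace": "crusoe-system", "name": "health-check-abc",
                      "labels": {"app": "autoclusters"}}},
        {"metadata": {"namespace": "crusoe-system", "name": "other-agent",
                      "labels": {"app": "something-else"}}},
        {"metadata": {"namespace": "default", "name": "training-job-0",
                      "labels": {"app": "autoclusters"}}},
        {"metadata": {"namespace": "default", "name": "no-labels"}},
    ]
}

if __name__ == "__main__":
    for namespace, name in find_health_check_pods(SAMPLE_PODS):
        print(f"{namespace}/{name}")
```

For a quick interactive check, `kubectl get pods -n crusoe-system -l app=autoclusters` applies the same namespace and label filter directly.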

Custom Schedulers

Danger

Do not enable Active Health Checks if you are running a custom scheduler — such as Volcano or KAI Scheduler — that does not respect standard Kubernetes PriorityClasses and preemption. Custom schedulers that bypass the default Kubernetes scheduling logic may not correctly preempt test workloads, which could lead to resource conflicts.

If you are unsure whether your scheduler is compatible, contact Crusoe Cloud Support before enabling Active Health Checks.
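One way to start that assessment is to find out whether any pods in the cluster request a non-default scheduler. The sketch below reads the standard `spec.schedulerName` pod field from data shaped like `kubectl get pods --all-namespaces -o json`. The function names and sample data are hypothetical. Finding a custom scheduler this way does not say whether it honors PriorityClasses and preemption; it only tells you that the question needs an answer.

```python
from collections import Counter

# Kubernetes uses this scheduler name when a pod does not specify one.
DEFAULT_SCHEDULER = "default-scheduler"


def scheduler_usage(pod_list):
    """Count pods per scheduler name."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        name = pod.get("spec", {}).get("schedulerName") or DEFAULT_SCHEDULER
        counts[name] += 1
    return counts


def non_default_schedulers(pod_list):
    """Return the sorted names of schedulers other than the default one."""
    return sorted(n for n in scheduler_usage(pod_list) if n != DEFAULT_SCHEDULER)


# Hypothetical sample data for illustration only.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "web-0"}, "spec": {"schedulerName": "default-scheduler"}},
        {"metadata": {"name": "train-0"}, "spec": {"schedulerName": "volcano"}},
        {"metadata": {"name": "train-1"}, "spec": {"schedulerName": "volcano"}},
        {"metadata": {"name": "batch-0"}, "spec": {}},  # no scheduler specified
    ]
}

if __name__ == "__main__":
    print(non_default_schedulers(SAMPLE_PODS))
```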

What Happens When a Test Fails

When a node fails an Active Health Checks test, the Crusoe team is automatically notified and investigates the failure. Crusoe Cloud Support may contact you if any action is needed on your side.

Additionally, Kubernetes node conditions are set on failed nodes, and you can use them in your scheduling logic:

  • GPUDCGMUnhealthy — set when GPU memory or compute diagnostics fail.
  • GPUNVBandwidthUnhealthy — set when interconnect bandwidth falls below expected thresholds.

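You can use these conditions in custom logic to avoid scheduling workloads on nodes that have not yet been investigated. Because nodeAffinity rules match node labels rather than node conditions, using nodeAffinity for this purpose requires first mirroring a condition onto a node label, for example with a controller or script of your own.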
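As an illustration of such custom logic, the sketch below scans node data shaped like `kubectl get nodes -o json` and reports which nodes carry either condition. It assumes that a condition `status` of `"True"` marks a failed check, which is the usual Kubernetes convention for conditions of this kind but is not confirmed here. The function name and the sample nodes are hypothetical.

```python
# Condition types named in this section.
UNHEALTHY_CONDITIONS = {"GPUDCGMUnhealthy", "GPUNVBandwidthUnhealthy"}


def nodes_failing_health_checks(node_list):
    """Map node name -> sorted list of failing health-check condition types."""
    failing = {}
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name")
        conditions = node.get("status", {}).get("conditions", [])
        hits = sorted(
            c["type"]
            for c in conditions
            # Assumption: status "True" means the check failed.
            if c.get("type") in UNHEALTHY_CONDITIONS and c.get("status") == "True"
        )
        if hits:
            failing[name] = hits
    return failing


# Hypothetical sample data; node names are invented for illustration.
SAMPLE_NODES = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"conditions": [
             {"type": "Ready", "status": "True"},
             {"type": "GPUDCGMUnhealthy", "status": "True"},
         ]}},
        {"metadata": {"name": "gpu-node-2"},
         "status": {"conditions": [
             {"type": "Ready", "status": "True"},
             {"type": "GPUNVBandwidthUnhealthy", "status": "False"},
         ]}},
        {"metadata": {"name": "gpu-node-3"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    ]
}

if __name__ == "__main__":
    for node_name, condition_types in nodes_failing_health_checks(SAMPLE_NODES).items():
        print(node_name, ", ".join(condition_types))
```

The resulting node names could then feed whatever mechanism you use to keep workloads away, such as cordoning the node or applying a label that your scheduling rules exclude.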