Automated Node Remediation with AutoClusters
AutoClusters improves the resilience of your Crusoe Managed Kubernetes (CMK) workloads by automatically detecting and resolving common hardware failures. By enabling AutoClusters, you can minimize downtime and reduce the need for manual intervention, which raises the effective utilization of your clusters.
This guide explains how AutoClusters works, how to enable it for your deployments, and what to expect during the automated remediation process.
Supported Versions and Hardware
AutoClusters is currently supported on the following minimum CMK versions:
- 1.33.4-cmk.20
- 1.32.7-cmk.22
- 1.31.7-cmk.24
All patch versions of Kubernetes minor versions later than 1.33 are supported. Future releases within the Kubernetes minor versions listed above (e.g., 1.33.5-cmk.X, 1.32.8-cmk.X) are also supported.
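The version rule can be expressed as a small check. This is a minimal sketch, not Crusoe tooling: the function name `is_supported` is hypothetical, and it assumes version strings of the form `1.MINOR.PATCH-cmk.BUILD`. It also assumes that, within a listed minor version, a release is supported when its (patch, build) pair is at or above the listed minimum (1.33.4-cmk.20, 1.32.7-cmk.22, 1.31.7-cmk.24).

```python
import re

# Minimum supported (patch, cmk build) per Kubernetes 1.x minor version,
# taken from the minimum CMK versions listed in this guide.
MINIMUMS = {33: (4, 20), 32: (7, 22), 31: (7, 24)}

# Assumed version format: 1.MINOR.PATCH-cmk.BUILD
_VERSION_RE = re.compile(r"^1\.(\d+)\.(\d+)-cmk\.(\d+)$")


def is_supported(version: str) -> bool:
    """Return True if a CMK version string meets the documented minimums."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized CMK version: {version!r}")
    minor, patch, build = (int(group) for group in match.groups())
    if minor > 33:
        return True  # every patch version of a later minor is supported
    if minor not in MINIMUMS:
        return False  # older than the oldest listed minor (1.31)
    # Assumption: compare patch first, then the cmk build number.
    return (patch, build) >= MINIMUMS[minor]
```

For example, `is_supported("1.32.8-cmk.3")` is true under these assumptions, while `is_supported("1.31.7-cmk.23")` is false.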
AutoClusters can remediate issues for Kubernetes nodes running on the following Crusoe GPU instance types:
- 10x NVIDIA L40S (l40s-48gb.10x)
- 8x NVIDIA A100 80GB (a100-80gb.8x, a100-80gb-sxm-ib.8x)
- 8x NVIDIA H100 80GB (h100-80gb-sxm-ib.8x)
- 8x NVIDIA H200 141GB (h200-141gb-sxm-ib.8x)
- 8x NVIDIA B200 180GB (b200-180gb-sxm-ib-8x)
- 4x NVIDIA GB200 186GB (gb200-186gb-nvl-4x)
- 8x NVIDIA B300 288GB (b300-288gb-sxm-ib.8x)
AutoClusters does not support remediation on multi-tenant VM types where multiple Kubernetes nodes may be co-located on the same physical host. In those configurations, node replacement cannot be safely and deterministically executed, and AutoClusters will not trigger remediation.
How it Works
AutoClusters continuously monitors your infrastructure for hardware-related errors. When a critical issue is detected on a node, AutoClusters initiates a remediation process based on standard Kubernetes procedures.
The process involves:
- Graceful Termination: Your workloads are given time to shut down cleanly.
- Node Restart or Replacement: The unhealthy node is either restarted or removed from the cluster and replaced with a healthy one.
- Workload Rescheduling: Your pods are automatically rescheduled onto the new node.
This entire process is automated, allowing your workloads to recover from hardware failures without manual intervention.
Remediation only runs for the issue types you have turned on. When the AutoClusters addon is first enabled, every issue type defaults to OFF, so AutoClusters detects failures and sends notifications but does not replace any nodes until you enable remediations. See Enable Remediations.
Enabling AutoClusters
Step 1: Enable the AutoClusters Addon
AutoClusters is a Kubernetes add-on that must be enabled at the cluster level.
- For new clusters: You can enable the AutoClusters add-on during the cluster creation process through the UI by selecting the AutoClusters add-on, or through the CLI via the --add-ons flag.
- For existing clusters: Please contact our support team to have the AutoClusters add-on enabled for your existing cluster.
Enabling the add-on turns on hardware-failure detection, but all remediation actions default to OFF. AutoClusters takes no automatic action and replaces no nodes until you explicitly enable remediations in Step 2.
Step 2: Enable Remediations
Because every issue type defaults to OFF, you turn AutoClusters on by setting the issue types you want it to act on to REPLACE_NODE. This gives you control over exactly which hardware failures trigger automatic node replacement.
First, review your current configuration. This is especially important if AutoClusters was set up on this cluster previously — you may already have overrides in place, and checking first avoids unexpected changes:
crusoe kubernetes autoclusters config get --project-id YOUR_PROJECT_ID --cluster-id YOUR_CLUSTER_ID
This lists every supported issue type with its default action, any override you've set, and the resulting effective action. See Remediation Configuration for how to read this output.
Then enable remediations. To turn on automatic node replacement for all supported issue types in a single command:
crusoe kubernetes autoclusters config set-remediation-override \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--override GPU_FELL_OFF_THE_BUS=REPLACE_NODE \
--override HCA_FELL_OFF_THE_BUS=REPLACE_NODE \
--override HCA_POLLING=REPLACE_NODE \
--override NVSWITCH_FELL_OFF_THE_BUS=REPLACE_NODE \
--override PCI_LINK_DOWN=REPLACE_NODE \
--override XID_119=REPLACE_NODE \
--override XID_120=REPLACE_NODE \
--override XID_48=REPLACE_NODE \
--override XID_64=REPLACE_NODE \
--override XID_74=REPLACE_NODE \
--override XID_79=REPLACE_NODE
To enable only certain issue types, include just those --override flags. See Remediation Configuration for per-issue control and how to turn remediations back off.
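If you script this step, the repeated --override flags can be generated from a list of issue types. The following is a minimal sketch: the helper name `build_override_command` is hypothetical, and it only assembles the CLI invocation shown above without running it.

```python
def build_override_command(project_id, cluster_id, issue_types, action="REPLACE_NODE"):
    """Build the argv list for `set-remediation-override` (does not execute it)."""
    if action not in ("REPLACE_NODE", "OFF"):
        raise ValueError(f"unknown action: {action!r}")
    argv = [
        "crusoe", "kubernetes", "autoclusters", "config",
        "set-remediation-override",
        "--project-id", project_id,
        "--cluster-id", cluster_id,
    ]
    # One --override flag per issue type, in the order given.
    for issue in issue_types:
        argv += ["--override", f"{issue}={action}"]
    return argv
```

The returned list can be passed to `subprocess.run(argv, check=True)`. For instance, `build_override_command("YOUR_PROJECT_ID", "YOUR_CLUSTER_ID", ["XID_64", "XID_79"])` produces a command that enables only those two issue types.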
Once remediations are enabled, AutoClusters will drain and replace nodes when it detects the corresponding hardware failures. Implement preStop hooks for any workloads that require graceful shutdown, and use the opt-out label below for any workloads that should never trigger node replacement.
Opting Out of Automated Remediation (Optional)
Once you have enabled remediations, all nodes are eligible for automated node replacement. If you have specific workloads that should prevent automatic node replacement, you can opt them out by adding the autoclusters.crusoe.ai/remediationPolicy: "Disabled" label to the pod template of your resource (e.g., a Job or Deployment).
Here's an example for a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-deployment-name
spec:
  # ...
  template:
    metadata:
      labels:
        autoclusters.crusoe.ai/remediationPolicy: "Disabled"
    # ...
All pods created from this template will inherit the label, preventing automatic remediation on any node where these pods are running. If any pod on a node has this label set to "Disabled", AutoClusters will not remediate that node.
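To preview whether a node is currently opted out, you can apply the same rule to the pods scheduled on it. This is a sketch that mirrors the documented rule, not AutoClusters' own implementation. The function names are hypothetical, and it assumes the input is a Kubernetes PodList document, such as the JSON printed by `kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME -o json`. It also assumes the label value must match "Disabled" exactly.

```python
OPT_OUT_LABEL = "autoclusters.crusoe.ai/remediationPolicy"


def pod_labels(pod_list: dict) -> list:
    """Extract the label dict of every pod in a Kubernetes PodList document."""
    return [
        item.get("metadata", {}).get("labels") or {}
        for item in pod_list.get("items", [])
    ]


def node_is_opted_out(labels_per_pod) -> bool:
    """True if any pod on the node carries the opt-out label set to "Disabled"."""
    return any(
        labels.get(OPT_OUT_LABEL) == "Disabled" for labels in labels_per_pod
    )
```

A single labeled pod is enough to block remediation of the whole node, which is why the check uses `any`.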
Implementing preStop Hooks for Graceful Shutdown
As a Kubernetes best practice, you should implement a preStop hook in your containers to ensure graceful shutdown during pod termination. This is important not just for AutoClusters remediation, but for any scenario where pods are deleted or rescheduled (such as deployments, node maintenance, or resource constraints).
Without a preStop hook, remediation still moves your workloads to another node, but any in-progress work may be lost. The preStop hook runs before the container is sent its termination signal, giving your application time to save its state, checkpoint its work, or close any open connections.
Here's an example of a preStop hook that runs a shell script to save state:
spec:
  template:
    # ...
    spec:
      containers:
        - name: my-container
          image: my-image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/your-save-state-script.sh"]
      # Optionally set a grace period for your hook to complete. If you do not set this value, the default grace period of 30 seconds will apply.
      terminationGracePeriodSeconds: 120
      # ...
In this example, your-save-state-script.sh is a script you provide to handle the graceful shutdown of your application.
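A preStop command runs as a separate process inside the container, so it usually has to ask the application to checkpoint instead of saving state itself. One possible pattern is a file-based handshake. The sketch below is hypothetical: the marker-file paths, the function name, and the handshake protocol are assumptions, and your application would need to watch for the request file and create the done file once its checkpoint is written.

```python
import time
from pathlib import Path


def request_checkpoint_and_wait(request_path, done_path, timeout_s, poll_s=0.5):
    """Signal the application to checkpoint, then wait for its confirmation.

    Returns True if the done marker appeared before the timeout, else False.
    """
    # Create the request marker; the application is assumed to watch for it.
    Path(request_path).touch()
    deadline = time.monotonic() + timeout_s
    while not Path(done_path).exists():
        if time.monotonic() >= deadline:
            return False  # give up so the pod can still terminate
        time.sleep(poll_s)
    return True
```

Keep `timeout_s` below `terminationGracePeriodSeconds`, because the time spent in the preStop hook counts against the pod's termination grace period.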
Remediation Configuration
Remediation behavior is controlled per issue type. Use these commands to inspect the current configuration and to enable or disable automatic node replacement for individual issue types.
Viewing Current Configuration
To view your cluster's remediation configuration, including both default behavior and any overrides:
crusoe kubernetes autoclusters config get --project-id YOUR_PROJECT_ID --cluster-id YOUR_CLUSTER_ID
This shows all supported issue types (e.g., XID_64 for GPU errors), their default action, any override you've configured, and the resulting effective action. Because the default action for every issue type is OFF, an issue is only remediated automatically if you have set an override of REPLACE_NODE for it.
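The relationship between default, override, and effective action can be summarized in a few lines. This is an illustrative sketch of the rule described above, with hypothetical function names; it is not the CLI's output format or Crusoe's implementation.

```python
DEFAULT_ACTION = "OFF"  # documented default for every issue type


def effective_actions(issue_types, overrides):
    """Map each issue type to its override if one is set, else the default."""
    return {issue: overrides.get(issue, DEFAULT_ACTION) for issue in issue_types}


def auto_remediated(issue_types, overrides):
    """List the issue types that would trigger automatic node replacement."""
    actions = effective_actions(issue_types, overrides)
    return [issue for issue, action in actions.items() if action == "REPLACE_NODE"]
```

With overrides of `{"XID_64": "REPLACE_NODE"}`, only XID_64 is remediated automatically; every issue type without an override stays at OFF.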
Setting Remediation Actions
Each issue type has two possible actions:
- REPLACE_NODE: Automatically drain and replace the node when the issue is detected.
- OFF: Take no automatic action. You still receive notifications, but the node is not replaced. This is the default for every issue type until you enable remediations.
Setting an action creates an override for that issue type. To enable automatic node replacement for a specific issue type (for example, XID_64):
crusoe kubernetes autoclusters config set-remediation-override \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--override XID_64=REPLACE_NODE
To turn remediation back off for a specific issue type:
crusoe kubernetes autoclusters config set-remediation-override \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--override XID_64=OFF
To remove an override entirely and return the issue type to its default action (for example XID_64):
crusoe kubernetes autoclusters config remove-remediation-override \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--issue XID_64
To enable every supported issue type at once, see the one-shot command in Step 2: Enable Remediations.
Advanced Configuration and Extensibility
The advanced features described in this section are in Limited Availability. Ask Crusoe Support to enable them for your account.
Beyond enabling and disabling automatic remediation, you may want to implement your own remediation logic or integrate with external systems. AutoClusters extensibility allows you to trigger remediation manually via API when your custom logic determines it's appropriate — for example, to build an open-loop workflow that pairs AutoClusters detection with your own decision-making.
These features are currently available through the CLI and API. UI support will be added in future releases. This documentation covers CLI usage. See the API docs for API usage.
Manual Remediation
Manual remediation allows you to trigger node replacement on-demand via API, giving you complete control over when remediation occurs.
Eligibility Requirements
You can trigger manual remediation for a VM only when:
- AutoClusters detected a qualifying issue on the VM within the past 24 hours
- The VM is on a single-tenancy node
- The cluster has the AutoClusters add-on enabled
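If your own tooling decides when to call manual remediation, it can pre-check these three conditions before issuing the request. The sketch below is a client-side approximation under stated assumptions: the function name and parameters are hypothetical, the caller is expected to know when AutoClusters last reported a qualifying issue for the VM, and the service remains the authority on eligibility.

```python
from datetime import datetime, timedelta, timezone

ISSUE_WINDOW = timedelta(hours=24)


def is_eligible(last_issue_at, single_tenant, addon_enabled, now=None):
    """Return True if all three documented eligibility conditions hold.

    last_issue_at: timezone-aware datetime of the most recent qualifying
    issue detected on the VM, or None if there is none.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_issue_at is None:
        return False
    age = now - last_issue_at
    detected_recently = timedelta(0) <= age <= ISSUE_WINDOW
    return detected_recently and single_tenant and addon_enabled
```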
Triggering Manual Remediation
crusoe kubernetes autoclusters remediations replace-node \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--vm-id YOUR_VM_ID
Manual remediation events appear in your cluster's AutoClusters tab in the Crusoe Cloud Console.
Open Loop Workflow
You can implement fully custom remediation logic by combining remediation overrides with webhooks:
- Keep automatic remediation off for the issue types you want to handle yourself. OFF is the default, so no action is needed unless you previously enabled the issue type, in which case set it back to OFF:
  crusoe kubernetes autoclusters config set-remediation-override \
  --project-id YOUR_PROJECT_ID \
  --cluster-id YOUR_CLUSTER_ID \
  --override XID_64=OFF
- Configure a webhook to receive notifications when AutoClusters detects issues. See Configuring Webhook Notifications for setup instructions.
- Implement your custom logic to decide when and how to remediate. Your webhook endpoint receives issue detection events and can implement any decision logic you need (e.g., checking external systems, waiting for specific conditions, coordinating with schedulers).
- Call the remediation API when your logic determines it's appropriate:
  crusoe kubernetes autoclusters remediations replace-node \
  --project-id YOUR_PROJECT_ID \
  --cluster-id YOUR_CLUSTER_ID \
  --vm-id YOUR_VM_ID
This gives you complete control over the remediation process while still leveraging AutoClusters' detection capabilities and node replacement infrastructure.
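The decision step of such a workflow might look like the sketch below. The webhook payload format is not described in this guide, so the `vm_id` field name is a placeholder assumption, as are the function name and the `should_remediate` callback. Only the replace-node command itself comes from the documentation above.

```python
def plan_remediation(event, project_id, cluster_id, should_remediate):
    """Return the replace-node argv for an event, or None to take no action.

    event: dict parsed from a webhook notification (field names assumed).
    should_remediate: your own callable implementing the decision logic.
    """
    if not should_remediate(event):
        return None
    return [
        "crusoe", "kubernetes", "autoclusters", "remediations", "replace-node",
        "--project-id", project_id,
        "--cluster-id", cluster_id,
        "--vm-id", event["vm_id"],  # hypothetical payload field
    ]
```

Passing the decision logic in as a callable keeps the command construction separate from your policy, so the policy can consult schedulers or other external systems without changing this function.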
The Remediation Process
When AutoClusters detects a hardware failure that requires node replacement (or when the remediation API is manually triggered), it automatically performs the following steps:
- Verification: AutoClusters confirms that no pods on the node have the autoclusters.crusoe.ai/remediationPolicy: "Disabled" label. If any pod on the node has this label set to "Disabled", AutoClusters will not remediate the node.
- Node Drain: The Kubernetes node is cordoned and drained, which gracefully evicts all pods. Your container preStop hooks are executed at this stage.
- Node Replacement: The unhealthy node is removed from the node pool and replaced with a new, healthy node from spare capacity. The system first checks to see if there are any suitable spare nodes in the Crusoe on-demand pool. If none are found, the system proceeds to look for suitable spares in Crusoe's hot spare inventory. If no spares can be found, the Crusoe team is notified to take action. You will also be sent a notification via e-mail (or other notification channels you have set up) to take action as necessary.
- History Logging: A record of the alert and the remediation event is logged and made available in the Crusoe Cloud Console for your review.
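The spare-selection order in the Node Replacement step can be illustrated as follows. This is a simplified model for reading the steps above, not Crusoe's implementation: the function name is hypothetical, and each pool is represented as a plain list of already-suitable candidates.

```python
def pick_spare(on_demand_pool, hot_spares):
    """Choose a replacement node following the documented lookup order.

    Returns (source, node), or None when no spare is available, in which
    case the Crusoe team and the customer are notified.
    """
    if on_demand_pool:
        return ("on_demand", on_demand_pool[0])  # checked first
    if hot_spares:
        return ("hot_spare", hot_spares[0])  # fallback inventory
    return None
```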
Node Remediation History
You can view the history of remediation events for a cluster on the Remediations tab in the cluster detail view.
Edge Cases and Limitations
While AutoClusters is designed to handle failures automatically, there are some important things to be aware of:
- Remediation Failures: If the triggered remediation action fails for any reason (e.g. no spare nodes are available to replace the unhealthy one), the remediation process will be aborted, you will be notified, and our support team will be notified to assist.
- Excessive Remediations: If a high number of remediation actions are triggered in a short time period, circuit breakers will pause future remediations and our support team will step in to investigate and address underlying causes.
- Conservative Alert Triggers: AutoClusters uses a comprehensive but conservative set of hardware failure detection rules to avoid over-remediation. We only remediate for failures we know require node replacement, ensuring your workloads aren't disrupted unnecessarily.
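To make the circuit breaker described under Excessive Remediations more concrete, here is a generic sliding-window breaker. It is purely illustrative: the class name, the threshold, the window length, and the latch-until-reset behavior are all assumptions, since this guide does not specify how AutoClusters' circuit breakers are configured.

```python
from collections import deque


class RemediationCircuitBreaker:
    """Pause remediations after too many occur within a time window."""

    def __init__(self, max_events, window_s):
        self.max_events = max_events  # hypothetical threshold
        self.window_s = window_s      # hypothetical window, in seconds
        self.tripped = False
        self._events = deque()

    def allow(self, now):
        """Record a remediation at time `now` (seconds) if it is allowed."""
        if self.tripped:
            return False
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_s:
            self._events.popleft()
        if len(self._events) >= self.max_events:
            self.tripped = True  # stays paused until someone investigates
            return False
        self._events.append(now)
        return True

    def reset(self):
        """Resume remediations, for example after the cause is addressed."""
        self.tripped = False
        self._events.clear()
```

In this sketch the breaker stays open until `reset()` is called, modeling a pause that lasts until someone has investigated.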