Automated Node Remediation with AutoClusters
AutoClusters enhances the resilience of your CMK workloads by automatically detecting and resolving common hardware failures. By enabling AutoClusters, you can minimize downtime and reduce the need for manual intervention, ensuring higher effective utilization of your clusters.
This guide explains how AutoClusters works, how to enable it for your deployments, and what to expect during the automated remediation process.
Supported Versions and Hardware
AutoClusters is currently supported on the following minimum CMK versions:
- 1.33.4-cmk.20
- 1.32.7-cmk.22
- 1.31.7-cmk.24
Future minor and patch releases within these Kubernetes versions (e.g., 1.33.5-cmk.X, 1.32.8-cmk.X) are also supported.
AutoClusters can remediate issues for Kubernetes nodes running on the following Crusoe GPU instance types:
- 10x NVIDIA L40S (l40s-48gb.10x)
- 8x NVIDIA A100 80GB (a100-80gb.8x, a100-80gb-sxm-ib.8x)
- 8x NVIDIA H100 80GB (h100-80gb-sxm-ib.8x)
- 8x NVIDIA H200 141GB (h200-141gb-sxm-ib.8x)
- 8x NVIDIA B200 180GB (b200-180gb-sxm-ib-8x)
- 4x NVIDIA GB200 186GB (gb200-186gb-nvl-4x)
AutoClusters does not support remediation on multi-tenant VM types where multiple Kubernetes nodes may be co-located on the same physical host. In those configurations, node replacement cannot be safely and deterministically executed, and AutoClusters will not trigger remediation.
How it Works
AutoClusters continuously monitors your infrastructure for hardware-related errors. When a critical issue is detected on a node, AutoClusters initiates a remediation process based on standard Kubernetes procedures.
The process involves:
- Graceful Termination: Your workloads are given time to shut down cleanly.
- Node Restart or Replacement: The unhealthy node is either restarted or removed from the cluster and replaced with a healthy one.
- Workload Rescheduling: Your pods are automatically rescheduled onto the new node.
This entire process is automated, allowing your workloads to recover from hardware failures without manual intervention.
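The drain-and-reschedule portion of this flow mirrors standard Kubernetes node maintenance. As a rough illustration, the manual equivalent of what AutoClusters performs for you would look like this (the node name is hypothetical):

```shell
# Manual equivalent of the remediation flow (illustrative only --
# AutoClusters runs these steps automatically). Node name is hypothetical.

# 1. Cordon the node so no new pods are scheduled onto it
kubectl cordon gpu-node-1

# 2. Drain the node, gracefully evicting pods (their preStop hooks run here)
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data

# 3. After the node is replaced, controllers (Deployments, Jobs, etc.)
#    reschedule the evicted pods onto the new healthy node automatically.
```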
Enabling AutoClusters
Prerequisite: Enable the AutoClusters Add-on
AutoClusters is a Kubernetes add-on that must be enabled at the cluster level.
- For new clusters: You can enable the AutoClusters add-on during the cluster creation process, either through the UI by selecting the AutoClusters add-on or through the CLI via the --add-ons flag.
- For existing clusters: Please contact our support team to have the AutoClusters add-on enabled for your existing cluster.
Once the add-on is enabled for your cluster, all nodes with hardware failures will be automatically remediated unless you opt out specific workloads. It is recommended that you implement preStop hooks for any workloads that require graceful shutdown.
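As a hypothetical sketch of the CLI path: the exact cluster-creation subcommand and its other arguments may differ in your CLI version, and only the --add-ons flag is taken from this guide.

```shell
# Hypothetical example -- verify the exact create subcommand and
# arguments against your CLI version; only --add-ons is documented here.
crusoe kubernetes clusters create \
  --name my-cluster \
  --add-ons autoclusters
```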
Opting Out of Automated Remediation (Optional)
By default, all nodes are eligible for automated remediation. If you have specific workloads that should prevent automatic node replacement, you can opt them out by adding the autoclusters.crusoe.ai/remediationPolicy: "Disabled" label to the pod template of your resource (e.g., a Job or Deployment).
Here's an example for a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-deployment-name
spec:
  # ...
  template:
    metadata:
      labels:
        autoclusters.crusoe.ai/remediationPolicy: "Disabled"
    # ...
All pods created from this template inherit the label. If any pod on a node has this label set to "Disabled", AutoClusters will not remediate that node.
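To audit which workloads are currently opted out, you can query the label directly with kubectl:

```shell
# List all pods (across namespaces) that carry the opt-out label
kubectl get pods --all-namespaces \
  -l autoclusters.crusoe.ai/remediationPolicy=Disabled

# List the nodes those pods are running on -- these nodes will not be
# remediated while the labeled pods remain scheduled there
kubectl get pods --all-namespaces \
  -l autoclusters.crusoe.ai/remediationPolicy=Disabled \
  -o custom-columns=NODE:.spec.nodeName --no-headers | sort -u
```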
Implementing preStop Hooks for Graceful Shutdown
As a Kubernetes best practice, you should implement a preStop hook in your containers to ensure graceful shutdown during pod termination. This is important not just for AutoClusters remediation, but for any scenario where pods are deleted or rescheduled (such as deployments, node maintenance, or resource constraints).
Without a preStop hook, when remediation occurs, your workloads will be moved to another node but any in-progress work may be lost. The preStop hook is executed before the pod is terminated, giving your application time to save its state, checkpoint its work, or close any open connections.
Here's an example of a preStop hook that runs a shell script to save state:
spec:
  template:
    # ...
    spec:
      containers:
        - name: my-container
          image: my-image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/your-save-state-script.sh"]
      # Optionally set a grace period for your hook to complete. If you do not
      # set this value, the default grace period of 30 seconds will apply.
      terminationGracePeriodSeconds: 120
      # ...
In this example, your-save-state-script.sh is a script you provide to handle the graceful shutdown of your application.
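The script itself is entirely application-specific. As a minimal sketch of what such a script might do, assuming a placeholder checkpoint directory (the paths and the "marker file" approach are illustrative, not part of AutoClusters):

```shell
#!/bin/sh
# Minimal save-state sketch: persist a clean-shutdown marker before the
# pod is terminated. All paths here are placeholders; real scripts would
# checkpoint application state instead.
set -eu

CHECKPOINT_DIR="${CHECKPOINT_DIR:-/tmp/checkpoints}"
mkdir -p "$CHECKPOINT_DIR"

# Record a UTC timestamp so the next replica can detect that the
# previous instance exited cleanly.
date -u +%Y-%m-%dT%H:%M:%SZ > "$CHECKPOINT_DIR/last_clean_shutdown"

# Application-specific work goes here, e.g. signaling the main process
# to checkpoint and waiting for it to finish.
echo "state saved to $CHECKPOINT_DIR"
```

Keep the script's runtime comfortably below terminationGracePeriodSeconds; if the hook is still running when the grace period expires, the container is killed anyway.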
Advanced Configuration and Extensibility
The advanced features described in this section are in Limited Availability. Ask Crusoe Support to enable them for your account.
By default, AutoClusters automatically detects and remediates hardware failures. However, you may want to implement your own remediation logic or integrate with external systems. AutoClusters extensibility allows you to:
- Override automatic remediation behavior for specific issue types
- Trigger remediation manually via API when your custom logic determines it's appropriate
These features are currently available through the CLI and API. UI support will be added in future releases. This documentation covers CLI usage. See the API docs for API usage.
Remediation Configuration
Viewing Current Configuration
To view your cluster's remediation configuration, including both default behavior and any overrides:
crusoe kubernetes autoclusters config get --project-id YOUR_PROJECT_ID --cluster-id YOUR_CLUSTER_ID
This shows all supported issue types (e.g., XID_64 for GPU errors), their default actions, and any overrides you've configured.
Overriding Remediation Behavior
You can override the default remediation action for specific issue types:
Available Actions:
- REPLACE_NODE: Automatically drain and replace the node when the issue is detected (default for most hardware issues)
- OFF: Disable automatic remediation (you'll receive notifications but no automatic action)
To disable automatic remediation for a specific issue type:
crusoe kubernetes autoclusters config set-remediation-override \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--override XID_64=OFF
To remove an override and return to default behavior (for example XID_64):
crusoe kubernetes autoclusters config remove-remediation-override \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--issue XID_64
Manual Remediation
Manual remediation allows you to trigger node replacement on-demand via API, giving you complete control over when remediation occurs.
Eligibility Requirements
You can trigger manual remediation for a VM only when:
- AutoClusters detected a qualifying issue on the VM within the past 24 hours
- The VM is on a single-tenancy node
- The cluster has the AutoClusters add-on enabled
Triggering Manual Remediation
crusoe kubernetes autoclusters remediations replace-node \
--project-id YOUR_PROJECT_ID \
--cluster-id YOUR_CLUSTER_ID \
--vm-id YOUR_VM_ID
Manual remediation events appear in your cluster's AutoClusters tab in the Crusoe Cloud Console.
Open Loop Workflow
You can implement fully custom remediation logic by combining remediation overrides with webhooks:
1. Disable automatic remediation for the issue types you want to handle manually:

crusoe kubernetes autoclusters config set-remediation-override \
  --project-id YOUR_PROJECT_ID \
  --cluster-id YOUR_CLUSTER_ID \
  --override XID_64=OFF

2. Configure a webhook to receive notifications when AutoClusters detects issues. See Configuring Webhook Notifications for setup instructions.

3. Implement your custom logic to decide when and how to remediate. Your webhook endpoint receives issue detection events and can implement any decision logic you need (e.g., checking external systems, waiting for specific conditions, coordinating with schedulers).

4. Call the remediation API when your logic determines it's appropriate:

crusoe kubernetes autoclusters remediations replace-node \
  --project-id YOUR_PROJECT_ID \
  --cluster-id YOUR_CLUSTER_ID \
  --vm-id YOUR_VM_ID
This gives you complete control over the remediation process while still leveraging AutoClusters' detection capabilities and node replacement infrastructure.
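As a minimal sketch of the decision step, assuming your webhook receiver hands each event's issue type and VM ID to a shell helper (the event fields, the idle check, and the helper names are all hypothetical; the CLI call matches the command shown in this guide):

```shell
# Hypothetical decision helpers for a webhook handler. The "idle"
# signal stands in for your own readiness check.
should_remediate() {
  # $1 = issue type from the event, $2 = "idle" when the node has no
  # business-critical work left (determined by your own systems)
  case "$1" in
    XID_64) [ "$2" = "idle" ] ;;
    *) return 1 ;;  # issue types we did not opt out of auto-remediation
  esac
}

remediate() {
  # Same CLI command documented in this guide; $1 = VM ID from the event
  crusoe kubernetes autoclusters remediations replace-node \
    --project-id "$PROJECT_ID" \
    --cluster-id "$CLUSTER_ID" \
    --vm-id "$1"
}

# Example wiring inside your webhook handler:
if should_remediate "XID_64" "idle"; then
  echo "would remediate"
fi
```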
The Remediation Process
When AutoClusters detects a hardware failure that requires node replacement (or when the remediation API is manually triggered), it automatically performs the following steps:
- Verification: AutoClusters confirms that no pods on the node have the autoclusters.crusoe.ai/remediationPolicy: "Disabled" label. If any pod on the node has this label set to "Disabled", AutoClusters will not remediate the node.
- Node Drain: The Kubernetes node is cordoned and drained, which gracefully evicts all pods. Your container preStop hooks are executed at this stage.
- Node Replacement: The unhealthy node is removed from the node pool and replaced with a new, healthy node from spare capacity. The system first checks for suitable spare nodes in the Crusoe on-demand pool. If none are found, it looks for suitable spares in Crusoe's hot spare inventory. If no spares can be found, the Crusoe team is notified to take action, and you will also be notified via e-mail (or other notification channels you have set up) so you can take action as necessary.
- History Logging: A record of the alert and the remediation event is logged and made available in the Crusoe Cloud Console for your review.
Node Remediation History
You can view the history of remediation events for a cluster on the Remediations tab in the cluster detail view.
Edge Cases and Limitations
While AutoClusters is designed to handle failures automatically, there are some important things to be aware of:
- Remediation Failures: If the triggered remediation action fails for any reason (e.g. no spare nodes are available to replace the unhealthy one), the remediation process will be aborted, you will be notified, and our support team will be notified to assist.
- Excessive Remediations: If a high number of remediation actions are triggered in a short time period, circuit breakers will pause future remediations and our support team will step in to investigate and address underlying causes.
- Conservative Alert Triggers: AutoClusters uses a comprehensive but conservative set of hardware failure detection rules to avoid over-remediation. We only remediate for failures we know require node replacement, ensuring your workloads aren't disrupted unnecessarily.