Automated Node Remediation with AutoClusters
AutoClusters enhances the resilience of your CMK workloads by automatically detecting and resolving common hardware failures. By enabling AutoClusters, you can minimize downtime and reduce the need for manual intervention, ensuring higher effective utilization of your clusters.
This guide explains how AutoClusters works, how to enable it for your deployments, and what to expect during the automated remediation process.
Supported Versions and Hardware
AutoClusters is currently supported on the following minimum CMK versions:
- 1.33.4-cmk.13
- 1.32.7-cmk.16
- 1.31.7-cmk.19
Future minor and patch releases within these Kubernetes versions (e.g., 1.33.5-cmk.X, 1.32.8-cmk.X) are also supported.
AutoClusters can remediate issues for Kubernetes nodes running on the following Crusoe GPU instance types:
- 10x NVIDIA L40S (`l40s-48gb.10x`)
- 8x NVIDIA A100 80GB (`a100-80gb.8x`, `a100-80gb-sxm-ib.8x`)
- 8x NVIDIA H100 80GB (`h100-80gb-sxm-ib.8x`)
- 8x NVIDIA H200 141GB (`h200-141gb-sxm-ib.8x`)
- 8x NVIDIA B200 180GB (`b200-180gb-sxm-ib-8x`)
- 4x NVIDIA GB200 186GB (`gb200-186gb-nvl-4x`)
AutoClusters does not support remediation on multi-tenant VM types where multiple Kubernetes nodes may be co-located on the same physical host. In those configurations, node replacement cannot be safely and deterministically executed, and AutoClusters will not trigger remediation.
How it Works
AutoClusters continuously monitors your infrastructure for hardware-related errors. When a critical issue is detected on a node, AutoClusters initiates a remediation process based on standard Kubernetes procedures.
The process involves:
- Graceful Termination: Your workloads are given time to shut down cleanly.
- Node Restart or Replacement: The unhealthy node is either restarted or removed from the cluster and replaced with a healthy one.
- Workload Rescheduling: Your pods are automatically rescheduled onto the new node.
This entire process is automated, allowing your workloads to recover from hardware failures without manual intervention.
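Because eviction during remediation follows standard Kubernetes drain semantics, a standard PodDisruptionBudget can limit how many replicas of a workload are disrupted at once while a node drains. This is a generic Kubernetes mechanism, not an AutoClusters feature; the sketch below assumes a workload labeled `app: my-app` (a hypothetical label):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb          # hypothetical name
spec:
  minAvailable: 1           # keep at least one replica running during a drain
  selector:
    matchLabels:
      app: my-app           # hypothetical label matching your workload's pods
```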
Enabling AutoClusters
Prerequisite: Enable the AutoClusters Addon
AutoClusters is a Kubernetes add-on that must be enabled at the cluster level.
- For new clusters: You can enable the AutoClusters add-on during the cluster creation process through the UI by selecting the AutoClusters add-on, or through the CLI via the `--add-ons` flag.
- For existing clusters: Please contact our support team to have the AutoClusters add-on enabled for your existing cluster.
Once the addon is enabled for your cluster, all nodes with hardware failures will be automatically remediated unless you opt out specific workloads. It is recommended that you implement preStop hooks for any workloads that require graceful shutdown.
Opting Out of Automated Remediation (Optional)
By default, all nodes are eligible for automated remediation. If you have specific workloads that should prevent automatic node replacement, you can opt them out by adding the autoclusters.crusoe.ai/remediationPolicy: "Disabled" label to the pod template of your resource (e.g., a Job or Deployment).
Here's an example for a Kubernetes Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-deployment-name
spec:
  # ...
  template:
    metadata:
      labels:
        autoclusters.crusoe.ai/remediationPolicy: "Disabled"
    # ...
```
All pods created from this template inherit the label. If any pod on a node has this label set to "Disabled", AutoClusters will not remediate that node.
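The same pod-template label works for any resource with a pod template; for example, a batch Job can be opted out the same way. A minimal sketch (the Job name, container name, and image are hypothetical placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: your-training-job       # hypothetical name
spec:
  template:
    metadata:
      labels:
        autoclusters.crusoe.ai/remediationPolicy: "Disabled"
    spec:
      containers:
        - name: trainer         # hypothetical container
          image: your-training-image
      restartPolicy: Never
```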
Implementing preStop Hooks for Graceful Shutdown
As a Kubernetes best practice, you should implement a preStop hook in your containers to ensure graceful shutdown during pod termination. This is important not just for AutoClusters remediation, but for any scenario where pods are deleted or rescheduled (such as deployments, node maintenance, or resource constraints).
Without a preStop hook, when remediation occurs, your workloads will be moved to another node but any in-progress work may be lost. The preStop hook is executed before the pod is terminated, giving your application time to save its state, checkpoint its work, or close any open connections.
Here's an example of a preStop hook that runs a shell script to save state:
```yaml
spec:
  template:
    # ...
    spec:
      containers:
        - name: my-container
          image: my-image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/your-save-state-script.sh"]
      # Optionally set a grace period for your hook to complete. If you do not
      # set this value, the default grace period of 30 seconds will apply.
      terminationGracePeriodSeconds: 120
# ...
```
In this example, your-save-state-script.sh is a script you provide to handle the graceful shutdown of your application.
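One common way to ship such a script alongside your container is to store it in a ConfigMap and mount it into the pod at the path the preStop hook expects. The sketch below is an illustrative assumption, not part of AutoClusters itself; the ConfigMap name and the script contents are hypothetical:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: save-state-script       # hypothetical name
data:
  your-save-state-script.sh: |
    #!/bin/sh
    # Hypothetical placeholder: checkpoint in-progress work,
    # flush buffers, and close open connections before shutdown.
    echo "Saving state before termination..."
---
# In the pod spec, mount the script where the preStop hook expects it:
spec:
  template:
    spec:
      containers:
        - name: my-container
          image: my-image
          volumeMounts:
            - name: save-state
              mountPath: /app/your-save-state-script.sh
              subPath: your-save-state-script.sh
      volumes:
        - name: save-state
          configMap:
            name: save-state-script
            defaultMode: 0755   # make the script executable
```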
The Remediation Process
AutoClusters handles two types of detected issues:
Detect-Only Alerts: Some detected issues indicate problems at the software layer that cannot be resolved by node replacement. For these software-level issues, you will be notified via email so you can investigate and take action as necessary. No automatic remediation occurs for these cases.
Hardware Failure Remediation: When AutoClusters detects a hardware failure that requires node replacement, it automatically performs the following steps:
- Verification: AutoClusters confirms that no pods on the node carry the `autoclusters.crusoe.ai/remediationPolicy: "Disabled"` label. If any pod on the node has this label set to `"Disabled"`, AutoClusters will not remediate the node.
- Node Drain: The Kubernetes node is cordoned and drained, which gracefully evicts all pods. Your container `preStop` hooks are executed at this stage.
- Node Replacement: The unhealthy node is removed from the node pool and replaced with a new, healthy node from spare capacity. The system first checks for suitable spare nodes in the Crusoe on-demand pool; if none are found, it looks in Crusoe's hot spare inventory. If no spares are available, the Crusoe team is notified, and you will also receive an email notification so you can take action as necessary.
- History Logging: A record of the alert and the remediation event is logged and made available in the Crusoe Cloud Console for your review.
Node Remediation History
You can view the history of remediation events for a cluster on the cluster's page in the UI, under the AutoClusters tab (next to the Details tab).
Edge Cases and Limitations
While AutoClusters is designed to handle failures automatically, there are some important cases to be aware of:
- Remediation Failures: If the triggered remediation action fails for any reason (e.g. no spare nodes are available to replace the unhealthy one), the remediation process will be aborted, you will be notified via email, and our support team will be notified to assist.
- Excessive Remediations: If a high number of remediation actions are triggered in a short time period, circuit breakers will pause future remediations and our support team will step in to investigate and address underlying causes.
- Conservative Alert Triggers: AutoClusters uses a comprehensive but conservative set of hardware failure detection rules to avoid over-remediation. We only remediate for failures we know require node replacement, ensuring your workloads aren't disrupted unnecessarily.