AutoClusters
AutoClusters is currently in Limited Availability. If you would like to join the Limited Availability program, please contact customer support to request access.
Automated Node Remediation with AutoClusters
AutoClusters enhances the resilience of your CMK workloads by automatically detecting and resolving common hardware failures. By enabling AutoClusters, you can minimize downtime and reduce the need for manual intervention, ensuring higher effective utilization for your clusters.
This guide explains how AutoClusters works, how to enable it for your deployments, and what to expect during the automated remediation process.
Supported Versions and Hardware
AutoClusters is currently supported on the following minimum CMK versions:
- 1.33.4-cmk.13
- 1.32.7-cmk.16
- 1.31.7-cmk.19
Future minor and patch releases within these Kubernetes versions (e.g., 1.33.5-cmk.X, 1.32.8-cmk.X) are also supported.
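To check what your nodes are currently running, you can list them with kubectl and compare the reported version against the minimums above. This is a quick sketch, assuming your kubeconfig already points at the cluster; whether the -cmk suffix appears in the output depends on how the cluster reports its version.
# List cluster nodes; the VERSION column shows the Kubernetes version each node reports
kubectl get nodes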
AutoClusters can remediate issues for Kubernetes nodes running on the following Crusoe GPU instance types:
- 10x NVIDIA L40S (l40s-48gb.10x)
- 8x NVIDIA A40 48GB (a40.8x)
- 8x NVIDIA A100 80GB (a100-80gb.8x, a100-80gb-sxm-ib.8x)
- 8x NVIDIA H100 80GB (h100-80gb-sxm-ib.8x)
- 8x NVIDIA H200 141GB (h200-141gb-sxm-ib.8x)
- 8x NVIDIA B200 180GB (b200-180gb-sxm-ib-8x)
- 4x NVIDIA GB200 186GB (gb200-186gb-nvl-4x)
AutoClusters does not support remediation on multi-tenant VM types where multiple Kubernetes nodes may be co-located on the same physical host. In those configurations, node replacement cannot be safely and deterministically executed, and AutoClusters will not trigger remediation.
How it Works
AutoClusters continuously monitors your infrastructure for hardware-related errors. When a critical issue is detected on a node, AutoClusters initiates a remediation process based on standard Kubernetes procedures.
The process involves:
- Graceful Termination: Your workloads are given time to shut down cleanly.
- Node Restart or Replacement: The unhealthy node is either restarted or removed from the cluster and replaced with a healthy one.
- Workload Rescheduling: Your pods are automatically rescheduled onto the new node.
This entire process is automated, allowing your workloads to recover from hardware failures without manual intervention.
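For reference, the cordon-and-drain portion of this process is equivalent to what an operator would otherwise do by hand with standard kubectl tooling. The sketch below uses a placeholder node name; AutoClusters runs the equivalent steps automatically, so you do not need to run these commands yourself.
# Mark the unhealthy node unschedulable so no new pods land on it
kubectl cordon <unhealthy-node-name>
# Evict pods gracefully; container preStop hooks run during eviction
kubectl drain <unhealthy-node-name> --ignore-daemonsets --delete-emptydir-data
Once a healthy replacement node joins the cluster, the Kubernetes scheduler places the evicted pods onto it.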
Enabling AutoClusters
Enabling AutoClusters for your workloads is a multi-step process. First, you need to ensure the AutoClusters addon is enabled for your cluster. Then, you can configure your individual workloads to use it.
Prerequisite: Enable the AutoClusters Addon
AutoClusters is a Kubernetes add-on that must be enabled at the cluster level.
- For new clusters: You can enable AutoClusters during the cluster creation process by selecting the AutoClusters addon in the UI.
- For existing clusters: Please contact our support team to have the AutoClusters addon enabled for your existing cluster.
Once the addon is enabled for your cluster, you can proceed with the following steps to configure your workloads for automated remediation.
Step 1: Add the AutoClusters Label to Your Pod Template
To opt a workload into automated remediation, add the autoclusters.crusoe.ai/remediationPolicy: "Enabled" label to the pod template of your resource (e.g., a Job or Deployment).
Here's an example for a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: your-deployment-name
spec:
  # ...
  template:
    metadata:
      labels:
        autoclusters.crusoe.ai/remediationPolicy: "Enabled"
    # ...
All pods created from this template will inherit the label, making them eligible for automated remediation. If no pod running on a Kubernetes node has this label set to "Enabled", AutoClusters will not take any action on that node.
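After the workload rolls out, one way to confirm the label actually propagated to the running pods is a label-selector query. This is a minimal example; the namespace is a placeholder for wherever your workload runs.
# List pods that have opted into automated remediation
kubectl get pods -n <your-namespace> -l autoclusters.crusoe.ai/remediationPolicy=Enabled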
Step 2: Implement a preStop Hook
You should implement a preStop hook in any containers that require extra steps to shut down gracefully. This hook runs in the container before the pod is terminated, giving your application time to save its state, checkpoint its work, or close any open connections.
Here's an example of a preStop hook that runs a shell script to save state:
spec:
  template:
    # ...
    spec:
      containers:
        - name: my-container
          image: my-image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/your-save-state-script.sh"]
      # Optionally set a grace period for your hook to complete. If you do not set
      # this value, the default grace period of 30 seconds will apply.
      terminationGracePeriodSeconds: 120
      # ...
In this example, your-save-state-script.sh is a script you provide to handle the graceful shutdown of your application.
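What that script does is entirely up to your application. Purely as an illustration, a checkpoint-style shutdown script might look something like the sketch below; the paths and checkpoint directory are hypothetical and would need to match your own setup.
#!/bin/sh
# Hypothetical preStop script: copy application state to a mounted checkpoint volume
echo "preStop: saving state before remediation" >> /var/log/prestop.log
cp -r /app/state "/mnt/checkpoints/$(hostname)-$(date +%s)"
Keep the script's runtime comfortably within terminationGracePeriodSeconds, since the pod is forcefully killed once the grace period expires.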
The Remediation Process
When AutoClusters detects an error on a node running one of your opted-in pods, it automatically performs the following steps:
- Verification: AutoClusters confirms that at least one pod on the node has the autoclusters.crusoe.ai/remediationPolicy: "Enabled" label. If no pod running on the node has this label set to "Enabled", AutoClusters will not take any action on the node.
- Node Drain: The Kubernetes node is cordoned and drained, which gracefully evicts all pods. Your container preStop hooks are executed at this stage.
- Node Replacement: The unhealthy node is removed from the node pool and replaced with a new, healthy node from spare capacity. The system first checks whether there are any suitable spare nodes in your project's reservation; if none are found, it looks for suitable spares in Crusoe's spare inventory. This gives customers the option to leave spare capacity in their projects to guarantee availability of replacement nodes.
- History Logging: A record of the alert and the remediation event is logged and made available in the Crusoe Cloud Console for your review.
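The Verification step above looks at the pods scheduled on the affected node. If you want to see which pods on a particular node carry the opt-in label yourself, a field-selector query like the following should work; the node name is a placeholder.
# List opted-in pods on a specific node, across all namespaces
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -l autoclusters.crusoe.ai/remediationPolicy=Enabled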
Node Remediation History
You can view the history of remediation events for a cluster in the UI on the cluster page's Remediations tab, next to the Details tab.
Edge Cases and Limitations
While AutoClusters is designed to handle failures automatically, there are some important cases to be aware of:
- Remediation Failures: If the triggered remediation action fails for any reason (e.g. no spare nodes are available to replace the unhealthy one), the remediation process will be aborted, you will be notified via email, and our support team will be notified to assist.
- Excessive Remediations: If a high number of remediation actions are triggered in a short time period, future remediations will be paused, you will be notified via email, and our support team will be notified to investigate possible underlying causes.
- Limited Alert Triggers: Different workloads may trigger different hardware failure modes. To limit improper remediations due to possible alert noise, we will start with a constrained set of alerts on well-known GPU failure modes (e.g. NVIDIA XID errors). We will enhance our suite of alerts and corresponding remediations as the Limited Availability period progresses.