Notifications

Our notification service routes alerts from across your Crusoe Cloud organization to the channels your team already uses. When an important event occurs, such as an infrastructure failure detected by Command Center, a budget threshold set via Budget Alerts being reached, or an automated remediation action taken by AutoClusters, the notification service delivers a structured notification to your configured destinations, including the Crusoe Cloud Console, email, Slack, and webhook endpoints.

Notifications are available for all Crusoe Cloud users.

How it Works

Crusoe Cloud continuously monitors your infrastructure, usage, and services. When a significant event is detected, such as a GPU hardware error, a budget threshold being exceeded, or a node replacement, the event is published to a notification pipeline. The notification service consumes these events and routes them to your configured channels with relevant context, including affected resources, event descriptions, actions taken, and links to investigate further in the Crusoe Cloud Console (if applicable).

Accessing Your Notifications

Console Notifications

All notifications are displayed in the Crusoe Cloud Console by default. To view them:

Click on the bell icon in the top right corner to view the latest, unread notifications from the side bar. You can also dismiss notifications from this view.
Select All Notifications from the side bar to view your complete notification history. Note that previously dismissed notifications will still show up in this view.

Email Notifications

Email notifications are sent automatically to relevant team members when a critical event occurs.

For critical resource health alerts, emails are sent to all users in an organization and will include:

VM ID and VM name of the affected node
Cluster name and cluster ID
A brief description of the detected issue and the action taken
A link to the relevant view in Crusoe Cloud Console (requires authentication)

Note: Email notifications are enabled by default. No additional configuration is required.

Slack and Webhook Notifications

To receive Slack and webhook notifications, you must first configure them. See the following sections on how to configure these delivery channels.

Configuring Slack Notifications

You can route notifications to a Slack channel so your on-call team sees alerts in real time.

In the Crusoe Cloud Console, click on the bell icon in the top right corner.
Click All Notifications, then click Configure Slack/Webhook Notifications in the top right corner. This will link out to a separate page.
On the new page, click Add Endpoint.
Select Slack as the endpoint type.
Provide your Slack incoming webhook URL or click Connect to Slack (requires authentication). To generate a webhook URL, follow the Slack documentation on incoming webhooks.
Select which event types to subscribe to.
Click Create.

Note: If connecting to Slack or creating the incoming webhook URL requires approval in your enterprise Slack account, ask your IT department to authorize the Svix app for your account.

Slack notifications include the same structured information as email notifications: affected resources, event description, action taken, and a link to the Crusoe Cloud Console (if applicable).

Configuring Webhook Notifications

For integration with PagerDuty, Opsgenie, custom automation, or other tools, you can configure a generic webhook endpoint.

In the Crusoe Cloud Console, click on the bell icon in the top right corner.
Click All Notifications, then click Configure Slack/Webhook Notifications in the top right corner. This will link out to a separate page.
Click Add Endpoint.
Select Webhook as the endpoint type.
Provide your webhook endpoint URL.
Select which event types to subscribe to.
Click Create.

Webhook payloads are delivered as HTTP POST requests with a JSON body containing the event details. The exact structure varies by event type. An example JSON body for an AutoClusters node replacement event is as follows:

{
  "event_type": "node_replacement_initiated",
  "cluster_id": "cluster-abc123",
  "cluster_name": "training-cluster-prod",
  "vm_id": "vm-xyz789",
  "vm_name": "worker-node-42",
  "error_code": "XID 79",
  "description": "GPU has fallen off the PCIe bus. AutoClusters has initiated node replacement.",
  "action": "REPLACE_NODE",
  "timestamp": "2026-02-11T14:32:00Z",
  "command_center_url": "https://console.crusoecloud.com/orchestration/clusters/cluster-abc123/command-center"
}
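To illustrate the receiving side, here is a minimal sketch of a webhook endpoint built on Python's standard library. It assumes the payload keys shown in the example above; the names `summarize_event`, `NotificationHandler`, and `serve`, the port, and the choice of response codes are hypothetical and not part of Crusoe Cloud. Because the structure varies by event type, every field is read defensively.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(payload: dict) -> str:
    """Build a one-line summary from a notification payload.

    Every key is optional here because the payload structure
    varies by event type.
    """
    event_type = payload.get("event_type", "unknown_event")
    node = payload.get("vm_name") or payload.get("vm_id") or "unknown node"
    cluster = (
        payload.get("cluster_name")
        or payload.get("cluster_id")
        or "unknown cluster"
    )
    code = payload.get("error_code")
    suffix = f" ({code})" if code else ""
    return f"{event_type}: {node} in {cluster}{suffix}"


class NotificationHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver: accepts POSTed JSON and logs a summary."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except ValueError:
            # Covers malformed JSON and undecodable bytes.
            payload = None
        if not isinstance(payload, dict):
            self.send_response(400)
            self.end_headers()
            return
        print(summarize_event(payload))
        # Acknowledge receipt with an empty success response (assumed choice).
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("", port), NotificationHandler).serve_forever()
```

Calling `serve()` starts the listener; the URL it is reachable at is what you would enter as the webhook endpoint URL. This sketch does not authenticate incoming requests, which a production receiver should do.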

Notification Events

Crusoe Cloud generates alerts for the following event categories.

AutoClusters Remediation Events

If your cluster has AutoClusters enabled, you receive notifications for the following events:

Event	Description
Node replacement initiated	AutoClusters has detected a critical hardware failure and started the remediation process. Includes the error code and affected node.
Node replacement completed	A faulty node has been successfully drained, removed, and replaced with a healthy node from the spare pool.
Node replacement failed	The remediation process could not complete — for example, if no spare nodes are available. Crusoe Cloud Support is automatically notified.
Detection only	A hardware issue was detected but does not meet the threshold for automatic remediation. The notification includes the error code for your review.

Critical error codes that trigger node replacement:

Error Code	Description
GPUFellOffTheBus	GPU lost from PCIe bus
HCAFellOffTheBus	Host Channel Adapter (InfiniBand) lost
XID 48	Uncorrectable double-bit ECC memory error
XID 64	GPU memory error recovery failure
XID 74	NVLink interconnect error
XID 79	GPU fell off the PCIe bus
XID 119	GSP not responding to driver RPC requests
XID 120	Driver failed to recover from GSP timeout

Detection-only error codes (no automatic remediation):

Error Code	Description
XID 76	Internal micro-controller breakpoint
XID 94	Contained ECC error
XID 95	Uncontained ECC error
XID 137	Unexpected completion
XID 140	Unrecovered ECC error
XID 143	GPU initialization failure
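If you route webhook notifications into your own tooling, the two tables above can be encoded as a simple lookup. The following is a sketch only: the function name `classify_error_code` and its return labels are hypothetical, and it assumes the `error_code` field carries the codes exactly as written in the tables (for example `"XID 79"`).

```python
# Codes copied from the tables above.
CRITICAL_CODES = {
    "GPUFellOffTheBus", "HCAFellOffTheBus",
    "XID 48", "XID 64", "XID 74", "XID 79", "XID 119", "XID 120",
}
DETECTION_ONLY_CODES = {
    "XID 76", "XID 94", "XID 95", "XID 137", "XID 140", "XID 143",
}


def classify_error_code(code: str) -> str:
    """Return how a given error code is handled, per the tables above.

    The labels "node_replacement", "detection_only", and "unknown"
    are illustrative, not values emitted by Crusoe Cloud.
    """
    # Collapse stray whitespace so "XID  79 " matches "XID 79".
    normalized = " ".join(code.split())
    if normalized in CRITICAL_CODES:
        return "node_replacement"
    if normalized in DETECTION_ONLY_CODES:
        return "detection_only"
    return "unknown"
```

For example, `classify_error_code("XID 79")` falls in the node replacement group, while a code that appears in neither table is reported as `"unknown"` rather than guessed at.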

Critical Failure Events

For nodes not covered by AutoClusters (for example, non-GPU nodes or nodes in clusters without AutoClusters enabled), you will be notified about critical hardware failures that require manual intervention.

Budget Alerts

Budget alerts help you monitor and control your cloud spending by notifying you when costs reach predefined thresholds.

Additional Notification Types

Notifications and alerting will continue to expand with support for additional event types, including inference service notifications and organization-level alerts.

Notification Details for Resource Health Alerts

For resource health alerts, each notification includes the following information:

Field	Description
VM ID	The unique identifier of the affected node
VM name	The human-readable name of the affected node
Cluster name	The cluster containing the affected node
Cluster ID	The unique identifier of the cluster
Event type	The category of the event (replacement, maintenance, failure)
Error code	The specific hardware error code, if applicable
Description	A brief explanation of what was detected
Action taken	The remediation action (automatic or manual)
Timestamp	When the event was detected
Crusoe Cloud Console	A direct link to the relevant view in Crusoe Cloud Console
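For consumers that prefer typed access to these fields, the sketch below maps a webhook payload onto a dataclass. It assumes the key names from the example JSON body shown earlier (`vm_id`, `cluster_name`, `command_center_url`, and so on); other event types may use different keys. The class name `ResourceHealthAlert` and its `from_payload` method are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ResourceHealthAlert:
    """Typed view of the fields listed in the table above."""

    vm_id: str
    vm_name: str
    cluster_name: str
    cluster_id: str
    event_type: str
    description: str
    timestamp: datetime
    error_code: Optional[str] = None   # present only "if applicable"
    action: Optional[str] = None
    console_url: Optional[str] = None

    @classmethod
    def from_payload(cls, payload: dict) -> "ResourceHealthAlert":
        # Rewrite the trailing "Z" as an explicit UTC offset so that
        # datetime.fromisoformat accepts it on older Python versions.
        raw_ts = payload["timestamp"]
        ts = datetime.fromisoformat(raw_ts.replace("Z", "+00:00"))
        return cls(
            vm_id=payload["vm_id"],
            vm_name=payload["vm_name"],
            cluster_name=payload["cluster_name"],
            cluster_id=payload["cluster_id"],
            event_type=payload["event_type"],
            description=payload["description"],
            timestamp=ts,
            error_code=payload.get("error_code"),
            action=payload.get("action"),
            # Key name taken from the example payload (assumption).
            console_url=payload.get("command_center_url"),
        )
```

Required fields raise `KeyError` when missing, which surfaces unexpected payload shapes early; optional fields default to `None`.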

What's Next

Topology — Investigate affected nodes in the cluster topology
Logs — Review system logs for the nodes referenced in notifications
Metrics — Check performance trends around the time of the event
AutoClusters — Learn more about automated hardware remediation
