Notifications

The notification service routes alerts from across your Crusoe Cloud organization to the channels your team already uses. When an important event occurs, such as an infrastructure failure detected by Command Center, a budget threshold set via Billing Alerts being reached, or an automated remediation action taken by AutoClusters, the service delivers a structured notification to your configured destinations, including the Crusoe Cloud Console, email, Slack, and webhook endpoints.

Notifications are available for all Crusoe Cloud users.

How it Works

Crusoe Cloud continuously monitors your infrastructure, usage, and services. When a significant event is detected, such as a GPU hardware error, a budget threshold being exceeded, or a node replacement, the event is published to a notification pipeline. The notification service consumes these events and routes them to your configured channels with relevant context, including affected resources, event descriptions, actions taken, and links to investigate further in the Crusoe Cloud Console (if applicable).

Accessing Your Notifications

Console Notifications

All notifications are displayed in the Crusoe Cloud Console by default. To view them:

  1. Click the bell icon in the top right corner to open the sidebar, which shows your latest unread notifications. You can also dismiss notifications from this view.
  2. Select All Notifications in the sidebar to view your complete notification history. Previously dismissed notifications still appear in this view.

Email Notifications

Email notifications are sent automatically to relevant team members when a critical event occurs.

For critical resource health alerts, emails are sent to all users in an organization and will include:

  • VM ID and VM name of the affected node
  • Cluster name and cluster ID
  • A brief description of the detected issue and the action taken
  • A link to the relevant view in Crusoe Cloud Console (requires authentication)

Note: Email notifications are enabled by default. No additional configuration is required.

Slack and Webhook Notifications

To receive Slack and webhook notifications, you must first configure them. See the following sections on how to configure these delivery channels.

Configuring Slack Notifications

You can route notifications to a Slack channel so your on-call team sees alerts in real time.

  1. In the Crusoe Cloud Console, click on the bell icon in the top right corner.
  2. Click All Notifications, then click Configure Slack/Webhook Notifications in the top right corner. This will link out to a separate page.
  3. On the new page, click Add Endpoint.
  4. Select Slack as the endpoint type.
  5. Provide your Slack incoming webhook URL or click Connect to Slack (requires authentication). To generate a webhook URL, follow the Slack documentation on incoming webhooks.
  6. Select which event types to subscribe to.
  7. Click Create.

Note: If authorizing Connect to Slack or creating the incoming webhook URL requires approval from your enterprise Slack account, ask your IT department to authorize the Svix app for your account.

Slack notifications include the same structured information as email notifications: affected resources, event description, action taken, and a link to the Crusoe Cloud Console (if applicable).

Configuring Webhook Notifications

For integration with PagerDuty, Opsgenie, custom automation, or other tools, you can configure a generic webhook endpoint.

  1. In the Crusoe Cloud Console, click on the bell icon in the top right corner.
  2. Click All Notifications, then click Configure Slack/Webhook Notifications in the top right corner. This will link out to a separate page.
  3. Click Add Endpoint.
  4. Select Webhook as the endpoint type.
  5. Provide your webhook endpoint URL.
  6. Select which event types to subscribe to.
  7. Click Create.

Webhook payloads are delivered as HTTP POST requests with a JSON body containing the event details. The exact structure varies by event type. The following is an example JSON body for an AutoClusters node replacement event:

{
  "event_type": "node_replacement_initiated",
  "cluster_id": "cluster-abc123",
  "cluster_name": "training-cluster-prod",
  "vm_id": "vm-xyz789",
  "vm_name": "worker-node-42",
  "error_code": "XID 79",
  "description": "GPU has fallen off the PCIe bus. AutoClusters has initiated node replacement.",
  "action": "REPLACE_NODE",
  "timestamp": "2026-02-11T14:32:00Z",
  "command_center_url": "https://console.crusoecloud.com/orchestration/clusters/cluster-abc123/command-center"
}
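
A receiving endpoint for payloads like the one above can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation: the field names come from the example payload, while the `summarize_event` helper, the `NotificationHandler` class, the `serve` function, and port 8080 are assumptions made for this sketch. Because the payload structure varies by event type, only `event_type` is treated as required and every other field is read defensively.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(body: bytes) -> str:
    """Turn a webhook JSON body into a one-line summary.

    Keys other than `event_type` are taken from the example payload and
    may be missing for other event types, so they fall back to defaults.
    """
    event = json.loads(body)
    if not isinstance(event, dict) or "event_type" not in event:
        raise ValueError("payload is not a JSON object with an event_type")
    node = event.get("vm_name") or event.get("vm_id") or "unknown node"
    cluster = event.get("cluster_name") or event.get("cluster_id") or "unknown cluster"
    code = event.get("error_code", "n/a")
    return f"[{event['event_type']}] {node} in {cluster} (error code: {code})"


class NotificationHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver: accepts POST requests and prints a summary."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            print(summarize_event(body))
            self.send_response(204)  # received; no response body needed
        except ValueError:  # also covers json.JSONDecodeError
            self.send_response(400)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the listener; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), NotificationHandler).serve_forever()
```

Calling `serve()` starts the listener. In practice the endpoint would sit behind HTTPS and hand the parsed event to a paging or automation tool rather than printing it.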

Notification Events

Crusoe Cloud generates alerts for the following event categories.

AutoClusters Remediation Events

If your cluster has AutoClusters enabled, you receive notifications for the following events:

| Event | Description |
| --- | --- |
| Node replacement initiated | AutoClusters has detected a critical hardware failure and started the remediation process. Includes the error code and affected node. |
| Node replacement completed | A faulty node has been successfully drained, removed, and replaced with a healthy node from the spare pool. |
| Node replacement failed | The remediation process could not complete, for example because no spare nodes are available. Crusoe Cloud Support is automatically notified. |
| Detection only | A hardware issue was detected but does not meet the threshold for automatic remediation. The notification includes the error code for your review. |

Critical error codes that trigger node replacement:

| Error Code | Description |
| --- | --- |
| GPUFellOffTheBus | GPU lost from PCIe bus |
| HCAFellOffTheBus | Host Channel Adapter (InfiniBand) lost |
| XID 48 | Uncorrectable double-bit ECC memory error |
| XID 64 | GPU memory error recovery failure |
| XID 74 | NVLink interconnect error |
| XID 79 | GPU fell off the PCIe bus |
| XID 119 | GSP not responding to driver RPC requests |
| XID 120 | Driver failed to recover from GSP timeout |

Detection-only error codes (no automatic remediation):

| Error Code | Description |
| --- | --- |
| XID 76 | Internal micro-controller breakpoint |
| XID 94 | Contained ECC error |
| XID 95 | Uncontained ECC error |
| XID 137 | Unexpected completion |
| XID 140 | Unrecovered ECC error |
| XID 143 | GPU initialization failure |
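
If your own tooling needs to tell the two groups apart, for example to page someone only for codes that lead to node replacement, the lists above can be encoded as a small lookup. This is a sketch under stated assumptions: the codes are copied from this page and may change, and the function name and its return labels (`node_replacement`, `detection_only`, `unknown`) are illustrative, not values the notification service emits.

```python
# Codes copied from the two lists above; treat them as a snapshot.
REPLACEMENT_CODES = frozenset({
    "GPUFellOffTheBus", "HCAFellOffTheBus",
    "XID 48", "XID 64", "XID 74", "XID 79", "XID 119", "XID 120",
})
DETECTION_ONLY_CODES = frozenset({
    "XID 76", "XID 94", "XID 95", "XID 137", "XID 140", "XID 143",
})


def classify_error_code(code: str) -> str:
    """Return a hypothetical label describing how a code is handled."""
    # Collapse repeated or surrounding whitespace so "XID  79 " matches "XID 79".
    normalized = " ".join(code.split())
    if normalized in REPLACEMENT_CODES:
        return "node_replacement"
    if normalized in DETECTION_ONLY_CODES:
        return "detection_only"
    # Codes not listed on this page are reported as unknown rather than guessed.
    return "unknown"
```

Returning `unknown` for unlisted codes keeps the helper from silently misclassifying codes added after this snapshot was taken.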

Critical Failure Events

For nodes not covered by AutoClusters (for example, non-GPU nodes or nodes in clusters without AutoClusters enabled), you will be notified about critical hardware failures that require manual intervention.

Billing Alerts

Billing alerts help you monitor and control your cloud spending by notifying you when costs reach predefined thresholds.

Additional Notification Types

Notifications and alerting will continue to expand with support for additional event types, including inference service notifications and organization-level alerts.

Notification Details for Resource Health Alerts

For resource health alerts, each notification includes the following information:

| Field | Description |
| --- | --- |
| VM ID | The unique identifier of the affected node |
| VM name | The human-readable name of the affected node |
| Cluster name | The cluster containing the affected node |
| Cluster ID | The unique identifier of the cluster |
| Event type | The category of the event (replacement, maintenance, failure) |
| Error code | The specific hardware error code, if applicable |
| Description | A brief explanation of what was detected |
| Action taken | The remediation action (automatic or manual) |
| Timestamp | When the event was detected |
| Crusoe Cloud Console | A direct link to the relevant view in Crusoe Cloud Console |
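
For webhook consumers, these fields can be carried in a typed record. The sketch below assumes that the JSON keys match the example node replacement payload shown earlier on this page (`vm_id`, `cluster_name`, `command_center_url`, and so on). The mapping from table fields to keys is inferred from that single example and is not a documented schema, and the `ResourceHealthAlert` class is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ResourceHealthAlert:
    """Hypothetical typed view of a resource health alert."""
    vm_id: str
    vm_name: str
    cluster_name: str
    cluster_id: str
    event_type: str
    error_code: Optional[str]   # "if applicable" per the field list
    description: str
    action: str
    timestamp: datetime
    console_url: Optional[str]  # assumed to map to `command_center_url`

    @classmethod
    def from_payload(cls, payload: dict) -> "ResourceHealthAlert":
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        ts = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
        return cls(
            vm_id=payload["vm_id"],
            vm_name=payload["vm_name"],
            cluster_name=payload["cluster_name"],
            cluster_id=payload["cluster_id"],
            event_type=payload["event_type"],
            error_code=payload.get("error_code"),
            description=payload["description"],
            action=payload["action"],
            timestamp=ts,
            console_url=payload.get("command_center_url"),
        )
```

Required keys raise `KeyError` when absent, which surfaces schema differences between event types early instead of passing partial records downstream.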

What's Next

  • Topology — Investigate affected nodes in the cluster topology
  • Logs — Review system logs for the nodes referenced in notifications
  • Metrics — Check performance trends around the time of the event
  • AutoClusters — Learn more about automated hardware remediation