Overview

Command Center provides a unified operations platform for your Crusoe GPU clusters, replacing fragmented monitoring tools with centralized observability, automated alerting, and integrated support workflows.

Why Command Center

Large-scale AI workloads require visibility into every resource in your cluster. Command Center delivers real-time telemetry across your infrastructure, so you no longer need to switch between SSH sessions, log dumps, and third-party dashboards.

Key Capabilities

  • Infrastructure Overview — See a project-level fleet summary grouped by instance type, with GPU utilization, health status, and drill-down to cluster or VM details.
  • View cluster topology — See health and utilization of every node, arranged by network topology.
  • Monitor metrics — Track GPU, CPU, memory, storage, and network performance. Ingest custom application metrics.
  • Access logs — Query JournalD system logs without SSH.
  • Export telemetry — Export metrics to Grafana, Datadog, or Splunk via Prometheus-compatible endpoints.
  • Receive alerts — Get notified about hardware failures and cluster events via email, Slack, or webhooks.
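Because metrics are exposed through Prometheus-compatible endpoints, a client written for the standard Prometheus HTTP API should be able to read them. The following is a minimal sketch, not a reference implementation. The base URL, the bearer-token authentication, and the assumption that the endpoint serves the standard `/api/v1/query` route are all placeholders that this page does not confirm; check the Metrics documentation for the real values.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: replace with the endpoint and token for your project.
BASE_URL = "https://metrics.example.invalid/prometheus"
TOKEN = "REPLACE_ME"


def build_query_request(base_url: str, token: str, promql: str) -> urllib.request.Request:
    """Build a GET request for the standard Prometheus instant-query route."""
    query = urllib.parse.urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
    # Bearer auth is an assumption; use whatever scheme your endpoint requires.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Turn a Prometheus instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    # Each sample looks like {"metric": {labels}, "value": [timestamp, "value as string"]}.
    return [(item["metric"], float(item["value"][1])) for item in results]


def run_query(promql: str) -> list[tuple[dict, float]]:
    """Send the query and return parsed samples (performs a network call)."""
    request = build_query_request(BASE_URL, TOKEN, promql)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

Calling `run_query` with a PromQL expression of your choice returns one `(labels, value)` pair per matching series. The metric names available depend on your deployment and are not listed on this page.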

Components

Command Center consists of the following components:

| Component | Description | Availability |
| --- | --- | --- |
| Infrastructure and Topology Overview | Project-level fleet summary grouped by instance type with health and GPU utilization, plus visual cluster topology with node health overlays | CMK and VM (CWA required); topology view: CMK only |
| Metrics | Infrastructure and custom application metrics with Prometheus-compatible API and Crusoe Cloud Console | CMK and VM (custom metrics: CMK only) |
| Logs | Managed log collection and search for Kubernetes and system logs | CMK and VM |
| Instance Health | Healthy, Degraded, or Unhealthy status for CMK nodes and standalone VMs, derived from GPU telemetry and lifecycle events | CMK and VM (CWA required) |
| Alerts | Get notified about hardware failures and cluster events via email, Slack, or webhooks | CMK and VM |
| Telemetry Conduit | Export infrastructure metrics to external observability platforms | CMK and VM |
| Natural Language Query | Query infrastructure metrics and logs in plain English via Crusoe MCP | CMK and VM (see Crusoe MCP) |
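For the Alerts component, webhook delivery means that some service you run must accept HTTP POST requests. The sketch below shows one possible receiver built only on the Python standard library. This page does not describe the alert payload schema, so the handler assumes only that the body is a JSON object and makes no assumptions about field names; the port number and the `204` response are likewise illustrative choices.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_alert(body: bytes) -> dict:
    """Decode a webhook body as a JSON object; raise ValueError otherwise."""
    data = json.loads(body.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts POSTed alerts and prints them; swap print for your own routing."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alert = decode_alert(self.rfile.read(length))
        except ValueError:
            # Malformed or non-object JSON: reject without processing.
            self.send_response(400)
            self.end_headers()
            return
        print("alert received:", alert)
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen on all interfaces until interrupted (blocking call)."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Call `serve()` to start listening, then register the resulting URL as the webhook target. A production receiver would also need TLS and some form of request authentication, which this sketch omits.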

Prerequisites

To use Command Center, you need:

  • Crusoe Cloud account with an active project
  • Crusoe CLI installed and configured
  • kubectl configured with cluster access if you are a CMK user
  • helm installed if you are a CMK user
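The tooling prerequisites above can be checked before you begin. The following sketch looks for the required executables on your `PATH`. The binary name `crusoe` for the Crusoe CLI is an assumption, as is the helper name `missing_tools`. The check confirms only that the tools are installed, not that they are configured or that `kubectl` has cluster access.

```python
import shutil

# Binary names are assumptions: `crusoe` for the Crusoe CLI, plus kubectl and helm.
BASE_TOOLS = ["crusoe"]
CMK_TOOLS = ["kubectl", "helm"]


def missing_tools(cmk_user: bool, which=shutil.which) -> list[str]:
    """Return the required executables that are not found on PATH."""
    required = BASE_TOOLS + (CMK_TOOLS if cmk_user else [])
    return [tool for tool in required if which(tool) is None]
```

For example, `missing_tools(cmk_user=True)` returns an empty list when all three tools are installed, and otherwise lists the ones that still need to be installed.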

Get Started

Command Center requires the Crusoe Watch Agent (CWA) to collect telemetry from your infrastructure. For installation instructions, token generation, and access method details, see Get Started.

Integration with Crusoe Services

Command Center integrates with AutoClusters to automate hardware failure detection and node replacement on CMK clusters. Remediation events appear in Notification Center.

For GPU XID error alerts on standalone VMs and CMK nodes, see Notifications.

What's Next

  • Infrastructure and Topology Overview — View fleet-level health and utilization, and drill into cluster topology
  • Instance Health — Understand health status categories and error codes
  • Metrics — Configure and query infrastructure and custom metrics
  • Logs — Search and filter Kubernetes and system logs
  • Telemetry Conduit — Export metrics to external platforms
  • Notification — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks