Command Center

Command Center provides a unified operations platform for your Crusoe GPU clusters, replacing fragmented monitoring tools with centralized observability, automated alerting, and integrated support workflows.

Why Command Center

Large-scale AI workloads require visibility into every resource in your cluster. Command Center delivers real-time telemetry across your infrastructure, so you no longer need to switch between SSH sessions, log dumps, and third-party dashboards.

Key Capabilities

  • View cluster topology — See health and utilization of every node, arranged by network topology.
  • Monitor metrics — Track GPU, CPU, memory, storage, and network performance. Ingest custom application metrics.
  • Access logs — Query journald system logs without SSH.
  • Export telemetry — Export metrics to Grafana, Datadog, or Splunk via Prometheus-compatible endpoints.
  • Receive alerts — Get notified about hardware failures and cluster events via email, Slack, or webhooks.

Components

Command Center consists of the following components:

Component | Description | Availability
Topology | Visual cluster topology with GPU utilization, CPU utilization, and node health overlays | CMK only
Metrics | Infrastructure and custom application metrics with Prometheus-compatible API and Crusoe Cloud Console | CMK and VM (custom metrics: CMK only)
Logs | Managed log collection and search for Kubernetes and system logs | CMK and VM
Telemetry Relay | Export infrastructure metrics to external observability platforms | CMK and VM

Prerequisites

To use Command Center, you need:

  • Crusoe Cloud account with an active project
  • Crusoe CLI installed and configured
  • kubectl configured with cluster access if you are a CMK user
  • helm installed if you are a CMK user
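
As a quick sanity check for the prerequisites above, a small shell helper can report which command-line tools are missing from your PATH. This is a minimal sketch: the function name missing_tools is hypothetical, and it assumes the Crusoe CLI binary is named crusoe. It only confirms that the tools are installed, not that they are configured.

```shell
# Hypothetical helper: print each requested tool that is not found on PATH.
# Returns 0 when every tool is present, 1 otherwise.
missing_tools() {
  local status=0 tool
  for tool in "$@"; do
    # command -v is the POSIX way to test whether a command can be resolved
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Example usage (assumes the Crusoe CLI binary is called `crusoe`):
#   missing_tools crusoe                 # VM users
#   missing_tools crusoe kubectl helm    # CMK users
```

An empty output with a zero exit status means every listed tool was found.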

Getting Started

Command Center requires the Crusoe Watch Agent to collect telemetry from your infrastructure. The setup process differs between CMK clusters and VMs.

CMK Clusters

  1. Deploy a CMK cluster — Follow Managing your Clusters if needed.
  2. Install the Crusoe Watch Agent — For CMK version 1.33.4-cmk.31+, the agent is automatically installed. For earlier versions, see Manual Installation for CMK below.
  3. Access Command Center — Navigate to Orchestration > select your cluster.
  4. Explore — Start with Topology, then drill into Metrics and Logs.
Note: Starting with CMK version 1.33.4-cmk.31, the Crusoe Watch Agent (Helm chart version 0.3.7 or higher) is automatically installed during cluster creation. Follow the manual installation steps only if you run an earlier CMK version or need to install or upgrade the agent manually.

Helm chart version 0.3.12 or higher is required for in-console NVIDIA bug report generation on CMK clusters.
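
To decide whether a cluster falls under the automatic installation described above, you can compare its CMK version against 1.33.4-cmk.31. The sketch below is illustrative only: the function name is hypothetical, it assumes version strings always follow the X.Y.Z-cmk.N pattern, and it assumes any later X.Y.Z release also installs the agent automatically. It takes the version string as an argument rather than querying the cluster.

```shell
# Hypothetical helper: succeed if the given CMK version is 1.33.4-cmk.31 or later.
# Assumes versions look like X.Y.Z-cmk.N (for example 1.33.4-cmk.31).
cmk_agent_auto_installed() {
  local version="${1:-}" min_base="1.33.4" min_build=31 base build
  case "$version" in
    *-cmk.*) ;;
    *) return 2 ;;                      # unexpected format
  esac
  base="${version%%-cmk.*}"             # e.g. 1.33.4
  build="${version##*-cmk.}"            # e.g. 31
  if [ "$base" = "$min_base" ]; then
    # Same upstream version: compare the cmk build number
    [ "$build" -ge "$min_build" ]
  else
    # sort -V orders dotted versions numerically; the minimum must sort first
    [ "$(printf '%s\n%s\n' "$min_base" "$base" | sort -V | head -n1)" = "$min_base" ]
  fi
}

# Example usage:
#   if cmk_agent_auto_installed "1.33.4-cmk.30"; then
#     echo "agent installed automatically"
#   else
#     echo "install the agent manually"
#   fi
```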

Virtual Machines

  1. Install the Crusoe Watch Agent — Follow Virtual Machines Metrics for manual installation or Ansible deployment.
  2. Access Command Center — Navigate to Compute > select your VM > Metrics, or navigate to Managed Logs in the left navigation bar.
  3. Explore — View Metrics and Logs for your VM.
Note: Crusoe Watch Agent version 1.0.3 or higher is required on VMs for both managed logs and in-console NVIDIA bug report generation.

Manual Installation for CMK

For clusters running CMK versions earlier than 1.33.4-cmk.31, or if you need to manually install or upgrade the Crusoe Watch Agent, follow the detailed instructions in CMK Metrics - Installing Crusoe Watch Agent.

Disabling or Customizing the Agent

If you prefer not to use the Crusoe Watch Agent, you can uninstall it or customize its behavior.

Uninstalling the Agent

For CMK users:

helm uninstall crusoe-watch-agent -n crusoe-system

For VM users:

To manually delete the Docker container hosting the crusoe-watch-agent:

docker stop crusoe-watch-agent
docker rm crusoe-watch-agent
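
If you script the VM removal, it can help to make it safe to run repeatedly. The following is a minimal sketch, not an official procedure: the function name is hypothetical, and it wraps the same docker stop and docker rm commands shown above, running them only when a container with that exact name exists.

```shell
# Hypothetical helper: stop and remove the agent container only if it exists.
remove_watch_agent_container() {
  local name="${1:-crusoe-watch-agent}"
  # `docker ps -a` also lists stopped containers; grep -Fxq matches the whole name exactly
  if docker ps -a --format '{{.Names}}' | grep -Fxq "$name"; then
    docker stop "$name" >/dev/null && docker rm "$name" >/dev/null && echo "removed $name"
  else
    echo "$name not found; nothing to do"
  fi
}

# Example usage:
#   remove_watch_agent_container
```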

Customizing the Agent

You can configure the agent to collect only specific telemetry (metrics only or logs only).

Create a values.yaml file:

# Collect only metrics (disable logs)
metrics:
  enabled: true
logs:
  enabled: false

To collect only logs:

# Collect only logs (disable metrics)
metrics:
  enabled: false
logs:
  enabled: true

Then upgrade the agent with your custom configuration. For CMK, use helm upgrade:

helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml
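
The customization steps above can also be scripted. The sketch below generates the values file for either mode; the function name and the mode names metrics-only and logs-only are hypothetical, while the metrics.enabled and logs.enabled keys come from the examples above.

```shell
# Hypothetical helper: write a values file that enables only metrics or only logs.
# Usage: write_agent_values metrics-only|logs-only [file]   (file defaults to values.yaml)
write_agent_values() {
  local mode="${1:-}" out="${2:-values.yaml}" metrics logs
  case "$mode" in
    metrics-only) metrics=true;  logs=false ;;
    logs-only)    metrics=false; logs=true  ;;
    *)
      echo "usage: write_agent_values metrics-only|logs-only [file]" >&2
      return 1
      ;;
  esac
  # Nested keys must be indented under their parent for the YAML to be valid
  printf 'metrics:\n  enabled: %s\nlogs:\n  enabled: %s\n' "$metrics" "$logs" > "$out"
}

# Example usage on a CMK cluster:
#   write_agent_values logs-only values.yaml
#   helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
#     --namespace crusoe-system -f values.yaml
```

After the upgrade, helm get values crusoe-watch-agent -n crusoe-system prints the user-supplied values for the release, which is one way to confirm that the override was applied.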

Integration with Crusoe Services

On CMK clusters, Command Center integrates with AutoClusters to detect hardware failures and replace affected nodes automatically. Remediation events appear in Notification Center.

What's Next

  • Topology — Monitor cluster health and utilization in a topology-aware view
  • Metrics — Configure and query infrastructure and custom metrics
  • Logs — Search and filter Kubernetes and system logs
  • Telemetry Relay — Export metrics to external platforms
  • Notification — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks