Command Center

Command Center provides a unified operations platform for your Crusoe GPU clusters, replacing fragmented monitoring tools with centralized observability, automated alerting, and integrated support workflows.

Why Command Center

Large-scale AI workloads require visibility into every resource in your cluster. Command Center delivers real-time telemetry across your infrastructure, so you no longer need to switch between SSH sessions, log dumps, and third-party dashboards.

Key Capabilities

  • View cluster topology — See health and utilization of every node, arranged by network topology.
  • Monitor metrics — Track GPU, CPU, memory, storage, and network performance. Ingest custom application metrics.
  • Access logs — Query Kubernetes pod logs and JournalD system logs without SSH.
  • Export telemetry — Export metrics to Grafana, Datadog, or Splunk via Prometheus-compatible endpoints.
  • Receive alerts — Get notified about hardware failures and cluster events via email, Slack, or webhooks.

Components

Command Center consists of the following components:

Component | Description | Availability
Topology | Visual cluster topology with GPU utilization, CPU utilization, and node health overlays | CMK only (VM support planned)
Metrics | Infrastructure and custom application metrics with Prometheus-compatible API and Crusoe Cloud Console | CMK and VM (custom metrics: CMK only)
Logs | Managed log collection and search for Kubernetes and system logs | CMK only (VM support planned)
Telemetry Relay | Export infrastructure metrics to external observability platforms | CMK and VM

Prerequisites

To use Command Center, you need:

  • Crusoe Cloud account with an active project
  • CMK cluster (version 1.33.4-cmk.22 or higher) with NVIDIA GPU Operator add-on (if using GPU nodes)
  • Crusoe CLI installed and configured
  • kubectl configured with cluster access
  • helm installed

Getting Started

  1. Deploy a CMK cluster — Follow Managing your Clusters if needed.
  2. Install the Crusoe Watch Agent — See Installing the Crusoe Watch Agent below.
  3. Open Command Center — Navigate to Orchestration > select your cluster > Command Center tab.
  4. Explore your cluster — Start with Topology, then drill into Metrics and Logs.

Installing the Crusoe Watch Agent

The Crusoe Watch Agent is a vector.dev-based DaemonSet that deploys one pod per node. It collects infrastructure metrics, logs, and custom application metrics.

Note: Starting with CMK version 1.33.4-cmk.31, the Crusoe Watch Agent (Helm chart version 0.3.7 or higher) is bundled and installed automatically during cluster creation. Each CMK version is associated with a specific Helm chart version. The agent is installed only at cluster creation and is not updated automatically when the CMK version is upgraded. For clusters on earlier CMK versions, follow the manual installation steps below.
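The two version thresholds on this page (1.33.4-cmk.22 as the minimum, 1.33.4-cmk.31 for the bundled agent) can be checked with a small shell helper. The following is a minimal sketch, not an official tool: the function names `version_at_least` and `agent_install_mode` are hypothetical, it assumes CMK version strings order correctly under `sort -V` (GNU coreutils), and you pass in the cluster's CMK version yourself. A "bundled" result applies only to clusters created on that version, since the agent is installed at creation time.

```shell
# Hypothetical helpers; the thresholds come from this page.

version_at_least() {
  # Succeeds when version $1 is greater than or equal to version $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

agent_install_mode() {
  # $1: CMK version of the cluster, for example 1.33.4-cmk.25
  if version_at_least "$1" "1.33.4-cmk.31"; then
    echo "bundled"       # agent is installed when the cluster is created
  elif version_at_least "$1" "1.33.4-cmk.22"; then
    echo "manual"        # supported; install the agent with Helm
  else
    echo "unsupported"   # below the Command Center minimum version
  fi
}

agent_install_mode "1.33.4-cmk.25"   # prints "manual"
```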

Step 1: Set your Kubernetes context

Target the cluster where you want to install the Crusoe Watch Agent:

crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>
kubectl config current-context
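If you script the installation, it may help to stop early when kubectl points at a different cluster than you intended. The guard below is a sketch under assumptions: `require_context` is a hypothetical helper, and the context name you pass in is whatever `kubectl config current-context` printed after you fetched credentials.

```shell
# Hypothetical guard: fail if kubectl is not using the expected context.
require_context() {
  # $1: name of the context you expect to be active
  local current
  current=$(kubectl config current-context) || return 1
  if [ "$current" != "$1" ]; then
    echo "current context is '$current', expected '$1'" >&2
    return 1
  fi
  echo "using context '$current'"
}

# Example (context name is a placeholder):
#   require_context <context-name> && echo "safe to install"
```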

Step 2: Install the agent via Helm

Add the Helm repository and refresh the local index:

helm repo add crusoe-watch-agent https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts
helm repo update

Check the latest agent version:

helm search repo crusoe-watch-agent/crusoe-watch-agent --versions | head -n 2

Install the agent:

helm install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system

To upgrade an existing installation to the latest version:

helm repo update
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
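For automation, the install and upgrade paths can be folded into one idempotent command with `helm upgrade --install`, which installs the release if it is missing and upgrades it otherwise. The wrapper below is a sketch, not part of the Crusoe tooling: `install_or_upgrade_agent` is a hypothetical name, and the optional chart version argument is an assumption about how you might pin a release.

```shell
# Hypothetical wrapper around the repo, install, and upgrade commands.
install_or_upgrade_agent() {
  # $1 (optional): chart version to pin, for example 0.3.7
  helm repo add crusoe-watch-agent \
    https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts || return 1
  helm repo update || return 1
  if [ -n "${1:-}" ]; then
    helm upgrade --install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
      --namespace crusoe-system --version "$1"
  else
    helm upgrade --install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
      --namespace crusoe-system
  fi
}

# Example:
#   install_or_upgrade_agent          # latest chart in the repo
#   install_or_upgrade_agent 0.3.7    # pinned chart version
```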

Step 3: Verify the agent is running

kubectl get pods -n crusoe-system -l app=crusoe-watch-agent

Confirm that one agent pod is running for every node in your cluster.
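On larger clusters, counting pods by eye is error-prone. The sketch below compares the number of nodes with the number of agent pods in the Running phase. It is illustrative only: `check_agent_coverage` and `count_lines` are hypothetical helper names, and a mismatch is not always a fault, for example if some nodes carry taints that the DaemonSet does not tolerate.

```shell
# Hypothetical check: one Running agent pod per node.

count_lines() {
  # Count non-empty lines on stdin; prints 0 for empty input.
  grep -c . || true
}

check_agent_coverage() {
  local nodes pods
  nodes=$(kubectl get nodes --no-headers | count_lines)
  pods=$(kubectl get pods -n crusoe-system -l app=crusoe-watch-agent \
    --field-selector=status.phase=Running --no-headers | count_lines)
  echo "nodes=${nodes} running_agent_pods=${pods}"
  # Succeed only when the two counts match.
  [ "${nodes}" -eq "${pods}" ]
}

# Example:
#   check_agent_coverage || echo "agent is not running on every node"
```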

Disabling or Customizing the Agent

If you prefer not to use the Crusoe Watch Agent, you can uninstall it or customize its behavior.

To uninstall the agent:

helm uninstall crusoe-watch-agent -n crusoe-system

To customize the agent to collect only specific telemetry:

Enable or disable metrics and logs through the Helm values. To collect only metrics, create a values.yaml file:

# Collect only metrics (disable logs)
metrics:
  enabled: true
logs:
  enabled: false

To collect only logs:

# Collect only logs (disable metrics)
metrics:
  enabled: false
logs:
  enabled: true

Then upgrade the agent with your custom configuration:

helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml
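If you switch between the two configurations often, generating the values file avoids indentation mistakes. The helper below is a minimal sketch: `write_agent_values` is a hypothetical name, and it writes only the `metrics.enabled` and `logs.enabled` keys shown above.

```shell
# Hypothetical helper: write a values file for metrics-only or logs-only.
write_agent_values() {
  # $1: "metrics" or "logs"; $2: output path (defaults to values.yaml)
  local metrics logs out
  out=${2:-values.yaml}
  case "$1" in
    metrics) metrics=true;  logs=false ;;
    logs)    metrics=false; logs=true ;;
    *) echo "usage: write_agent_values metrics|logs [file]" >&2; return 1 ;;
  esac
  printf 'metrics:\n  enabled: %s\nlogs:\n  enabled: %s\n' \
    "$metrics" "$logs" > "$out"
}

# Example:
#   write_agent_values logs values.yaml
#   helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
#     --namespace crusoe-system -f values.yaml
#   helm get values crusoe-watch-agent -n crusoe-system   # show applied overrides
```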

Integration with Crusoe Services

Command Center integrates with AutoClusters for automated hardware failure detection and node replacement. Remediation events appear in Notification Center.

What's Next

  • Topology — Monitor cluster health and utilization in a topology-aware view
  • Metrics — Configure and query infrastructure and custom metrics
  • Logs — Search and filter Kubernetes and system logs
  • Telemetry Relay — Export metrics to external platforms
  • Notification — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks