Command Center
Command Center provides a unified operations platform for your Crusoe GPU clusters, replacing fragmented monitoring tools with centralized observability, automated alerting, and integrated support workflows.
Why Command Center
Large-scale AI workloads require visibility into every resource in your cluster. Command Center delivers real-time telemetry across your infrastructure, so you no longer need to switch between SSH sessions, log dumps, and third-party dashboards.
Key Capabilities
- View cluster topology — See health and utilization of every node, arranged by network topology.
- Monitor metrics — Track GPU, CPU, memory, storage, and network performance. Ingest custom application metrics.
- Access logs — Query Kubernetes pod logs and JournalD system logs without SSH.
- Export telemetry — Export metrics to Grafana, Datadog, or Splunk via Prometheus-compatible endpoints.
- Receive alerts — Get notified about hardware failures and cluster events via email, Slack, or webhooks.
Components
Command Center consists of the following components:
| Component | Description | Availability |
|---|---|---|
| Topology | Visual cluster topology with GPU utilization, CPU utilization, and node health overlays | CMK only (VM support planned) |
| Metrics | Infrastructure and custom application metrics, available through a Prometheus-compatible API and the Crusoe Cloud Console | CMK and VM (custom metrics: CMK only) |
| Logs | Managed log collection and search for Kubernetes and system logs | CMK only (VM support planned) |
| Telemetry Relay | Export infrastructure metrics to external observability platforms | CMK and VM |
Prerequisites
To use Command Center, you need:
- Crusoe Cloud account with an active project
- CMK cluster (version 1.33.4-cmk.22 or higher) with NVIDIA GPU Operator add-on (if using GPU nodes)
- Crusoe CLI installed and configured
- kubectl configured with cluster access
- helm installed
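A quick way to confirm the command-line prerequisites is to check that each tool is on your PATH. The following is a minimal sketch, not part of any Crusoe tooling: the function name `missing_tools` is hypothetical, and it assumes the Crusoe CLI binary is named `crusoe`, as in the commands later on this page.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
# Prints nothing and returns 0 when all tools are found; returns 1 otherwise.
# (missing_tools is an illustrative helper, not a Crusoe command.)
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Usage (uncomment to check the tools this page relies on):
# missing_tools crusoe kubectl helm || echo "Install the tools listed above first."
```

This only checks that the binaries exist. It does not verify that the Crusoe CLI is configured or that kubectl can reach your cluster.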
Getting Started
- Deploy a CMK cluster — Follow Managing your Clusters if needed.
- Install the Crusoe Watch Agent — See Installing the Crusoe Watch Agent below.
- Open Command Center — Navigate to Orchestration > select your cluster > Command Center tab.
- Explore your cluster — Start with Topology, then drill into Metrics and Logs.
Installing the Crusoe Watch Agent
The Crusoe Watch Agent is a vector.dev-based DaemonSet that deploys one pod per node. It collects infrastructure metrics, logs, and custom application metrics.
Starting with CMK version 1.33.4-cmk.31, the Crusoe Watch Agent (Helm chart version 0.3.7 or higher) is bundled and automatically installed during cluster creation. Each CMK version is associated with a specific Helm chart version. The agent is installed only at cluster creation; upgrading the CMK version does not update the agent automatically. For clusters on earlier CMK versions, follow the manual installation steps below.
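The two version thresholds on this page (1.33.4-cmk.22 as the minimum, 1.33.4-cmk.31 for the bundled agent) can be checked with a small comparison helper. This is a sketch under stated assumptions: the function names are hypothetical, version strings are assumed to always look like `1.33.4-cmk.22`, and versions are assumed to order by Kubernetes base version first and CMK build number second.

```shell
#!/bin/sh
# Return 0 if CMK version $1 is at or above version $2.
# Assumes both arguments look like "1.33.4-cmk.22" (illustrative helper).
cmk_at_least() {
    have_base=${1%%-cmk.*}; have_build=${1##*-cmk.}
    want_base=${2%%-cmk.*}; want_build=${2##*-cmk.}
    if [ "$have_base" != "$want_base" ]; then
        # Different base versions: sort numerically field by field
        # and check whether the required base is the lower of the two.
        lowest=$(printf '%s\n%s\n' "$have_base" "$want_base" |
            sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
        [ "$lowest" = "$want_base" ]
    else
        [ "$have_build" -ge "$want_build" ]
    fi
}

# Classify the CMK version a cluster was created with.
agent_install_mode() {
    if cmk_at_least "$1" "1.33.4-cmk.31"; then
        echo "bundled"       # agent installed at cluster creation
    elif cmk_at_least "$1" "1.33.4-cmk.22"; then
        echo "manual"        # supported, but install the agent yourself
    else
        echo "unsupported"   # below the Command Center minimum
    fi
}

# Usage:
# agent_install_mode 1.33.4-cmk.25
```

Pass the version the cluster was created with, not the version it runs now: a cluster that was upgraded past 1.33.4-cmk.31 after creation still needs the manual installation.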
Step 1: Set your Kubernetes context
Target the cluster where you want to install the Crusoe Watch Agent:
crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>
kubectl config current-context
Step 2: Install the agent via Helm
Add the Helm repository and refresh the local index:
helm repo add crusoe-watch-agent https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts
helm repo update
Check the latest agent version:
helm search repo crusoe-watch-agent/crusoe-watch-agent --versions | head -n 2
Install the agent:
helm install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
To upgrade an existing installation to the latest version:
helm repo update
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
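If you script this step, the install and upgrade paths can be folded into one idempotent command with `helm upgrade --install`, which installs the release when it is absent and upgrades it otherwise. The sketch below is one possible wrapper, not an official script: the `run` and `install_or_upgrade_agent` function names and the `DRY_RUN` variable are hypothetical, while the repository URL, release name, chart name, and namespace are taken from the commands above.

```shell
#!/bin/sh
REPO_URL="https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add the chart repository, refresh the index, then install or upgrade the
# release. Each step runs only if the previous one succeeded.
install_or_upgrade_agent() {
    run helm repo add crusoe-watch-agent "$REPO_URL" &&
    run helm repo update &&
    run helm upgrade --install crusoe-watch-agent \
        crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
}

# Usage:
# DRY_RUN=1 install_or_upgrade_agent   # preview the commands
# install_or_upgrade_agent             # run them against the current context
```

Previewing with `DRY_RUN=1` first is a cheap way to confirm the release name and namespace before touching the cluster.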
Step 3: Verify the agent is running
kubectl get pods -n crusoe-system -l app=crusoe-watch-agent
Confirm that one agent pod is running for every node in your cluster.
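The one-pod-per-node check can be automated by comparing the node count with the number of agent pods in the Running state. This is a minimal sketch with hypothetical function names. It assumes the default `kubectl get pods` column layout, where STATUS is the third column, and it reuses the namespace and label selector from the command above.

```shell
#!/bin/sh
# Count the lines on stdin whose third field is "Running".
# Intended for `kubectl get pods --no-headers` output, where the
# default columns are NAME READY STATUS RESTARTS AGE.
count_running() {
    awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Compare the node count with the number of Running agent pods.
# Requires kubectl access to the target cluster.
check_agent_coverage() {
    nodes=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
    pods=$(kubectl get pods -n crusoe-system -l app=crusoe-watch-agent \
        --no-headers | count_running)
    echo "nodes=$nodes running_agent_pods=$pods"
    [ "$nodes" -eq "$pods" ]
}

# Usage:
# check_agent_coverage || echo "Some nodes have no running agent pod."
```

A Running status does not guarantee that every container in the pod is ready, so treat a passing check as a first signal and inspect the READY column if telemetry is missing for a node.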
Disabling or Customizing the Agent
If you prefer not to use the Crusoe Watch Agent, you can uninstall it or customize its behavior.
To uninstall the agent:
helm uninstall crusoe-watch-agent -n crusoe-system
To customize the agent to collect only specific telemetry:
Update the Helm values so that the agent collects only metrics or only logs. To collect only metrics, create a values.yaml file:
# Collect only metrics (disable logs)
metrics:
  enabled: true
logs:
  enabled: false
To collect only logs:
# Collect only logs (disable metrics)
metrics:
  enabled: false
logs:
  enabled: true
Then upgrade the agent with your custom configuration:
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml
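If you manage several clusters, generating the values file from a single helper keeps the two variants consistent. The sketch below assumes the `metrics.enabled` and `logs.enabled` keys shown above are nested under their parent keys. The `agent_values` function and its mode names are hypothetical.

```shell
#!/bin/sh
# Print a values.yaml body for the requested collection mode.
# (agent_values and its mode names are illustrative, not a Crusoe command.)
agent_values() {
    case "$1" in
        metrics-only) metrics=true;  logs=false ;;
        logs-only)    metrics=false; logs=true  ;;
        *)
            echo "usage: agent_values metrics-only|logs-only" >&2
            return 2
            ;;
    esac
    printf 'metrics:\n  enabled: %s\nlogs:\n  enabled: %s\n' "$metrics" "$logs"
}

# Usage:
# agent_values metrics-only > values.yaml
# helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent \
#     --namespace crusoe-system -f values.yaml
```

After the upgrade, `helm get values crusoe-watch-agent -n crusoe-system` prints the user-supplied overrides, which is a simple way to confirm that the release picked up the file.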
Integration with Crusoe Services
Command Center integrates with AutoClusters for automated hardware failure detection and node replacement. Remediation events appear in Notification Center.
What's Next
- Topology — Monitor cluster health and utilization in a topology-aware view
- Metrics — Configure and query infrastructure and custom metrics
- Logs — Search and filter Kubernetes and system logs
- Telemetry Relay — Export metrics to external platforms
- Notification — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks