Get started
This page covers the setup steps shared across Command Center features: installing the Crusoe Watch Agent, generating a monitoring token, and the access methods available.
Crusoe Watch Agent
The Crusoe Watch Agent collects telemetry from your infrastructure. It must be installed before metrics, logs, health status, and alerting features become available.
The installation steps differ for CMK clusters and VMs.
CMK clusters: CMK version 1.33.4-cmk.31 and later install the agent automatically at cluster creation. For earlier versions, or to upgrade an existing installation, follow these steps.
Step 1: Switch kubectl context
Make sure the GPU Operator add-on that matches your hardware is enabled on the cluster: the NVIDIA GPU Operator add-on for NVIDIA GPU accelerated instances, or the AMD GPU Operator add-on for AMD GPU accelerated instances. Then switch your kubectl context to the target cluster:
crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>
Step 2: Install or upgrade the agent
Install:
helm repo add crusoe-watch-agent https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts
helm repo update
helm install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
Upgrade:
helm repo update
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system
Step 3: Verify the installation
kubectl get pods -n crusoe-system
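Listing the pods shows their state at a single moment. As a rough sketch, the same check can be scripted so that it blocks until the pods are ready. The function name `wait_for_watch_agent` and the 120-second default timeout are illustrative choices, not part of the Crusoe tooling; `kubectl wait` is a standard kubectl subcommand.

```shell
# Sketch: block until every pod in the agent namespace reports Ready.
# The function name and the default timeout are illustrative assumptions.
# Note: --all also waits for any other pods that live in crusoe-system.
wait_for_watch_agent() {
  timeout="${1:-120s}"
  kubectl wait --for=condition=Ready pods --all \
    --namespace crusoe-system --timeout="$timeout"
}
```

Calling `wait_for_watch_agent 300s` allows more time on large clusters. A non-zero exit status means at least one pod did not become ready within the timeout, or that no pods were found in the namespace.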
Supported Kubernetes versions: 1.32, 1.33, 1.34. Run crusoe kubernetes clusters list-versions to see the latest available patch versions.
VMs: Enable the agent at VM creation time (recommended) or install it after provisioning.
At VM creation:
- Console: Toggle Enable Observability on Step 1 (Instance Details) of the VM creation workflow.
- CLI: Pass --install-watch-agent=true.
- Terraform: Set install_crusoe_watch_agent = true.
A monitoring token is automatically provisioned for the agent when using this method.
Install after VM provisioning: Generate a monitoring token first (see Generate a monitoring token), then use the Ansible Deployment Guide or the manual installation instructions.
Supported images:
- NVIDIA GPU: ubuntu22.04-nvidia-sxm-docker, ubuntu22.04-nvidia-pcie-docker
- AMD GPU: ubuntu22.04 with ROCm 6.2.0 or later
- Non-GPU: ubuntu22.04, ubuntu24.04
CMK Capability by Helm Chart Version
| Capability | Minimum Helm Chart Version |
|---|---|
| NVIDIA GPU metrics | 0.2.6 |
| AMD GPU metrics | 0.2.7 |
| Managed logs (JournalD, kubelet, container) | 0.3.2 |
| Slurm metrics | 0.3.11 |
| Custom pod metrics | 0.3.11 |
| NVIDIA bug reports | 0.3.12 |
| AMD GPU logs | 0.3.19 |
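To check an installed chart against the table above, the versions have to be compared numerically, component by component. The sketch below is one way to do that in a shell script. The capability keys (such as `slurm-metrics`) and the function names are hypothetical labels invented for this example; the version numbers are copied from the table. It assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Sketch: compare a chart version against the minimums in the table above.
# The capability keys are hypothetical labels; the versions come from the table.
min_chart_version() {
  case "$1" in
    nvidia-gpu-metrics) echo 0.2.6 ;;
    amd-gpu-metrics)    echo 0.2.7 ;;
    managed-logs)       echo 0.3.2 ;;
    slurm-metrics)      echo 0.3.11 ;;
    custom-pod-metrics) echo 0.3.11 ;;
    nvidia-bug-reports) echo 0.3.12 ;;
    amd-gpu-logs)       echo 0.3.19 ;;
    *) return 1 ;;
  esac
}

# Succeeds when version $1 is at least version $2.
# Relies on `sort -V` (version sort) to order the two versions.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeeds when chart version $1 is new enough for capability $2.
# Returns 2 when the capability key is not in the table.
chart_supports() {
  required="$(min_chart_version "$2")" || return 2
  version_ge "$1" "$required"
}
```

For example, `chart_supports 0.3.9 slurm-metrics` fails because 0.3.9 precedes 0.3.11, even though a plain string comparison would order them the other way. The installed chart version appears in the CHART column of `helm list --namespace crusoe-system`.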
VM Capability by Agent Version
| Capability | Minimum VM Agent Version |
|---|---|
| NVIDIA GPU metrics | 1.0.0 |
| Managed logs (JournalD) | 1.0.1 |
| NVIDIA bug reports | 1.0.1 |
| AMD GPU metrics | 1.0.3 |
| AMD GPU monitoring and logs | 1.0.5 |
Disabling the Agent
You can uninstall the Crusoe Watch Agent or disable some of its capabilities if needed. Keep in mind that turning it off means Crusoe support will not have visibility into your infrastructure health, utilization, or performance data, which may limit our ability to proactively identify issues or provide the in-depth assistance you need.
CMK (full uninstall):
helm uninstall crusoe-watch-agent -n crusoe-system
VM (full uninstall):
docker stop crusoe-watch-agent
docker rm crusoe-watch-agent
CMK and VM (collect only metrics or only logs): Create a values.yaml:
# Collect only metrics (disable logs)
metrics:
  enabled: true
logs:
  enabled: false
Apply for CMK with:
helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml
Apply for VMs by downloading the installer and re-running it with the same flags, passing the values file:
bash crusoe_watch_agent.sh --values values.yaml
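The values file shown in this section covers the metrics-only case. For the logs-only case, the two flags are presumably inverted; the sketch below writes such a file. It assumes the same `metrics.enabled` and `logs.enabled` keys as the metrics-only example, and the function name and default file name are illustrative.

```shell
# Sketch: write a values file that keeps logs and disables metrics,
# mirroring the keys used in the metrics-only example.
# The function name and the default file name are illustrative.
write_logs_only_values() {
  cat > "${1:-values.yaml}" <<'EOF'
# Collect only logs (disable metrics)
metrics:
  enabled: false
logs:
  enabled: true
EOF
}
```

After running `write_logs_only_values values.yaml`, apply the file with the same CMK or VM command as for the metrics-only case.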
Generate a monitoring token
A monitoring token is required to query metrics or logs via API, import data into Grafana, or use Telemetry Conduit. When the agent is installed at VM creation, a token is automatically provisioned for the agent; you still need to generate one separately to query the API manually.
crusoe monitoring tokens create
Store the token securely. You cannot retrieve it later.
If the token contains special characters such as $, reference it from a Kubernetes Secret using secretKeyRef to avoid parsing errors in Helm deployments.
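As a minimal sketch of that approach, the function below stores the token in a Secret with `kubectl create secret generic`, which is a standard kubectl command. The Secret name `crusoe-monitoring-token`, the key `token`, and the `CRUSOE_MONITORING_TOKEN` environment variable are hypothetical names chosen for this example; the namespace is whichever one your Helm release runs in.

```shell
# Sketch: store the monitoring token in a Kubernetes Secret.
# Assumptions: the token is in the CRUSOE_MONITORING_TOKEN environment
# variable; the Secret name and key are hypothetical.
create_token_secret() {
  # $1 = namespace of the workload that will read the Secret
  kubectl create secret generic crusoe-monitoring-token \
    --namespace "$1" \
    --from-literal=token="$CRUSOE_MONITORING_TOKEN"
}

# When setting the variable by hand, use single quotes so the shell
# does not expand "$" inside the token:
#   export CRUSOE_MONITORING_TOKEN='<token>'
#
# A container spec can then read the Secret (standard Kubernetes syntax):
#   env:
#     - name: CRUSOE_MONITORING_TOKEN
#       valueFrom:
#         secretKeyRef:
#           name: crusoe-monitoring-token
#           key: token
```

Because the token never appears in the Helm values or templates, characters such as `$` reach the container unchanged.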
Access Methods
Console
To access metrics, logs, and health status directly in the Crusoe Cloud Console:
- Metrics and health status: Navigate to Command Center in the left navigation, and then select Infrastructure Overview.
- Project-level logs: Navigate to Command Center in the left navigation, and then select Managed Logs.
- Resource-specific metrics and logs: Open the Metrics or Logs tab of a specific VM under Compute, or of a CMK cluster under Orchestration.
API
Metrics (PromQL): Query metrics via the Prometheus-compatible endpoint:
https://api.cloud.crusoe.ai/v1/projects/<project-id>/metrics/timeseries
Logs (LogsQL): Query logs via:
https://api.crusoecloud.com/v1/projects/<project-id>/logs
Authenticate with your monitoring token as a Bearer token. See Logs for the full endpoint reference.
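A query against the metrics endpoint might look like the sketch below. It assumes the token is stored in a `CRUSOE_MONITORING_TOKEN` environment variable (a hypothetical name) and that the endpoint takes the PromQL expression in a `query` parameter, as Prometheus's own HTTP API does; this page does not state the parameter names, so confirm them in the endpoint reference. The host is copied from the metrics URL above.

```shell
# Sketch: run a PromQL query against the metrics endpoint shown above.
# Assumptions: the token is in CRUSOE_MONITORING_TOKEN, and the endpoint
# accepts the expression in a "query" parameter (Prometheus convention).
metrics_url() {
  # $1 = project ID
  printf 'https://api.cloud.crusoe.ai/v1/projects/%s/metrics/timeseries' "$1"
}

query_metrics() {
  # $1 = project ID, $2 = PromQL expression
  curl --silent --show-error --fail --get \
    --header "Authorization: Bearer $CRUSOE_MONITORING_TOKEN" \
    --data-urlencode "query=$2" \
    "$(metrics_url "$1")"
}
```

For example, `query_metrics <project-id> 'up'` sends the placeholder expression `up`. The `--data-urlencode` option takes care of escaping braces, quotes, and spaces in longer PromQL expressions.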
Grafana
Add a Prometheus data source to your Grafana instance pointing to the metrics API endpoint, with an Authorization: Bearer <token> HTTP header.
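If you create data sources through Grafana's HTTP API rather than the UI, the request body could be generated as in the sketch below. The `httpHeaderName1` and `httpHeaderValue1` fields are Grafana's mechanism for custom HTTP headers on a data source; the data source name `Crusoe Metrics` and the function name are hypothetical. The token is inserted without JSON escaping, so this assumes it contains no double quotes or backslashes.

```shell
# Sketch: print a JSON body for Grafana's "create data source" HTTP API
# (POST /api/datasources). The data source name is a hypothetical choice.
# Assumes the token contains no double quotes or backslashes, since it is
# inserted into the JSON without escaping.
grafana_datasource_json() {
  # $1 = metrics endpoint URL, $2 = monitoring token
  cat <<EOF
{
  "name": "Crusoe Metrics",
  "type": "prometheus",
  "access": "proxy",
  "url": "$1",
  "jsonData": { "httpHeaderName1": "Authorization" },
  "secureJsonData": { "httpHeaderValue1": "Bearer $2" }
}
EOF
}
```

The output can be sent to your Grafana instance's `/api/datasources` endpoint with a JSON content type and Grafana credentials. Keeping the header value under `secureJsonData` means Grafana stores it encrypted instead of exposing it in the data source settings.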
Pre-built Grafana dashboard templates for CMK and Managed Slurm clusters are available in the Crusoe solutions library. Templates cover GPU utilization, InfiniBand fabric health, power draw, XID error tracking, Slurm job performance, storage, and network.
Telemetry Conduit
Export metrics continuously to Grafana, Datadog, or Splunk via a Prometheus-compatible scraping endpoint. See Telemetry Conduit for setup.
Crusoe MCP
You can access metrics and logs data collected by the Command Center via Crusoe MCP. See the Crusoe MCP page for setup instructions.