Get started

This page covers the setup steps shared across Command Center features: installing the Crusoe Watch Agent, generating a monitoring token, and the access methods available.

Crusoe Watch Agent

The Crusoe Watch Agent collects telemetry from your infrastructure. It must be installed before metrics, logs, health status, and alerting features become available.

CMK versions 1.33.4-cmk.31 and later install the agent automatically at cluster creation. For earlier versions, or to upgrade an existing installation, follow these steps.

Step 1: Switch kubectl context

Make sure the GPU Operator add-on that matches your hardware is enabled on the cluster: the NVIDIA GPU Operator add-on for NVIDIA GPU accelerated instances, or the AMD GPU Operator add-on for AMD GPU accelerated instances. Then switch your kubectl context to the target cluster:

crusoe kubernetes clusters get-credentials <cluster-name> --project-id <project-id>

Step 2: Install or upgrade the agent

helm repo add crusoe-watch-agent https://crusoecloud.github.io/crusoe-watch-agent/k8s/helm-charts
helm repo update
helm upgrade --install crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system

Step 3: Verify the installation

kubectl get pods -n crusoe-system
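
As a rough sketch of how the verification step could be scripted, the helper below lists pods whose status is neither Running nor Completed. It assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name unhealthy_pods is made up for illustration and is not part of any Crusoe tooling.

```shell
# Sketch: list pods in a namespace that are neither Running nor Completed.
# Assumes the default `kubectl get pods` columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  namespace="${1:-crusoe-system}"
  # --no-headers drops the header row; column 3 is STATUS.
  kubectl get pods -n "$namespace" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Usage:
#   bad="$(unhealthy_pods)"
#   if [ -z "$bad" ]; then echo "all pods healthy"; else echo "check: $bad"; fi
```

An empty result suggests every pod in the namespace has started; anything listed is worth inspecting with kubectl describe pod.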

Supported Kubernetes versions: 1.32, 1.33, 1.34. Run crusoe kubernetes clusters list-versions to see the latest available patch versions.

CMK Capability by Helm Chart Version

Capability                                   Minimum Helm Chart Version
NVIDIA GPU metrics                           0.2.6
AMD GPU metrics                              0.2.7
Managed logs (JournalD, kubelet, container)  0.3.2
Slurm metrics                                0.3.11
Custom pod metrics                           0.3.11
NVIDIA bug reports                           0.3.12
AMD GPU logs                                 0.3.19
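
Chart versions must be compared numerically, not alphabetically: 0.3.11 is newer than 0.3.2. The sketch below shows one way to check an installed chart version against a minimum from the table. It assumes a sort that supports -V (GNU coreutils); the helper name version_at_least is hypothetical. The installed chart version appears in the CHART column of helm list -n crusoe-system.

```shell
# version_at_least INSTALLED REQUIRED
# Succeeds (exit 0) when INSTALLED >= REQUIRED, comparing dotted versions
# numerically with `sort -V` (assumed available, e.g. GNU coreutils).
version_at_least() {
  installed="$1"
  required="$2"
  # The lower of the two versions sorts first; if that is REQUIRED,
  # then INSTALLED is equal to or newer than it.
  lowest="$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n 1)"
  [ "$lowest" = "$required" ]
}

# Example: chart 0.3.2 is enough for managed logs but not for Slurm metrics.
if version_at_least "0.3.2" "0.3.11"; then
  echo "Slurm metrics supported"
else
  echo "Slurm metrics need chart 0.3.11 or later"
fi
```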

VM Capability by Agent Version

Capability                   Minimum VM Agent Version
NVIDIA GPU metrics           1.0.0
Managed logs (JournalD)      1.0.1
NVIDIA bug reports           1.0.1
AMD GPU metrics              1.0.3
AMD GPU monitoring and logs  1.0.5

Disabling the Agent

You can uninstall the Crusoe Watch Agent or disable some of its capabilities if needed. Keep in mind that turning it off means Crusoe support will not have visibility into your infrastructure health, utilization, or performance data, which may limit our ability to proactively identify issues or provide the in-depth assistance you need.

CMK (full uninstall):

helm uninstall crusoe-watch-agent -n crusoe-system

VM (full uninstall):

docker stop crusoe-watch-agent
docker rm crusoe-watch-agent

CMK and VM (collect only metrics or only logs): Create a values.yaml:

# Collect only metrics (disable logs)
metrics:
  enabled: true
logs:
  enabled: false

Apply for CMK with:

helm upgrade crusoe-watch-agent crusoe-watch-agent/crusoe-watch-agent --namespace crusoe-system -f values.yaml

Apply for VMs by downloading the installer and re-running it with the same flags you used at installation, passing the values file:

bash crusoe_watch_agent.sh --values values.yaml

Generate a monitoring token

A monitoring token is required to query metrics or logs via API, import data into Grafana, or use Telemetry Conduit. When the agent is installed at VM creation, a token is automatically provisioned for the agent; you still need to generate one separately to query the API manually.

crusoe monitoring tokens create

Store the token securely. You cannot retrieve it later.

Note: If the token contains special characters such as $, reference it from a Kubernetes Secret using secretKeyRef to avoid parsing errors in Helm deployments.
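
One way to get the token into a Secret that secretKeyRef can point at is sketched below. The Secret name crusoe-monitoring-token, the key token, and the function name are hypothetical placeholders, so substitute whatever your Helm values expect.

```shell
# Sketch: store a monitoring token in a Kubernetes Secret.
# The Secret name, key, and default namespace are hypothetical placeholders.
create_token_secret() {
  token="$1"
  namespace="${2:-default}"
  # Passing the token as a quoted variable stops the shell from
  # re-expanding any "$" inside it.
  kubectl create secret generic crusoe-monitoring-token \
    --namespace "$namespace" \
    --from-literal=token="$token"
}

# Usage (single quotes keep "$" in the literal token from being expanded):
#   create_token_secret 'abc$def' monitoring
```

A literal passed on the command line can end up in shell history; reading the token from a file with --from-file is an alternative if that is a concern.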

Access Methods

Console

To access metrics, logs, and health status directly in the Crusoe Cloud Console:

  • Metrics and health status: Navigate to Command Center in the left navigation, and then select Infrastructure Overview.
  • Project-level logs: Navigate to Command Center in the left navigation, and then select Managed Logs.
  • Resource-specific metrics and logs: Use a specific VM's Metrics or Logs tab under Compute, or a CMK cluster's tab under Orchestration.

API

Metrics (PromQL): Query metrics via the Prometheus-compatible endpoint:

https://api.cloud.crusoe.ai/v1/projects/<project-id>/metrics/timeseries

Logs (LogsQL): Query logs via:

https://api.crusoecloud.com/v1/projects/<project-id>/logs

Authenticate with your monitoring token as a Bearer token. See Logs for the full endpoint reference.
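
As an illustration of Bearer authentication against the metrics endpoint, the sketch below sends an instant PromQL query with curl. It rests on an assumption not confirmed on this page: that the endpoint serves the standard Prometheus HTTP API path /api/v1/query beneath the base URL. The function name and the CRUSOE_MONITORING_TOKEN variable are also hypothetical, and up is only a placeholder query.

```shell
# Sketch: run an instant PromQL query against the metrics endpoint.
# Assumption: the standard Prometheus path /api/v1/query is served beneath
# the base URL; confirm this against the API reference before relying on it.
query_metrics() {
  project_id="$1"
  promql="$2"
  base="https://api.cloud.crusoe.ai/v1/projects/${project_id}/metrics/timeseries"
  # -G sends the URL-encoded query as GET parameters.
  curl --silent --show-error --fail -G \
    -H "Authorization: Bearer ${CRUSOE_MONITORING_TOKEN}" \
    --data-urlencode "query=${promql}" \
    "${base}/api/v1/query"
}

# Usage (CRUSOE_MONITORING_TOKEN is a hypothetical variable name):
#   export CRUSOE_MONITORING_TOKEN='<token>'
#   query_metrics "<project-id>" 'up'
```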

Grafana

Add a Prometheus data source to your Grafana instance pointing to the metrics API endpoint, with an Authorization: Bearer <token> HTTP header.
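
If you prefer to script the data source rather than add it in the Grafana UI, the sketch below builds a request body for Grafana's data source HTTP API (POST /api/datasources), using the httpHeaderName1 and httpHeaderValue1 fields Grafana provides for custom headers. The data source name, the function name, and the GRAFANA_URL and GRAFANA_API_TOKEN variables are hypothetical, and the sketch assumes the token contains no double quotes or backslashes.

```shell
# Sketch: build the JSON body for Grafana's "create data source" API.
# The data source name is a hypothetical placeholder.
# Assumes the project ID and token contain no double quotes or backslashes.
grafana_datasource_payload() {
  project_id="$1"
  token="$2"
  cat <<EOF
{
  "name": "Crusoe Metrics",
  "type": "prometheus",
  "access": "proxy",
  "url": "https://api.cloud.crusoe.ai/v1/projects/${project_id}/metrics/timeseries",
  "jsonData": { "httpHeaderName1": "Authorization" },
  "secureJsonData": { "httpHeaderValue1": "Bearer ${token}" }
}
EOF
}

# Usage (GRAFANA_URL and GRAFANA_API_TOKEN are hypothetical variables):
#   grafana_datasource_payload "<project-id>" "$CRUSOE_MONITORING_TOKEN" \
#     | curl --fail -X POST "$GRAFANA_URL/api/datasources" \
#         -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
#         -H "Content-Type: application/json" --data @-
```

Placing the header value under secureJsonData means Grafana stores it encrypted rather than returning it in plain text from the API.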

Pre-built Grafana dashboard templates for CMK and Managed Slurm clusters are available in the Crusoe solutions library. Templates cover GPU utilization, InfiniBand fabric health, power draw, XID error tracking, Slurm job performance, storage, and network.

Telemetry Conduit

Export metrics continuously to Grafana, Datadog, or Splunk via a Prometheus-compatible scraping endpoint. See Telemetry Conduit for setup.

Crusoe MCP

You can access metric and log data collected by Command Center via Crusoe MCP. See the Crusoe MCP page for setup instructions.