Logs

note

Managed Logs is in production-ready preview. Please reach out to Crusoe Cloud Support to learn more.

System logs are collected automatically and available in the Console. No SSH or manual log aggregation required. The Crusoe Watch Agent collects logs, which you can search, filter, and inspect directly in the Console.

Managed Logs is available for both Crusoe Managed Kubernetes (CMK) clusters and Crusoe Virtual Machines (VMs).

Prerequisites

To use Logs, you need:

For CMK clusters:

For VMs:

Log Sources

The Crusoe Watch Agent collects the following log types:

JournalD Logs

System-level logs from journald on each CMK node or VM.

Log Source        | Description                                                                 | Availability
Kubelet           | Kubernetes node agent logs — pod lifecycle, volume mounts, node status      | CMK only
Kernel logs       | Kernel messages including GPU XID errors, OOM events, and hardware failures | CMK and VM
System logs       | System-level service logs and events                                        | CMK and VM
Container runtime | Container start, stop, and error events                                     | CMK only

Accessing Logs Using Console UI

You can access logs in the Console UI in two ways:

  1. Managed Logs page — Navigate to Managed Logs in the left navigation bar to search logs across all your CMK clusters and VMs in a unified view.
  2. Resource-specific view — Navigate to Orchestration > select your cluster > Logs tab.

Searching and Filtering

You can use the following filters to narrow your log search:

Filter        | Description
Instance name | Filter logs by specific node or VM name
Log source    | Select the log type: Kubernetes pods or JournalD
Severity      | Filter by log severity level (see severity levels below)
Time window   | Specify a start and end time to narrow results
Text search   | Search log content using basic text matching

Combine multiple filters to narrow results. For example, search for XID errors in dmesg logs from a specific node within the last 24 hours.

Log Severity Levels

Logs are normalized to a five-level severity scale:

Severity | Description
Critical | Application cannot continue; requires immediate intervention
Error    | Error handled, service continues
Warning  | Unexpected situation, but handled gracefully
Info     | Normal operational events (startup, shutdown, config changes)
Debug    | Detailed diagnostic information
note

RFC 5424 severity levels (Emergency, Alert, Critical) map to Critical, while Notice and Informational map to Info. Some log sources (JournalD) assign severity based on output stream: stdout maps to Info, stderr to a higher severity. The original severity is preserved alongside the normalized value.
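The mapping described above can be sketched as a small lookup. This is an illustration of the documented normalization rules, not the Watch Agent's actual code:

```shell
# Collapse RFC 5424 severity names to the five normalized tiers
# described in the note above. Unrecognized input defaults to Info,
# mirroring how stdout lines are treated.
normalize() {
  case "$1" in
    emergency|alert|critical)  echo "Critical" ;;
    error)                     echo "Error" ;;
    warning)                   echo "Warning" ;;
    notice|informational|info) echo "Info" ;;
    debug)                     echo "Debug" ;;
    *)                         echo "Info" ;;
  esac
}

normalize alert    # → Critical
normalize notice   # → Info
```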

Querying Logs via API

You can programmatically query logs using the LogsQL API endpoint. This allows you to retrieve logs for automation, integrate with external tools, or perform custom analysis.

API Endpoint

https://api.crusoecloud.com/v1alpha5/projects/<project-id>/logs/query

Authentication

Use the same monitoring token generated for metrics access (see Virtual Machines Metrics or CMK Metrics).
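The examples below read the project ID, token, and VM ID from shell variables. One way to set them (the values here are placeholders, not real identifiers):

```shell
# Placeholder values for the query examples that follow; substitute
# your own project ID, monitoring token, and VM ID.
export project_id="your-project-id"
export monitoringtoken="your-monitoring-token"
export vm_id="your-vm-id"
```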

Examples

Query logs for a specific VM:

curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
-H "Authorization: Bearer $monitoringtoken" \
--data-urlencode "query=crusoe_vm_id:$vm_id"

Limit results to 10 entries:

curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
-H "Authorization: Bearer $monitoringtoken" \
--data-urlencode "query=crusoe_vm_id:$vm_id" \
--data-urlencode "limit=10"

Search for error logs in a specific VM:

curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
-H "Authorization: Bearer $monitoringtoken" \
--data-urlencode "query=crusoe_vm_id:$vm_id AND error"
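Queries can also be bounded to a time window server-side. A hedged sketch: the `_time:24h` filter shown here is standard LogsQL syntax, and whether Crusoe's endpoint accepts it is an assumption worth verifying against the LogsQL reference:

```shell
# Search the last 24 hours of a VM's logs for "error". Assumes the
# shell variables from the examples above are set; _time:24h is an
# assumption based on standard LogsQL syntax.
query="crusoe_vm_id:$vm_id AND error AND _time:24h"

# Only issue the request when a token is actually available.
if [ -n "${monitoringtoken:-}" ]; then
  curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
    -H "Authorization: Bearer $monitoringtoken" \
    --data-urlencode "query=$query"
fi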

Log Retention

Logs are retained for 7 days, after which they are automatically purged.
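Because of the 7-day window, logs you may need later should be pulled out before they expire. A minimal archiving sketch, assuming the shell variables from the examples above:

```shell
# Archive a VM's current logs to a dated local file ahead of the
# 7-day purge. Skips the request when no token is set.
archive_file="crusoe-logs-$(date +%Y-%m-%d).json"

if [ -n "${monitoringtoken:-}" ]; then
  curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
    -H "Authorization: Bearer $monitoringtoken" \
    --data-urlencode "query=crusoe_vm_id:$vm_id" \
    -o "$archive_file"
fi
```

Running this on a daily schedule (e.g. via cron) keeps a rolling local copy without ever hitting the retention cutoff.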

Common Troubleshooting Workflows

Investigating a GPU Hardware Failure

  1. In Topology, identify the unhealthy node.
  2. Click the node and select Generate NVIDIA Bug Report.
  3. Navigate to Logs and filter by node instance name.
  4. Set the log source to dmesg and search for XID.
  5. Review the kernel logs alongside the NVIDIA bug report.
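Steps 3 and 4 have an API equivalent. A sketch using the `crusoe_vm_id` field and text matching from the earlier examples (XID is the same keyword the Console workflow searches for):

```shell
# Search one node's logs for GPU XID errors via the API. Uses the
# placeholder credentials and VM ID from the earlier examples.
query="crusoe_vm_id:$vm_id AND XID"

if [ -n "${monitoringtoken:-}" ]; then
  curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
    -H "Authorization: Bearer $monitoringtoken" \
    --data-urlencode "query=$query"
fi
```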

Debugging a Failing Training Job

  1. Navigate to Logs and filter by Kubernetes pods.
  2. Search for the pod name or namespace.
  3. Review stdout and stderr for CUDA errors, NCCL timeouts, or application failures.
  4. Cross-reference with Metrics for GPU utilization drops or memory pressure.
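The same triage can be scripted. A sketch using plain text matching and the `limit` parameter from the examples above; the pod name is a hypothetical placeholder:

```shell
# Pull the most recent log lines that mention a pod and "error".
# pod_name is a placeholder, not a real workload.
pod_name="my-training-pod"

if [ -n "${monitoringtoken:-}" ]; then
  curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
    -H "Authorization: Bearer $monitoringtoken" \
    --data-urlencode "query=$pod_name AND error" \
    --data-urlencode "limit=10"
fi
```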

Diagnosing Storage Mount Issues

  1. Navigate to Logs and filter by node instance name.
  2. Set the log source to JournalD (Kubelet).
  3. Search for mount errors: MountVolume, nfs.
  4. Check for filesystem errors, RAID issues, or NFS connectivity problems.
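Step 3's two search terms can be combined into one query. A sketch assuming the endpoint accepts LogsQL's OR operator and parenthesized grouping (worth confirming against the LogsQL reference):

```shell
# Search one node's logs for either mount-related term in a single
# request. OR grouping is an assumption based on LogsQL syntax.
query="crusoe_vm_id:$vm_id AND (MountVolume OR nfs)"

if [ -n "${monitoringtoken:-}" ]; then
  curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
    -H "Authorization: Bearer $monitoringtoken" \
    --data-urlencode "query=$query"
fi
```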

What's Next

  • Topology — Identify unhealthy nodes and run diagnostics
  • Metrics — Correlate log events with performance data
  • Notifications — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks