
Logs

note

The Managed Logs service for CMK is production-ready and currently in preview. Contact Crusoe Cloud Support if you are interested in enabling this feature.

Kubernetes pod logs and system logs are collected automatically by the Crusoe Watch Agent and made available in the Console; no SSH or manual log aggregation is required. You can search, filter, and inspect logs directly in the Console.

Managed Logs is available for Crusoe Managed Kubernetes (CMK) and will be available for Crusoe Virtual Machines (VMs) in the future.

Prerequisites

To use Logs, you need:

  • A Crusoe Managed Kubernetes (CMK) cluster
  • The Managed Logs preview feature enabled for your organization (contact Crusoe Cloud Support)

Log Sources

The Crusoe Watch Agent collects the following log types:

Kubernetes Pod Logs

stdout and stderr from pods. Includes application logs from training/inference workloads and system pod output.
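Because the agent collects whatever a pod writes to its standard streams, workloads do not need any agent-specific configuration; logging to stdout and stderr is enough. The sketch below is only an illustration of that idea in Python (the logger name and messages are made up, not required settings).

```python
import logging
import sys

# Minimal illustration: anything a containerized workload writes to stdout or
# stderr is picked up as a Kubernetes pod log. The handler, level, and format
# here are just an example layout, not a required configuration.
logging.basicConfig(
    stream=sys.stdout,                       # routine events go to stdout
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("trainer")           # hypothetical application logger

log.info("epoch 3 started")                              # collected as pod stdout
print("checkpoint saved", file=sys.stdout)                # collected as pod stdout
print("CUDA out of memory on rank 2", file=sys.stderr)    # collected as pod stderr
```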

JournalD Logs

System-level logs from journald on each node:

Log Source            Description
Kubelet               Kubernetes node agent logs — pod lifecycle, volume mounts, node status
dmesg                 Kernel messages including GPU XID errors, OOM events, and hardware failures
Container runtime     Container start, stop, and error events

Accessing Logs

To view logs, navigate to Orchestration, select your cluster, and open the Logs sub-tab.

Searching and Filtering

You can use the following filters to narrow your log search:

Filter           Description
Instance name    Filter logs by specific node or VM name
Log source       Select the log type: Kubernetes pods or JournalD
Severity         Filter by log severity level (see severity levels below)
Time window      Specify a start and end time to narrow results
Text search      Search log content using basic text matching

Combine multiple filters to narrow results. For example, search for XID errors in dmesg logs from a specific node within the last 24 hours.
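The Console applies these filters for you; the sketch below is only a conceptual illustration of how the five filters compose, written against a hypothetical in-memory record type (the LogRecord fields are assumptions, not a Crusoe API).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical record type used only to illustrate how the filters combine.
@dataclass
class LogRecord:
    instance: str       # node or VM name
    source: str         # "Kubernetes pods" or "JournalD"
    severity: str       # Critical, Error, Warning, Info, Debug
    timestamp: datetime
    message: str

def matches(rec, *, instance, source, severity, start, end, text):
    """All five filters must hold for a record to be returned."""
    return (
        rec.instance == instance
        and rec.source == source
        and rec.severity == severity
        and start <= rec.timestamp <= end
        and text.lower() in rec.message.lower()
    )

now = datetime.now(timezone.utc)
records = [
    LogRecord("gpu-node-7", "JournalD", "Error", now - timedelta(hours=2),
              "placeholder kernel message mentioning Xid"),
]

# Example from above: XID errors in dmesg (JournalD) logs from one node, last 24 hours.
hits = [r for r in records if matches(
    r, instance="gpu-node-7", source="JournalD", severity="Error",
    start=now - timedelta(hours=24), end=now, text="xid",
)]
print(len(hits))  # -> 1
```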

Log Severity Levels

Logs are normalized to a 5-tiered severity taxonomy:

Severity    Description
Critical    Application cannot continue; requires immediate intervention
Error       Error handled, service continues
Warning     Unexpected situation, but handled gracefully
Info        Normal operational events (startup, shutdown, config changes)
Debug       Detailed diagnostic information
note

RFC 5424 severity levels (Emergency, Alert, Critical) map to Critical. Notice and Informational map to Info. Some log sources, such as JournalD, assign severity based on the output stream: stdout maps to Info and stderr maps to a higher severity. The original severity is preserved alongside the normalized value.
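As a concrete reading of that mapping, the sketch below encodes the collapse from RFC 5424 keywords into the five normalized levels while keeping the original value. The function and table names are illustrative, not part of any Crusoe tooling, and the Info fallback for unknown inputs is an assumption.

```python
# RFC 5424 keyword -> normalized 5-level severity, per the mapping in the note above.
RFC5424_TO_NORMALIZED = {
    "emergency": "Critical",
    "alert": "Critical",
    "critical": "Critical",
    "error": "Error",
    "warning": "Warning",
    "notice": "Info",
    "informational": "Info",
    "debug": "Debug",
}

def normalize(original_severity: str) -> dict:
    """Return the normalized severity while preserving the original value."""
    # Falling back to Info for unrecognized inputs is an assumption, not documented behavior.
    normalized = RFC5424_TO_NORMALIZED.get(original_severity.lower(), "Info")
    return {"severity": normalized, "original_severity": original_severity}

print(normalize("Alert"))   # {'severity': 'Critical', 'original_severity': 'Alert'}
print(normalize("Notice"))  # {'severity': 'Info', 'original_severity': 'Notice'}
```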

Log Retention

Logs are retained for 7 days, after which they are automatically purged.
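In practice this means the earliest timestamp you can query is roughly seven days in the past, as the small sketch below illustrates (assuming UTC timestamps).

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)

# Earliest timestamp still queryable under the 7-day retention policy.
earliest_available = datetime.now(timezone.utc) - RETENTION
print(f"Logs older than {earliest_available.isoformat()} have been purged.")
```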

Common Troubleshooting Workflows

Investigating a GPU Hardware Failure

  1. In Topology, identify the unhealthy node.
  2. Click the node and select Generate NVIDIA Bug Report.
  3. Navigate to Logs and filter by node instance name.
  4. Set the log source to dmesg and search for XID.
  5. Review the kernel logs alongside the NVIDIA bug report.
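If you copy the matching kernel log lines out of the Console for offline review, a short script can pull the XID codes out for comparison against NVIDIA's XID documentation. The sketch below assumes the usual "NVRM: Xid (<bus id>): <code>, ..." message format and uses a placeholder sample line.

```python
import re

# Sketch: extract XID codes from kernel log lines copied out of the Console.
# Assumes the usual 'NVRM: Xid (<bus id>): <code>, ...' message format.
XID_PATTERN = re.compile(r"Xid \(([^)]+)\): (\d+)")

def extract_xids(dmesg_lines):
    """Yield (pci_bus_id, xid_code) tuples from kernel log lines."""
    for line in dmesg_lines:
        match = XID_PATTERN.search(line)
        if match:
            yield match.group(1), int(match.group(2))

# Placeholder sample line for illustration only.
sample = ["[12345.678] NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus"]
for bus_id, code in extract_xids(sample):
    print(f"GPU {bus_id} reported XID {code}")
```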

Debugging a Failing Training Job

  1. Navigate to Logs and filter by Kubernetes pods.
  2. Search for the pod name or namespace.
  3. Review stdout and stderr for CUDA errors, NCCL timeouts, or application failures.
  4. Cross-reference with Metrics for GPU utilization drops or memory pressure.
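When a job produces a large volume of pod logs, it can help to tally common failure signatures in text copied or exported from the Console. The sketch below is one way to do that; the pattern list is an example, not an exhaustive set of failure modes.

```python
from collections import Counter
import re

# Sketch: tally common failure signatures in pod log text copied from the Console.
FAILURE_PATTERNS = {
    "cuda_oom": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "cuda_error": re.compile(r"CUDA error", re.IGNORECASE),
    "nccl_timeout": re.compile(r"NCCL.*(timeout|timed out)", re.IGNORECASE),
}

def summarize_failures(log_text: str) -> Counter:
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Placeholder log text for illustration only.
print(summarize_failures("rank3: NCCL watchdog timeout\nCUDA error: device lost"))
# Counter({'nccl_timeout': 1, 'cuda_error': 1})
```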

Diagnosing Storage Mount Issues

  1. Navigate to Logs and filter by node instance name.
  2. Set the log source to JournalD (Kubelet).
  3. Search for mount errors: MountVolume, nfs.
  4. Check for filesystem errors, RAID issues, or NFS connectivity problems.
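If several volumes are mounted on the node, grouping kubelet mount failures by volume name makes the failing mount stand out. The sketch below assumes kubelet's usual 'MountVolume.SetUp failed for volume "<name>"' phrasing and uses a placeholder sample line.

```python
import re
from collections import Counter

# Sketch: group kubelet mount failures by volume name, using log lines copied
# from the Console. Assumes kubelet's usual
# 'MountVolume.SetUp failed for volume "<name>"' phrasing.
MOUNT_ERROR = re.compile(r'MountVolume\.\w+ failed for volume "([^"]+)"')

def failing_volumes(kubelet_lines):
    counts = Counter()
    for line in kubelet_lines:
        match = MOUNT_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Placeholder sample line for illustration only.
sample = ['MountVolume.SetUp failed for volume "shared-datasets" : mount failed: exit status 32']
print(failing_volumes(sample))  # Counter({'shared-datasets': 1})
```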

What's Next

  • Topology — Identify unhealthy nodes and run diagnostics
  • Metrics — Correlate log events with performance data
  • Notifications — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks