Logs
The Managed Logs service for CMK is currently in preview. Contact Crusoe Cloud Support if you are interested in enabling this feature.
Kubernetes pod logs and system logs are collected automatically by the Crusoe Watch Agent and made available in the Console, where you can search, filter, and inspect them. No SSH or manual log aggregation is required.
Managed Logs is available for Crusoe Managed Kubernetes (CMK); support for Crusoe Virtual Machines (VMs) is planned for the future.
Prerequisites
To use Logs, you need:
- A CMK cluster with the Crusoe Watch Agent installed (see Installing the Crusoe Watch Agent)
- The NVIDIA GPU Operator add-on (if using GPU nodes)
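If you want to confirm the agent is actually running before you expect logs to show up, a minimal check with the Kubernetes Python client looks like the sketch below. The `crusoe-watch` namespace and the label selector are placeholders, not documented values, so substitute whatever your Watch Agent installation uses.

```python
# Sanity check: confirm the Crusoe Watch Agent pods are Running.
# NOTE: the namespace and label selector are placeholders -- substitute
# the values used by your Watch Agent installation.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

NAMESPACE = "crusoe-watch"                            # placeholder
LABEL = "app.kubernetes.io/name=crusoe-watch-agent"   # placeholder

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL)
if not pods.items:
    print(f"No agent pods found in '{NAMESPACE}' matching '{LABEL}'")
for pod in pods.items:
    print(f"{pod.metadata.name}: {pod.status.phase}")
```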
Log Sources
The Crusoe Watch Agent collects the following log types:
Kubernetes Pod Logs
stdout and stderr from pods, including application logs from training and inference workloads as well as system pod output.
JournalD Logs
System-level logs from journald on each node:
| Log Source | Description |
|---|---|
| Kubelet | Kubernetes node agent logs — pod lifecycle, volume mounts, node status |
| dmesg | Kernel messages including GPU XID errors, OOM events, and hardware failures |
| Container runtime | Container start, stop, and error events |
Accessing Logs
To view logs, navigate to Orchestration, select your cluster, and open the Logs sub-tab.
Searching and Filtering
You can use the following filters to narrow your log search:
| Filter | Description |
|---|---|
| Instance name | Filter logs by specific node or VM name |
| Log source | Select the log type: Kubernetes pods or JournalD |
| Severity | Filter by log severity level (see severity levels below) |
| Time window | Specify a start and end time to narrow results |
| Text search | Search log content using basic text matching |
Combine multiple filters to narrow results. For example, search for XID errors in dmesg logs from a specific node within the last 24 hours.
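Conceptually, the filters combine as a logical AND over each log record. The sketch below applies that same combination to a list of exported records with hypothetical field names; it only illustrates how the filters compose and is not a Crusoe API.

```python
# Illustrative only: Console filters combine as a logical AND.
# The record fields ("instance", "source", ...) are hypothetical.
from datetime import datetime, timedelta, timezone

def matches(record, instance=None, source=None, severity=None,
            start=None, end=None, text=None):
    """Return True when the record satisfies every supplied filter."""
    if instance and record["instance"] != instance:
        return False
    if source and record["source"] != source:
        return False
    if severity and record["severity"] != severity:
        return False
    if start and record["timestamp"] < start:
        return False
    if end and record["timestamp"] > end:
        return False
    if text and text.lower() not in record["message"].lower():
        return False
    return True

# Example: XID errors in dmesg logs from one node within the last 24 hours.
now = datetime.now(timezone.utc)
records = [
    {"instance": "gpu-node-1", "source": "dmesg", "severity": "Critical",
     "timestamp": now - timedelta(hours=2),
     "message": "NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus."},
]
hits = [r for r in records
        if matches(r, instance="gpu-node-1", source="dmesg",
                   start=now - timedelta(hours=24), text="xid")]
print(len(hits), "matching record(s)")
```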
Log Severity Levels
Logs are normalized to a five-level severity taxonomy:
| Severity | Description |
|---|---|
| Critical | Application cannot continue; requires immediate intervention |
| Error | Error handled, service continues |
| Warning | Unexpected situation, but handled gracefully |
| Info | Normal operational events (startup, shutdown, config changes) |
| Debug | Detailed diagnostic information |
RFC 5424 severity levels Emergency, Alert, and Critical map to Critical; Notice and Informational map to Info. Some log sources (such as JournalD) assign severity based on the output stream: stdout maps to Info, and stderr maps to a higher severity. The original severity is preserved alongside the normalized value.
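For reference, the normalization described above can be pictured as a simple lookup table. The sketch below is illustrative only, not a Crusoe API, and mapping stderr to Error is an assumption; the text above only says stderr maps to a higher severity.

```python
# Illustrative only -- not a Crusoe API. Shows how RFC 5424 severities and
# output streams collapse into the Console's five normalized levels.
RFC5424_TO_NORMALIZED = {
    "emergency": "Critical",
    "alert": "Critical",
    "critical": "Critical",
    "error": "Error",
    "warning": "Warning",
    "notice": "Info",
    "informational": "Info",
    "debug": "Debug",
}

STREAM_TO_NORMALIZED = {
    "stdout": "Info",
    "stderr": "Error",  # assumption: the docs only say "higher severity"
}

def normalize_severity(original: str) -> str:
    """Return the normalized level; unknown values pass through unchanged."""
    key = original.lower()
    return RFC5424_TO_NORMALIZED.get(key) or STREAM_TO_NORMALIZED.get(key, original)

assert normalize_severity("Alert") == "Critical"
assert normalize_severity("Notice") == "Info"
assert normalize_severity("stderr") == "Error"
```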
Log Retention
Logs are retained for 7 days, after which they are automatically purged.
Common Troubleshooting Workflows
Investigating a GPU Hardware Failure
- In Topology, identify the unhealthy node.
- Click the node and select Generate NVIDIA Bug Report.
- Navigate to Logs and filter by node instance name.
- Set the log source to dmesg and search for `XID` (a helper for pulling XID codes out of these lines is sketched after these steps).
- Review the kernel logs alongside the NVIDIA bug report.
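If you copy or export the matching dmesg lines, a small helper like the hypothetical sketch below can pull out the XID codes for triage. The message format shown is typical of NVIDIA kernel messages but can vary by driver version, so treat the pattern as a starting point.

```python
# Illustrative helper: extract NVIDIA XID codes from kernel log lines.
# The "NVRM: Xid (PCI:...): <code>, ..." format is typical but not guaranteed.
import re

XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\):\s*(?P<code>\d+)")

def extract_xids(lines):
    """Yield (device, xid_code) tuples found in kernel log lines."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield match.group("device"), int(match.group("code"))

sample = ["NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."]
for device, code in extract_xids(sample):
    print(f"XID {code} on {device}")
```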
Debugging a Failing Training Job
- Navigate to Logs and filter by Kubernetes pods.
- Search for the pod name or namespace.
- Review `stdout` and `stderr` for CUDA errors, NCCL timeouts, or application failures (a scan for common failure signatures is sketched after these steps).
- Cross-reference with Metrics for GPU utilization drops or memory pressure.
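As an optional cross-check of the Console search, the sketch below uses the Kubernetes Python client to scan a pod's recent output for common failure signatures. The pod name, namespace, and pattern list are examples, not documented values; substitute your own.

```python
# Illustrative cross-check: scan a pod's recent stdout/stderr for common
# failure signatures. Pod, namespace, and patterns are examples only.
from kubernetes import client, config

FAILURE_PATTERNS = ("CUDA error", "NCCL", "Traceback", "OOM")

config.load_kube_config()
v1 = client.CoreV1Api()

log_text = v1.read_namespaced_pod_log(
    name="trainer-0", namespace="ml-jobs", tail_lines=500
)
for line in log_text.splitlines():
    if any(pattern in line for pattern in FAILURE_PATTERNS):
        print(line)
```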
Diagnosing Storage Mount Issues
- Navigate to Logs and filter by node instance name.
- Set the log source to JournalD (Kubelet).
- Search for mount errors such as `MountVolume` or `nfs` (a node-level sketch follows these steps).
- Check for filesystem errors, RAID issues, or NFS connectivity problems.
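If you need to confirm findings directly on the affected node (for example during a support escalation), the sketch below reads kubelet entries from journald with the python-systemd bindings. It assumes those bindings are installed on the node and that the kubelet runs as the systemd unit kubelet.service; adjust both as needed.

```python
# Node-level cross-check: surface mount-related kubelet errors from journald.
# Assumes python-systemd is installed and kubelet runs as "kubelet.service".
from systemd import journal

reader = journal.Reader()
reader.this_boot()
reader.add_match(_SYSTEMD_UNIT="kubelet.service")

for entry in reader:
    message = str(entry.get("MESSAGE", ""))
    if "MountVolume" in message or "nfs" in message.lower():
        print(message)
```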
What's Next
- Topology — Identify unhealthy nodes and run diagnostics
- Metrics — Correlate log events with performance data
- Notifications — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks