Logs
The Managed Logs service for CMK is currently in preview. Contact Crusoe Cloud Support if you are interested in enabling this feature.
Kubernetes pod logs and system logs are collected automatically by the Crusoe Watch Agent and made available in the Console, where you can search, filter, and inspect them. No SSH or manual log aggregation is required.
Managed Logs is available for Crusoe Managed Kubernetes (CMK); support for Crusoe Virtual Machines (VMs) is planned for the future.
Prerequisites
To use Logs, you need:
- A CMK cluster with the Crusoe Watch Agent installed (see Installing the Crusoe Watch Agent)
- The NVIDIA GPU Operator add-on (if using GPU nodes)
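If you want to confirm the agent is actually running before you expect logs to show up, a minimal check with the Kubernetes Python client looks like the sketch below. The `crusoe-watch` namespace and the label selector are placeholders, not documented values, so substitute whatever your Watch Agent installation uses.

```python
# Sanity check: confirm the Crusoe Watch Agent pods are Running.
# NOTE: the namespace and label selector are placeholders -- substitute
# the values used by your Watch Agent installation.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

NAMESPACE = "crusoe-watch"                            # placeholder
LABEL = "app.kubernetes.io/name=crusoe-watch-agent"   # placeholder

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL)
if not pods.items:
    print(f"No agent pods found in '{NAMESPACE}' matching '{LABEL}'")
for pod in pods.items:
    print(f"{pod.metadata.name}: {pod.status.phase}")
```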
Log Sources
The Crusoe Watch Agent collects the following log types:
Kubernetes Pod Logs
stdout and stderr from pods, including application logs from training and inference workloads as well as system pod output.
JournalD Logs
System-level logs from journald on each node:
| Log Source | Description |
|---|---|
| Kubelet | Kubernetes node agent logs — pod lifecycle, volume mounts, node status |
| dmesg | Kernel messages including GPU XID errors, OOM events, and hardware failures |
| Container runtime | Container start, stop, and error events |
Accessing Logs
To view logs, navigate to Orchestration, select your cluster, and open the Logs sub-tab.
Searching and Filtering
You can use the following filters to narrow your log search:
| Filter | Description |
|---|---|
| Instance name | Filter logs by specific node or VM name |
| Log source | Select the log type: Kubernetes pods or JournalD |
| Severity | Filter by log severity level (see severity levels below) |
| Time window | Specify a start and end time to narrow results |
| Text search | Search log content using basic text matching |
Combine multiple filters to narrow results. For example, search for XID errors in dmesg logs from a specific node within the last 24 hours.
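Conceptually, the filters combine as a logical AND over each log record. The sketch below applies that same combination to a list of exported records with hypothetical field names; it only illustrates how the filters compose and is not a Crusoe API.

```python
# Illustrative only: Console filters combine as a logical AND.
# The record fields ("instance", "source", ...) are hypothetical.
from datetime import datetime, timedelta, timezone

def matches(record, instance=None, source=None, severity=None,
            start=None, end=None, text=None):
    """Return True when the record satisfies every supplied filter."""
    if instance and record["instance"] != instance:
        return False
    if source and record["source"] != source:
        return False
    if severity and record["severity"] != severity:
        return False
    if start and record["timestamp"] < start:
        return False
    if end and record["timestamp"] > end:
        return False
    if text and text.lower() not in record["message"].lower():
        return False
    return True

# Example: XID errors in dmesg logs from one node within the last 24 hours.
now = datetime.now(timezone.utc)
records = [
    {"instance": "gpu-node-1", "source": "dmesg", "severity": "Critical",
     "timestamp": now - timedelta(hours=2),
     "message": "NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus."},
]
hits = [r for r in records
        if matches(r, instance="gpu-node-1", source="dmesg",
                   start=now - timedelta(hours=24), text="xid")]
print(len(hits), "matching record(s)")
```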
Log Severity Levels
Logs are normalized to a five-level severity taxonomy:
| Severity | Description |
|---|---|
| Critical | Application cannot continue; requires immediate intervention |
| Error | Error handled, service continues |
| Warning | Unexpected situation, but handled gracefully |
| Info | Normal operational events (startup, shutdown, config changes) |
| Debug | Detailed diagnostic information |
RFC 5424 severity levels Emergency, Alert, and Critical map to Critical; Notice and Informational map to Info. Some log sources (such as JournalD) assign severity based on the output stream: stdout maps to Info, and stderr maps to a higher severity. The original severity is preserved alongside the normalized value.
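For reference, the normalization described above can be pictured as a simple lookup table. The sketch below is illustrative only, not a Crusoe API, and mapping stderr to Error is an assumption; the text above only says stderr maps to a higher severity.

```python
# Illustrative only -- not a Crusoe API. Shows how RFC 5424 severities and
# output streams collapse into the Console's five normalized levels.
RFC5424_TO_NORMALIZED = {
    "emergency": "Critical",
    "alert": "Critical",
    "critical": "Critical",
    "error": "Error",
    "warning": "Warning",
    "notice": "Info",
    "informational": "Info",
    "debug": "Debug",
}

STREAM_TO_NORMALIZED = {
    "stdout": "Info",
    "stderr": "Error",  # assumption: the docs only say "higher severity"
}

def normalize_severity(original: str) -> str:
    """Return the normalized level; unknown values pass through unchanged."""
    key = original.lower()
    return RFC5424_TO_NORMALIZED.get(key) or STREAM_TO_NORMALIZED.get(key, original)

assert normalize_severity("Alert") == "Critical"
assert normalize_severity("Notice") == "Info"
assert normalize_severity("stderr") == "Error"
```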
Log Retention
Logs are retained for 7 days, after which they are automatically purged.
Common Troubleshooting Workflows
Investigating a GPU Hardware Failure
- In Topology, identify the unhealthy node.
- Click the node and select Generate NVIDIA Bug Report.
- Navigate to Logs and filter by node instance name.
- Set the log source to dmesg and search for `XID` (a helper for pulling XID codes out of these lines is sketched after these steps).
- Review the kernel logs alongside the NVIDIA bug report.
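If you copy or export the matching dmesg lines, a small helper like the hypothetical sketch below can pull out the XID codes for triage. The message format shown is typical of NVIDIA kernel messages but can vary by driver version, so treat the pattern as a starting point.

```python
# Illustrative helper: extract NVIDIA XID codes from kernel log lines.
# The "NVRM: Xid (PCI:...): <code>, ..." format is typical but not guaranteed.
import re

XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\):\s*(?P<code>\d+)")

def extract_xids(lines):
    """Yield (device, xid_code) tuples found in kernel log lines."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield match.group("device"), int(match.group("code"))

sample = ["NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."]
for device, code in extract_xids(sample):
    print(f"XID {code} on {device}")
```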
Debugging a Failing Training Job
- Navigate to Logs and filter by Kubernetes pods.
- Search for the pod name or namespace.
- Review `stdout` and `stderr` for CUDA errors, NCCL timeouts, or application failures (a scan for common failure signatures is sketched after these steps).
- Cross-reference with Metrics for GPU utilization drops or memory pressure.
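As an optional cross-check of the Console search, the sketch below uses the Kubernetes Python client to scan a pod's recent output for common failure signatures. The pod name, namespace, and pattern list are examples, not documented values; substitute your own.

```python
# Illustrative cross-check: scan a pod's recent stdout/stderr for common
# failure signatures. Pod, namespace, and patterns are examples only.
from kubernetes import client, config

FAILURE_PATTERNS = ("CUDA error", "NCCL", "Traceback", "OOM")

config.load_kube_config()
v1 = client.CoreV1Api()

log_text = v1.read_namespaced_pod_log(
    name="trainer-0", namespace="ml-jobs", tail_lines=500
)
for line in log_text.splitlines():
    if any(pattern in line for pattern in FAILURE_PATTERNS):
        print(line)
```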
Diagnosing Storage Mount Issues
- Navigate to Logs and filter by node instance name.
- Set the log source to JournalD (Kubelet).
- Search for mount errors such as `MountVolume` or `nfs` (a node-level sketch follows these steps).
- Check for filesystem errors, RAID issues, or NFS connectivity problems.
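If you need to confirm findings directly on the affected node (for example during a support escalation), the sketch below reads kubelet entries from journald with the python-systemd bindings. It assumes those bindings are installed on the node and that the kubelet runs as the systemd unit kubelet.service; adjust both as needed.

```python
# Node-level cross-check: surface mount-related kubelet errors from journald.
# Assumes python-systemd is installed and kubelet runs as "kubelet.service".
from systemd import journal

reader = journal.Reader()
reader.this_boot()
reader.add_match(_SYSTEMD_UNIT="kubelet.service")

for entry in reader:
    message = str(entry.get("MESSAGE", ""))
    if "MountVolume" in message or "nfs" in message.lower():
        print(message)
```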
What's Next
- Topology — Identify unhealthy nodes and run diagnostics
- Metrics — Correlate log events with performance data
- Notifications — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks