Logs
Managed Logs is in production-ready preview. Please reach out to Crusoe Cloud Support to learn more.
System logs are collected automatically and available in the Console. No SSH or manual log aggregation required. The Crusoe Watch Agent collects logs, which you can search, filter, and inspect directly in the Console.
Managed Logs is available for both Crusoe Managed Kubernetes (CMK) clusters and Crusoe Virtual Machines (VMs).
Prerequisites
To use Logs, you need:
For CMK clusters:
- CMK cluster with Crusoe Watch Agent version 0.3.1 or above installed (see Installing the Crusoe Watch Agent)
- NVIDIA GPU Operator add-on (if using GPU nodes)
For VMs:
- Crusoe Watch Agent version 1.0.1 or above installed (see Virtual Machines Metrics)
Log Sources
The Crusoe Watch Agent collects the following log types:
JournalD Logs
System-level logs from journald on each CMK node or VM.
| Log Source | Description | Availability |
|---|---|---|
| Kubelet | Kubernetes node agent logs — pod lifecycle, volume mounts, node status | CMK only |
| Kernel logs | Kernel messages including GPU XID errors, OOM events, and hardware failures | CMK and VM |
| System logs | System-level service logs and events | CMK and VM |
| Container runtime | Container start, stop, and error events | CMK only |
Accessing Logs Using Console UI
You can access logs in the Console UI in two ways:
- Managed Logs page — Navigate to Managed Logs in the left navigation bar to search logs across all your CMK clusters and VMs in a unified view.
- Resource-specific view — Navigate to Orchestration > select your cluster > Logs tab.
Searching and Filtering
You can use the following filters to narrow your log search:
| Filter | Description |
|---|---|
| Instance name | Filter logs by specific node or VM name |
| Log source | Select the log type: Kubernetes pods or JournalD |
| Severity | Filter by log severity level (see severity levels below) |
| Time window | Specify a start and end time to narrow results |
| Text search | Search log content using basic text matching |
Combine multiple filters to narrow results. For example, search for XID errors in dmesg logs from a specific node within the last 24 hours.
Log Severity Levels
Logs are normalized to a 5-tiered severity taxonomy:
| Severity | Description |
|---|---|
| Critical | Application cannot continue; requires immediate intervention |
| Error | Error handled, service continues |
| Warning | Unexpected situation, but handled gracefully |
| Info | Normal operational events (startup, shutdown, config changes) |
| Debug | Detailed diagnostic information |
RFC 5424 severity levels (Emergency, Alert, Critical) map to Critical, and Notice and Informational map to Info. Some log sources (such as JournalD) assign severity based on output stream: stdout maps to Info, and stderr maps to a higher severity. The original severity is preserved alongside the normalized value.
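The normalization described above can be sketched as a small lookup. This is an illustrative sketch, not part of the product: the function name and the fallback for unknown labels are assumptions; the mappings themselves follow the table and the RFC 5424 rules stated here.

```shell
# Sketch of the documented severity normalization. normalize_severity and
# the "Info" fallback for unrecognized labels are illustrative assumptions.
normalize_severity() {
  case "$1" in
    emergency|alert|critical) echo "Critical" ;;   # RFC 5424 top tiers -> Critical
    error)                    echo "Error" ;;
    warning)                  echo "Warning" ;;
    notice|informational|info) echo "Info" ;;      # Notice/Informational -> Info
    debug)                    echo "Debug" ;;
    *)                        echo "Info" ;;       # assumed fallback
  esac
}

normalize_severity alert    # → Critical
normalize_severity notice   # → Info
```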
Querying Logs via API
You can programmatically query logs using the LogsQL API endpoint. This allows you to retrieve logs for automation, integrate with external tools, or perform custom analysis.
API Endpoint
```
https://api.crusoecloud.com/v1alpha5/projects/<project-id>/logs/query
```
Authentication
Use the same monitoring token generated for metrics access (see Virtual Machines Metrics or CMK Metrics).
Examples
Query logs for a specific VM:
```bash
curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
  -H "Authorization: Bearer $monitoringtoken" \
  --data-urlencode "query=crusoe_vm_id:$vm_id"
```
Limit results to 10 entries:
```bash
curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
  -H "Authorization: Bearer $monitoringtoken" \
  --data-urlencode "query=crusoe_vm_id:$vm_id" \
  --data-urlencode "limit=10"
```
Search for error logs in a specific VM:
```bash
curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
  -H "Authorization: Bearer $monitoringtoken" \
  --data-urlencode "query=crusoe_vm_id:$vm_id AND error"
```
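The same endpoint can be scripted, for example to check several VMs in one pass. This is a hedged sketch built only from the flags shown above: the project ID, token, and VM IDs are placeholders, and the curl command is echoed as a dry run so you can inspect it before sending (remove the leading `echo` to actually issue each request).

```shell
# Hedged sketch: loop the documented logs query over several VMs.
# project_id, monitoringtoken, and the VM IDs below are placeholders.
project_id="my-project"         # placeholder
monitoringtoken="my-token"      # placeholder

query_vm_errors() {
  # Dry run: echoes the curl invocation for one VM; drop "echo" to send it.
  echo curl -G "https://api.crusoecloud.com/v1alpha5/projects/$project_id/logs/query" \
    -H "Authorization: Bearer $monitoringtoken" \
    --data-urlencode "query=crusoe_vm_id:$1 AND error" \
    --data-urlencode "limit=10"
}

for vm_id in vm-abc123 vm-def456; do   # placeholder VM IDs
  query_vm_errors "$vm_id"
done
```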
Log Retention
Logs are retained for 7 days, after which they are automatically purged.
Common Troubleshooting Workflows
Investigating a GPU Hardware Failure
- In Topology, identify the unhealthy node.
- Click the node and select Generate NVIDIA Bug Report.
- Navigate to Logs and filter by node instance name.
- Set the log source to dmesg and search for `XID`.
- Review the kernel logs alongside the NVIDIA bug report.
Debugging a Failing Training Job
- Navigate to Logs and filter by Kubernetes pods.
- Search for the pod name or namespace.
- Review `stdout` and `stderr` for CUDA errors, NCCL timeouts, or application failures.
- Cross-reference with Metrics for GPU utilization drops or memory pressure.
Diagnosing Storage Mount Issues
- Navigate to Logs and filter by node instance name.
- Set the log source to JournalD (Kubelet).
- Search for mount errors: `MountVolume`, `nfs`.
- Check for filesystem errors, RAID issues, or NFS connectivity problems.
What's Next
- Topology — Identify unhealthy nodes and run diagnostics
- Metrics — Correlate log events with performance data
- Notifications — Get notified about resource health via email and in-console, and set up alert routing to Slack or webhooks