Shared Responsibility Model
Crusoe Managed Kubernetes (CMK) is a managed Kubernetes service purpose-built for GPU-accelerated and AI/ML workloads. Operating a production Kubernetes environment is a shared effort — Crusoe is responsible for the underlying infrastructure and control plane, while you are responsible for your workloads and applications. The layers in between involve shared ownership, and this document describes the specific boundaries at each layer.
Responsibility by Layer
At a high level, Crusoe manages the infrastructure and control plane. You manage your workloads and applications. The layers in between are shared, with clear ownership at each component. Where a category appears in multiple columns below, individual components within that category have different owners — the detailed sections that follow break down which parts are Crusoe-managed and which are shared or customer-owned.
| Layer | Crusoe | Shared | Customer |
|---|---|---|---|
| Workloads | | | ● |
| Application Config | | | ● |
| Access & RBAC | | ● | |
| Add-Ons | | ● | |
| Worker Nodes | | ● | |
| Networking | | ● | |
| Storage | | ● | |
| Control Plane | ● | | |
| Infrastructure | ● | | |
What This Means at Each Layer
Infrastructure
We own the physical environment. You never touch hardware.
Crusoe manages: Physical hardware (GPUs, CPUs), data centers, power, cooling, hypervisor / bare-metal provisioning, and InfiniBand RDMA networking fabric.
Control Plane
We run and upgrade the Kubernetes control plane. You choose the version and tell us when to upgrade.
Crusoe manages: kube-apiserver, etcd, kube-scheduler, kube-controller-manager. We handle provisioning (minimum 3 nodes, distributed for HA), availability, scaling, patching, and Kubernetes version upgrades.
You manage: Kubernetes version selection at cluster creation. When you are ready to upgrade, you request the upgrade from Crusoe and we handle it.
Worker Nodes
Crusoe provides the base machine images for worker nodes. You decide how many nodes to run, how they are organized into node pools, and when to apply updates.
Crusoe manages: Worker node base OS images and base OS configuration.
Shared: OS updates and patching (Crusoe provides updated images; you apply them by cycling nodes in your node pools). Cluster Autoscaler (Crusoe provides a CMK-compatible build; you configure min/max bounds and deploy it). AutoClusters (Crusoe detects hardware failures and remediates automatically when your settings allow it; you configure those settings).
You manage: Node pool creation, configuration, sizing, manual scaling, and deletion (required before cluster deletion).
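As an illustration of the shared Cluster Autoscaler responsibility, the min/max bounds you configure are typically expressed as Helm values for the Crusoe-provided build. The node pool name and bounds below are hypothetical, and the exact key names depend on the chart Crusoe ships; treat this as a sketch of the shape, not the definitive schema:

```yaml
# Hypothetical Helm values for the CMK-compatible Cluster Autoscaler build.
# "gpu-pool-a" and the size bounds are examples, not defaults -- consult the
# chart Crusoe provides for the exact values format.
autoscalingGroups:
  - name: gpu-pool-a   # your node pool identifier
    minSize: 1         # never scale in below one node
    maxSize: 8         # cap capacity (and spend) at eight nodes
```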
Networking
Crusoe provides the cluster networking layer, load balancers, and the firewall rules required for cluster operation. You configure your network topology, application traffic rules, and any additional firewall rules for your workloads.
Crusoe manages: CNI (Cilium) — installation and default configuration. Load balancers (L4 passthrough). Cluster and node pool firewall rules required for cluster operation (created on cluster creation, updated as necessary, removed on cluster deletion).
Shared: Firewall configuration (Crusoe creates and manages rules required for cluster functionality; you create and manage rules for exposing your applications and workloads).
You manage: VPC / subnet configuration, Pod CIDR, subnet mask, and Service CIDR (set at cluster creation), Kubernetes Services and Ingress objects, network policies, and application-level firewall rules.
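With Crusoe managing L4 passthrough load balancers, the usual way to expose a workload is a standard Kubernetes Service of type LoadBalancer. The names and ports below are illustrative:

```yaml
# Example customer-owned Service. With CMK's L4 passthrough load
# balancers, a type: LoadBalancer Service exposes the workload externally.
apiVersion: v1
kind: Service
metadata:
  name: inference-api        # example name
spec:
  type: LoadBalancer
  selector:
    app: inference-api       # must match your Deployment's pod labels
  ports:
    - port: 443
      targetPort: 8443
```

Remember that the firewall rule admitting this traffic falls on your side of the shared firewall responsibility described above.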
Storage
We provide the underlying storage infrastructure. You define how storage is allocated and consumed by your workloads.
Crusoe manages: Persistent Disk block storage infrastructure and the Shared Disk / NFS (VAST Data) backend. NFS enablement must be requested from Crusoe Support per project.
Shared: Crusoe CSI Driver (Crusoe provides the Helm chart; you install and configure).
You manage: StorageClass definitions, PersistentVolumeClaim definitions, and any storage migration steps (e.g., VirtioFS to NFS).
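A sketch of the customer-owned storage objects built on the Crusoe-managed backend. The provisioner string is a placeholder (use the value documented with the Crusoe CSI Driver Helm chart you installed), and the names and size are illustrative:

```yaml
# Customer-owned StorageClass and PVC. The provisioner below is a
# placeholder -- substitute the value from the Crusoe CSI Driver docs.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: persistent-ssd
provisioner: <crusoe-csi-provisioner>   # placeholder
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: persistent-ssd
  resources:
    requests:
      storage: 500Gi
```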
Add-Ons and Operators
Crusoe provides a set of core cluster add-ons that extend cluster functionality for GPU-accelerated workloads. You may opt in to these add-ons at cluster creation or install them later via Helm. Any additional operators or charts you bring are yours to manage.
Crusoe manages: Installation of core cluster add-ons, including Cilium, the NVIDIA GPU Operator, and the NVIDIA Network Operator (required for InfiniBand-enabled instances). Crusoe manages critical upgrades to these add-ons.
Shared: Add-on configuration (Crusoe provides a reference configuration for GPU and Network Operators; you may customize settings such as driver versions to suit your workloads).
You manage: All other third-party Helm charts and operators. If you install your own GPU or Network Operator outside of the Crusoe-provided add-ons, or significantly modify the Crusoe-provided configuration, that add-on becomes Customer-Owned.
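Customizing the shared add-on configuration usually means overriding values on the Crusoe-provided reference configuration, for example pinning a driver version on the GPU Operator. The key path and version below are illustrative; whether this is the right knob depends on the reference configuration Crusoe ships:

```yaml
# Hypothetical values override for the Crusoe-provided GPU Operator
# add-on. Treat the key names as a shape, not exact schema -- verify
# against the reference configuration before applying.
driver:
  version: "550.90.07"   # example: pin the driver version your workload needs
```

Keep such overrides minimal: per the paragraph above, significant modifications move the add-on into Customer-Owned territory.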
Identity, Access, and Security
We provide the IAM framework and container registry. You manage who has access and what they can do.
Crusoe manages: Project-level IAM (Admin / Editor / Reader roles). Crusoe Container Registry (CCR) infrastructure.
Shared: kubeconfig generation (Crusoe generates; you download and manage locally). CCR token rotation (Crusoe provides CronJob Helm chart; you install and configure). OIDC configuration (you provide the identity provider configuration; Crusoe applies it to cluster components such as kube-apiserver).
You manage: Organization member roles, Crusoe API keys, Kubernetes RBAC (all in-cluster roles, bindings, and authorization — including when using OIDC), application secrets, CCR repository creation, and image push/pull.
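In-cluster authorization is entirely customer-owned, including when identities come from your OIDC provider. A minimal example: a namespaced read-only Role bound to an OIDC group. The namespace and group name are illustrative:

```yaml
# Customer-owned RBAC: a read-only Role in the "training" namespace,
# bound to the hypothetical OIDC group "ml-engineers".
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: training
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: training
  name: pod-reader-binding
subjects:
  - kind: Group
    name: ml-engineers          # group claim from your OIDC provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```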
Observability
Crusoe surfaces cluster and hardware metrics to help you monitor performance and utilization. If you install the Crusoe Watch Agent, Crusoe also collects and manages a subset of node-level logs. You are responsible for application-level monitoring and log aggregation.
Crusoe manages: Cluster metrics (Prometheus-compatible endpoint), GPU / interconnect / host telemetry, active hardware health checks (AutoClusters), and node-level log collection when the Crusoe Watch Agent is installed.
You manage: Application-level monitoring (Grafana, etc.) and log aggregation (Loki, Fluentd, etc.).
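Because the cluster metrics endpoint is Prometheus-compatible, wiring it into your own monitoring stack is typically a scrape job. The target address below is a placeholder; use the endpoint shown for your cluster:

```yaml
# Sketch of a Prometheus scrape job for the Crusoe-managed cluster
# metrics endpoint. The target is a placeholder, not a real address.
scrape_configs:
  - job_name: cmk-cluster-metrics
    scheme: https
    static_configs:
      - targets: ["<cluster-metrics-endpoint>"]   # placeholder
```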
Workloads and Applications
Everything you deploy in the cluster is yours.
You manage: All workloads and application-level resources, including Deployments, StatefulSets, DaemonSets, Jobs, namespaces, resource quotas, pod scheduling and affinity rules, custom schedulers, AI/ML training job orchestration (Kubeflow, PyTorchJob, etc.), and ConfigMaps.
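For example, a typical customer-owned workload is a Deployment that requests GPUs through the resource name exposed by the NVIDIA GPU Operator. The names, image, and GPU count are illustrative:

```yaml
# Illustrative customer-owned workload requesting GPUs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trainer            # example name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      containers:
        - name: trainer
          image: registry.example/team/trainer:latest   # your image, e.g. from CCR
          resources:
            limits:
              nvidia.com/gpu: 8   # adjust to your instance type
```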
Support Scope
Supported: Crusoe owns it. Covered by our service-level commitments. File a ticket; we own resolution. Includes control plane, worker node OS, GPU drivers, Cilium, Persistent Disk infrastructure, cluster metrics.
Best-Effort: We investigate and advise, but do not own the outcome. Not covered by SLA. Includes GPU driver issues under specific workload patterns, NCCL tuning, performance optimization, custom scheduling conflicts.
Customer-Owned: You install and operate it. If it breaks the cluster, we restore cluster health but will not debug the component. Including but not limited to: service mesh (Istio, Linkerd), custom ingress controllers, custom schedulers, third-party operators, application monitoring, CI/CD tooling, custom admission webhooks.
Cluster Stability and Component Conflicts
Because your CMK clusters are dedicated to you, you have full control over the workloads and operators you deploy. However, Crusoe remains responsible for the health and uptime of the Control Plane.
If a customer-installed component (such as a custom webhook or third-party operator) causes the Control Plane to fail or degrades the underlying infrastructure, our priority is to restore baseline health. We will notify you to fix or remove the component.
For questions about specific components not covered here, contact Crusoe support.