What Velero backs up on an AI cluster

Velero captures Kubernetes API objects and, optionally, persistent volume data, but leaves ephemeral and external artifacts, such as model weights pulled at runtime, outside its scope.

Velero is a Kubernetes backup and disaster recovery tool that operates at the control plane level. It does not run inside application containers. It interacts with the Kubernetes API server to list resources and with the underlying storage system to capture volume data. For AI workloads, this distinction matters because training pipelines often rely on ephemeral storage for intermediate checkpoints or pull model weights from external object stores at runtime. Velero protects the orchestration state, not the external data sources the orchestration references.

The system distinguishes between resources it manages directly and resources it delegates. API objects such as Deployments, Services, and Custom Resources like PyTorchJob are serialized to JSON and stored in a backup repository. The data inside Persistent Volumes (PVs) requires a separate mechanism because it lives in the storage system, not in the API server. Velero delegates volume backup to either the Container Storage Interface (CSI) snapshot controller or a file-level backup agent like Restic. This split architecture defines the boundaries of what can be recovered in a disaster scenario.

The API object capture

Velero lists resources within a namespace based on the Backup resource specification. It includes standard Kubernetes resources and Custom Resource Definitions (CRDs) defined in the cluster. When a backup runs, the controller queries the API server for the selected namespaces. It retrieves the full manifest for each resource, including fields like spec, status, and metadata. These manifests are stored as JSON in the backup repository, typically an object store like Amazon S3 or MinIO.

For AI clusters, this means CRDs from the Kubeflow Training Operator, such as PyTorchJob or TFJob, are backed up alongside standard Deployments. However, the backup captures the definition of the job, not the runtime state of the model weights stored on the volume. If a PyTorchJob references a PersistentVolumeClaim, Velero backs up the PersistentVolumeClaim object, but the data inside the volume requires a separate volume backup operation. The Backup resource defines which namespaces are included and whether volumes are snapshotted.
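To make the split concrete, here is a minimal sketch of a PyTorchJob that mounts a PersistentVolumeClaim. The job name, image, and claim name are hypothetical, chosen only for illustration. Velero would capture this manifest and the PersistentVolumeClaim object it names; the files under /checkpoints are covered only by a volume backup.

```yaml
# Hypothetical job: names, image, and claim are illustrative assumptions.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet-train
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/ml/train:latest
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints   # data here needs a volume backup
          volumes:
            - name: checkpoints
              persistentVolumeClaim:
                claimName: train-checkpoints   # the PVC object itself is an API object
```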

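apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ai-cluster-backup
  namespace: velero            # namespace where Velero itself is installed
spec:
  includedNamespaces:          # only resources in these namespaces are selected
    - ml-training
    - ml-inference
  snapshotVolumes: true        # request snapshots of the selected persistent volumes
  ttl: 720h0m0s                # keep the backup for 30 days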

This configuration backs up all resources in the ml-training and ml-inference namespaces. Setting snapshotVolumes to true instructs Velero to attempt a snapshot of the persistent volumes bound to PersistentVolumeClaims in the selected namespaces; file-level backup is configured separately. The backup process does not validate the content of the volume; it only ensures the storage system creates a snapshot or copies the data. The operator must ensure the storage class supports the chosen backup method.

Volume backup methods

Velero supports two primary methods for backing up persistent data: CSI Snapshots and Restic. The choice between them determines the recovery time objective and the storage cost. CSI Snapshots rely on the VolumeSnapshot CRD and the storage provider’s native capability to create point-in-time copies. Restic uses a file-level backup agent running in a pod to copy data over the network to a backup repository.
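For the CSI path, Velero needs a VolumeSnapshotClass it can select for the volume's driver. A common way to mark one is the velero.io/csi-volumesnapshot-class label. The sketch below assumes an AWS EBS CSI driver and a hypothetical class name; substitute the driver your storage class actually uses.

```yaml
# Sketch only: the class name and driver are assumptions for illustration.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: training-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"   # lets Velero pick this class for the driver
driver: ebs.csi.aws.com        # must match the provisioner of the volumes being snapshotted
deletionPolicy: Retain         # keep the storage-side snapshot if the snapshot object is deleted
```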

The following table compares the two methods in the context of AI workloads. CSI Snapshots are storage-side operations, while Restic is network-side.

Method	Dependency	Speed	Impact on Node	Use Case
CSI Snapshot	Storage Driver	Fast	Low	Large datasets, production training
Restic	Network Bandwidth	Slow	High	Ephemeral storage, no CSI support

CSI Snapshots are generally faster for large datasets because the storage system handles the copy operation. This is critical for AI training checkpoints that may be hundreds of gigabytes. Restic is slower because it transfers data over the network to the backup repository. Restic is useful when the storage driver does not support snapshots or when the volume is not backed by a CSI driver. However, Restic agents consume CPU and I/O on the node, which can interfere with GPU training workloads.
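Where no snapshot-capable driver is available, a Backup can request file-level backup for every pod volume instead. The sketch below assumes a recent Velero release, where the field is named defaultVolumesToFsBackup; older releases used a differently named field, so check the version in use. The backup name is hypothetical.

```yaml
# Sketch: file-level backup for all pod volumes in one namespace.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ml-training-fs-backup   # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
    - ml-training
  snapshotVolumes: false          # skip storage-side snapshots
  defaultVolumesToFsBackup: true  # copy volume contents through the node agent
  ttl: 720h0m0s
```

Because the copy runs on the node, scheduling such backups outside training windows is one way to limit contention with GPU jobs.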

What remains outside scope

Velero does not back up state that exists outside the Kubernetes API or the Persistent Volumes it manages. Training checkpoints stored on emptyDir volumes are lost if the node fails, as emptyDir is ephemeral and tied to the pod lifecycle. Model weights pulled from an external S3 bucket at runtime are not backed up by Velero because they reside in an external service. GPU driver state on the node is not backed up because it is part of the host operating system, not the cluster state.

The emptyDir volume is a common source of data loss in AI pipelines. If a training pod uses emptyDir to store intermediate gradients or checkpoints, volume snapshots do not cover that data, and Velero captures it only if the volume is explicitly opted in to file-level backup. Otherwise the backup will restore the PyTorchJob manifest, but the pod will start with an empty volume, and the training process will fail or restart from the beginning. Similarly, if the cluster is restored to a different region, the storage class may map to a different storage backend, requiring manual reconfiguration of the VolumeSnapshotClass.
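One way to reduce this exposure is to keep only disposable scratch data on emptyDir and mount the checkpoint directory from a PersistentVolumeClaim. The pod below is a minimal sketch; the pod name, image, and claim name are hypothetical.

```yaml
# Sketch: scratch data stays ephemeral, checkpoints go to a claim-backed volume.
apiVersion: v1
kind: Pod
metadata:
  name: trainer                 # hypothetical
  namespace: ml-training
spec:
  containers:
    - name: trainer
      image: registry.example.com/ml/train:latest   # hypothetical image
      volumeMounts:
        - name: scratch
          mountPath: /tmp/scratch    # safe to lose with the pod
        - name: checkpoints
          mountPath: /checkpoints    # must survive pod and node failure
  volumes:
    - name: scratch
      emptyDir: {}
    - name: checkpoints
      persistentVolumeClaim:
        claimName: train-checkpoints   # hypothetical claim
```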

Failure modes

Recovery often fails due to missing dependencies in the target cluster. If Velero restores a PyTorchJob but the Kubeflow Training Operator is not installed, nothing reconciles the restored object: no training pods are created and the job's status never updates. There are no operator logs to consult, because the operator is what is missing. If the target cluster has no storage class or VolumeSnapshotClass compatible with the ones used in the backup, the restore will fail to provision the volume.
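When the target cluster uses different storage class names, Velero can rewrite them during restore through a plugin ConfigMap in its own namespace. The sketch below shows the general shape; both storage class names are hypothetical, and the mapping only helps if the target class can actually provision a compatible volume.

```yaml
# Sketch: map a source storage class to a target one at restore time.
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""                        # marks this as plugin configuration
    velero.io/change-storage-class: RestoreItemAction  # selects the storage class rewrite action
data:
  # <source storage class>: <target storage class> (both names are assumptions)
  fast-ssd-us-east: fast-ssd-eu-west
```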

Data inconsistency is another failure mode. If a backup occurs while a training job is writing to a volume, the backup may capture a partial write, and the restored checkpoint may be unreadable. CSI snapshots narrow the window because the storage system takes a point-in-time copy, but that copy is crash-consistent at best: data still held in application or OS buffers is not in it. File-level backups like Restic read files over a period of time and can capture a file mid-write. The operator must ensure the application flushes buffers before the backup runs or use application-consistent snapshots.
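Velero's backup hooks are one place to trigger such a flush: annotations on the pod name a command to run inside a container before the pod's volumes are backed up. The fragment below is a sketch with a hypothetical container name. Note that sync only flushes operating system buffers; a training process that buffers checkpoints in memory would need its own flush or checkpoint command here.

```yaml
# Sketch: pod metadata fragment with a pre-backup hook.
metadata:
  annotations:
    pre.hook.backup.velero.io/container: trainer              # hypothetical container name
    pre.hook.backup.velero.io/command: '["/bin/sh", "-c", "sync"]'
    pre.hook.backup.velero.io/timeout: 60s                    # how long to wait for the command
    pre.hook.backup.velero.io/on-error: Fail                  # mark the backup as failed if the hook fails
```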

GPU driver state is a failure mode that no backup report will flag. If the cluster is restored to a new set of nodes, the new nodes must have the correct NVIDIA drivers installed. Velero does not install drivers. If the drivers are missing, pods that request GPUs will stay unschedulable or fail at startup because no GPU device is available. Driver installation has to be handled outside Velero, for example by a driver-installer DaemonSet or the NVIDIA GPU Operator.

Decision frame

The question to ask the next time a disaster recovery plan is reviewed is not “did the backup run.” It is “does the backup include the storage class and CRDs required to restore the workload.” Velero restores the manifest, but the manifest depends on external infrastructure like CSI drivers and model registries. The backup schedule sets the recovery point objective, but whether that recovery point is reachable depends on storage class compatibility. If the target cluster lacks a VolumeSnapshotClass compatible with the one used in the source, snapshot-based volume data cannot be restored regardless of backup integrity. Verify the storage class mapping before the disaster occurs.
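One way to run that verification is a restore drill into a scratch namespace on the target cluster. The Restore below is a sketch: it reuses the ai-cluster-backup name from the earlier example, and the restore name and target namespace are hypothetical.

```yaml
# Sketch: restore one namespace under a different name to test volume provisioning.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: ml-training-drill        # hypothetical
  namespace: velero
spec:
  backupName: ai-cluster-backup
  includedNamespaces:
    - ml-training
  namespaceMapping:
    ml-training: ml-training-drill   # restore into a scratch namespace
  restorePVs: true                   # exercise the volume restore path, not just manifests
```

If the PersistentVolumeClaims in the scratch namespace bind and the expected checkpoint files are present, the storage class mapping works for that backup.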