Eviction signals interrupt training checkpoints

Node-pressure eviction sends SIGTERM to training pods, and any checkpoint write that has not finished by the time SIGKILL follows is left incomplete.

The kubelet monitors local node resources and triggers eviction when thresholds for memory, disk, or PID usage are breached. Training workloads running distributed frameworks like PyTorch or DeepSpeed rely on periodic checkpointing to persist model state. When a node enters eviction mode, the kubelet initiates a termination sequence that signals the container process to stop. If the application is in the middle of writing a checkpoint file when this signal arrives, the resulting file can be incomplete or corrupted.

This interaction is not a scheduler failure. It happens when the application’s signal handling does not finish within the termination grace period. The Kubernetes API exposes that grace period as terminationGracePeriodSeconds, the window in which the container must exit gracefully. The kubelet enforces this window strictly. If the checkpoint write takes longer than the grace period, the process receives SIGKILL, which terminates the write immediately. The file system is left with a partially written blob that the training job cannot resume from.

The eviction sequence

Eviction begins when the kubelet detects that a node’s local resources have fallen below configured thresholds. These thresholds are defined in the KubeletConfiguration (the evictionHard and evictionSoft fields) and apply to signals such as memory.available, nodefs.available, and imagefs.available. When a threshold is breached, the kubelet enters eviction mode and ranks pods for termination by whether their usage exceeds their requests, by pod priority, and by how far their usage exceeds their requests.

For training pods, the critical resource is often nodefs.available. Distributed training checkpoints can consume hundreds of gigabytes of local ephemeral storage. If the node’s disk fills up during a save operation, the kubelet may trigger eviction to reclaim space. The kubelet evicts lower-priority pods first, but it keeps evicting until the pressure clears, so under severe pressure even higher-priority training pods can be selected.

Termination is a two-step sequence. The container’s main process first receives SIGTERM and is expected to catch it, stop accepting new work, and finish in-flight operations. The kubelet then waits for the grace period, and if the process is still running afterwards it sends SIGKILL. For node-pressure eviction the effective grace period can be shorter than the Pod’s terminationGracePeriodSeconds: the Kubernetes documentation states that a hard eviction threshold uses a 0-second grace period, while a soft threshold uses the smaller of terminationGracePeriodSeconds and the kubelet’s eviction-max-pod-grace-period setting. Either way there is a race: the checkpoint write must complete within the effective grace period, or the file is truncated.

The kubelet’s default hard eviction thresholds on Linux nodes are shown below, together with how the kubelet derives each signal. These values determine when the kubelet decides the node is under pressure.

Signal	Default hard threshold	Kubelet derivation
`memory.available`	100Mi	`node.status.capacity[memory] - node.stats.memory.workingSet`
`nodefs.available`	10%	`node.stats.fs.available`
`imagefs.available`	15%	`node.stats.runtime.imagefs.available`
`pid.available`	none by default	`node.stats.rlimit.maxpid - node.stats.rlimit.curproc`

When nodefs.available drops below 10%, the kubelet first tries to reclaim space by removing dead containers and unused images, and if that is not enough it begins evicting pods. If the training pod is writing a checkpoint at that moment, the write operation competes with the eviction process for the remaining disk space.
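A training job can estimate before it saves whether the write itself would push the filesystem under the threshold. The sketch below is an approximation only: `write_would_breach` and `check_path` are hypothetical helper names, and `shutil.disk_usage` reports free space as seen by the calling process, which may not match the kubelet's own accounting exactly.

```python
import shutil


def write_would_breach(total_bytes, free_bytes, write_bytes, threshold=0.10):
    """Return True if writing write_bytes would leave less than
    `threshold` (a fraction of total capacity) free on the filesystem."""
    remaining = free_bytes - write_bytes
    return remaining / total_bytes < threshold


def check_path(path, write_bytes, threshold=0.10):
    """Apply the same check to the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return write_would_breach(usage.total, usage.free, write_bytes, threshold)
```

A job could call `check_path("/checkpoints", expected_size)` before each save and skip or relocate the write when it returns True. The 10% default here mirrors the nodefs.available threshold discussed above and would need to match the cluster's actual kubelet configuration.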

The termination signal flow

The termination sequence is governed by the terminationGracePeriodSeconds field in the Pod spec. This value defaults to 30 seconds. During this window, the kubelet waits for the container runtime to report that the process has exited.

The flow proceeds as follows:

Signal Injection: The kubelet sends SIGTERM to PID 1 of the container.
Grace Period: The kubelet starts a timer set to terminationGracePeriodSeconds.
Process Exit: If the process exits before the timer expires, the kubelet proceeds to cleanup.
Force Kill: If the timer expires and the process is still running, the kubelet sends SIGKILL.
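The same four steps can be reproduced locally with an ordinary child process. This is a sketch of the pattern on a POSIX system, not of the kubelet's implementation: `stop_with_grace` and `demo` are hypothetical names, and the child ignores SIGTERM to stand in for a process stuck in a long write.

```python
import signal
import subprocess
import sys

# Child program that ignores SIGTERM, simulating a process that cannot
# finish its work before the grace period expires.
CHILD_SRC = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_grace(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then send SIGKILL.
    Returns the child's exit code."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()


def demo(grace_seconds=0.5):
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SRC],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # block until the child has set up SIG_IGN
    code = stop_with_grace(proc, grace_seconds)
    proc.stdout.close()
    return code
```

On POSIX, `demo()` returns the negative signal number of the signal that ended the child, so a result equal to `-signal.SIGKILL` indicates the force-kill branch was taken.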

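Training jobs can install a signal handler to catch SIGTERM. PyTorch does not save a checkpoint on SIGTERM by itself, so the training script has to register the handler. A common pattern is for the handler to set a flag that the training loop checks at the next step boundary, where the loop calls torch.save, with a collective such as torch.distributed.barrier keeping the ranks in step. Whatever the pattern, the write must complete before the grace period ends.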
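A minimal sketch of a flag-based handler follows. It assumes a single-process training loop; `StopFlag`, `train`, `run_step`, and `save_checkpoint` are hypothetical names, and in a real job `save_checkpoint` would wrap the framework's own save call. The handler only records the signal and performs no I/O, because the interrupted code may already be in the middle of a write.

```python
import signal


class StopFlag:
    """Records that SIGTERM arrived; the handler itself does no I/O."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True


def train(num_steps, stop, run_step, save_checkpoint):
    """Run steps until finished or until a stop is requested,
    then write exactly one checkpoint at a step boundary."""
    completed = 0
    for step in range(num_steps):
        if stop.requested:
            break
        run_step(step)
        completed = step + 1
    save_checkpoint(completed)
    return completed
```

Saving only at a step boundary keeps the checkpoint consistent with the optimizer state, at the cost of waiting for the current step to finish before the write can begin. That wait counts against the grace period.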

If the checkpoint is large (e.g., 50 GB) and the disk I/O is slow, the write may take longer than 30 seconds: a 50 GB write in 30 seconds needs roughly 1.7 GB/s of sustained throughput. Even if the application catches SIGTERM, the kubelet will not extend the grace period automatically. Once the timer hits zero, SIGKILL is sent. SIGKILL cannot be caught by the application. It terminates the process immediately, leaving the checkpoint file incomplete.
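The sizing arithmetic generalizes to other checkpoint sizes and grace periods. The helpers below are a rough sketch with hypothetical names. The `safety_factor` default of 2.0 is an arbitrary assumption to leave headroom for contended I/O, not a recommended value.

```python
def required_throughput(checkpoint_bytes, grace_seconds):
    """Sustained bytes per second needed to finish within the grace period."""
    return checkpoint_bytes / grace_seconds


def fits_in_grace(checkpoint_bytes, throughput_bytes_per_s, grace_seconds,
                  safety_factor=2.0):
    """True if the estimated write time, padded by safety_factor,
    fits inside the grace period."""
    write_seconds = checkpoint_bytes / throughput_bytes_per_s
    return write_seconds * safety_factor <= grace_seconds
```

For example, at an assumed 500 MB/s a 50 GB checkpoint takes about 100 seconds to write, so `fits_in_grace(50e9, 500e6, 30)` is False while `fits_in_grace(50e9, 500e6, 300)` is True.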

The following YAML snippet shows a training Pod spec configured with an extended grace period to accommodate checkpoint writes.

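apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  # Requested window between SIGTERM and SIGKILL (the default is 30 seconds).
  terminationGracePeriodSeconds: 300
  containers:
  - name: pytorch-trainer
    # Illustrative image tag; pin the tag your cluster actually uses.
    image: pytorch/pytorch:2.1-cuda12
    resources:
      limits:
        nvidia.com/gpu: 4
    volumeMounts:
    - name: checkpoint-storage
      mountPath: /checkpoints
  volumes:
  # Checkpoints are written to a PersistentVolumeClaim mounted at /checkpoints.
  - name: checkpoint-storage
    persistentVolumeClaim:
      claimName: training-pvc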

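Setting terminationGracePeriodSeconds to 300 requests up to 5 minutes for the checkpoint to flush to disk. This reduces the risk of truncation but delays resource reclamation on the node. The setting is not a guarantee under node-pressure eviction, where the kubelet can apply a shorter grace period than the Pod spec requests.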

Failure modes

The most common failure mode is a partial checkpoint file. This occurs when SIGTERM arrives during a write operation, and the application does not complete the write before SIGKILL is sent. The file system contains a file with the correct name but incorrect size. When the job restarts, the training framework attempts to load the checkpoint. The load operation fails because the file is incomplete, causing the job to restart from the last valid checkpoint or fail entirely.
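One common way to keep a partial file from ever appearing under the checkpoint's final name is to write to a temporary file and rename it into place once the data is on disk. The sketch below assumes a POSIX filesystem where a rename within one directory is atomic. `atomic_write` is a hypothetical helper name, and a real job would stream data to the file, not pass a single bytes object.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Write data to a temp file in the target directory, fsync it,
    then rename it over `path`. A kill mid-write leaves only the temp
    file behind; `path` holds either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # fsync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern, a restart after SIGKILL finds the previous checkpoint intact under the final name, plus at most a stray `.tmp-ckpt-*` file that can be deleted.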

A second failure mode involves disk pressure during the write. If eviction is triggered because nodefs.available is low, the disk may not have enough space to complete the checkpoint write. The application receives SIGTERM and attempts to write to the same disk that triggered the eviction. The write fails with a “No space left on device” error. The application may catch the error and exit, but the checkpoint is never saved. This is distinct from OOM killing; the process is terminated by the kubelet, not the kernel’s OOM killer, but the result is similar: lost progress.

A third failure mode is the “zombie checkpoint.” If the application ignores SIGTERM and continues running, the kubelet will send SIGKILL after the grace period. If the process was in the middle of a multi-file save operation (common in sharded checkpoints), some files may be written while others are not. The training framework sees a mix of valid and missing files, leading to a corrupted state that requires manual intervention to resolve.
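For sharded saves, one way to tell a complete shard set from a partial one is to write a manifest last, after every shard is on disk, and to treat a directory without a valid manifest as unusable. The sketch below illustrates that idea. The `MANIFEST.json` file name and the `write_manifest` and `is_complete` helpers are assumptions for illustration, not part of any framework's checkpoint format.

```python
import hashlib
import json
import os

MANIFEST = "MANIFEST.json"  # hypothetical completion-marker file name


def _sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory, shard_names):
    """Record a digest per shard. Call only after all shards are written."""
    entries = {name: _sha256(os.path.join(directory, name))
               for name in shard_names}
    tmp = os.path.join(directory, MANIFEST + ".tmp")
    with open(tmp, "w") as f:
        json.dump(entries, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(directory, MANIFEST))


def is_complete(directory):
    """True only if the manifest exists and every listed shard matches."""
    manifest_path = os.path.join(directory, MANIFEST)
    if not os.path.exists(manifest_path):
        return False
    with open(manifest_path) as f:
        entries = json.load(f)
    for name, digest in entries.items():
        shard = os.path.join(directory, name)
        if not os.path.exists(shard) or _sha256(shard) != digest:
            return False
    return True
```

On restart, the job can walk its checkpoint directories from newest to oldest and resume from the first one for which `is_complete` returns True, which avoids manual cleanup of half-written shard sets.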

Neither the kubelet nor SIGKILL forces written data to stable storage. Data the process has already written may still sit in the operating system’s page cache, not on the underlying block device, unless the application called fsync. If the node crashes or loses power before the kernel writes those pages back, that data is lost.

Decision frame

The tradeoff is between node reclaim speed and checkpoint integrity. Increasing terminationGracePeriodSeconds gives the application more time to flush checkpoints but delays the kubelet in reclaiming the node’s resources for other workloads. If the node is under severe disk pressure, a longer grace period may not help: when the disk is already full, the write fails regardless of how much time it has.

The question the next time a training job loses a checkpoint is not “did the node run out of memory.” It is “did the terminationGracePeriodSeconds exceed the checkpoint write time.” Check the kubelet logs for the eviction event timestamp and compare it to the application logs for the checkpoint completion timestamp. If the SIGKILL occurred before the write finished, increase the grace period or move the checkpoint volume to a faster, less congested storage class.
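The timestamp comparison described above can be scripted once the two times have been pulled from the logs. This sketch assumes both timestamps are available as ISO 8601 strings with explicit UTC offsets. `kill_deadline` and `write_beat_deadline` are hypothetical helper names, and extracting the timestamps from kubelet and application logs is left out because their formats vary.

```python
from datetime import datetime, timedelta


def kill_deadline(sigterm_at, grace_seconds):
    """Earliest moment SIGKILL can follow a SIGTERM sent at sigterm_at."""
    return sigterm_at + timedelta(seconds=grace_seconds)


def write_beat_deadline(sigterm_iso, grace_seconds, write_done_iso):
    """True if the checkpoint write finished before the SIGKILL deadline."""
    sigterm_at = datetime.fromisoformat(sigterm_iso)
    write_done = datetime.fromisoformat(write_done_iso)
    return write_done <= kill_deadline(sigterm_at, grace_seconds)
```

The `grace_seconds` argument should be the grace period the kubelet actually applied to the pod, which may differ from the value in the Pod spec.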