Model weights belong in object storage, not container images

A 140GB model weight file does not belong in a container image. Instead, an init-container pulls it into a shared emptyDir volume before the inference container starts, trading a smaller image for a longer cold start.

The common pattern of baking model weights into a Docker image fails at scale. A single 70B parameter model in FP16 precision requires 140GB of storage. Container registries and image pulls are not optimized for layers this large, and pulling such images onto every cluster node consumes bandwidth that could serve inference requests. Kubernetes init-containers fit this use case well: an init-container runs to completion before the main application starts and can write to shared volumes that the application containers then read.

The init-container pattern separates the model artifact from the inference runtime. It is not a sidecar: the init-container exits before the inference container starts. The inference container (vLLM, TGI, or Triton) is small and versioned independently of the weights. The init-container downloads weights from Hugging Face, S3, MLflow, or MinIO to an emptyDir volume mounted in both containers. This keeps the container image under 10GB while allowing any model to be served without rebuilding the image.

The tradeoff is not about correctness; it is about latency. Every new Pod incurs the full download time. A 140GB file on a 10Gbps network link takes 112 seconds to transfer at line rate (140GB × 8 bits per byte ÷ 10Gbps). The Pod cannot become Ready until the init-container exits successfully and the inference container has started.

The init-container volume mechanism

An init-container shares volumes with the main containers in the same Pod through the spec.volumes array. The emptyDir volume type creates a temporary storage area on the node’s filesystem that lives as long as the Pod does. When multiple containers mount the same emptyDir, they read and write the same files.

The following YAML shows the pattern. The init-container downloads weights from Hugging Face using the huggingface-cli tool. The vLLM inference container mounts the same volume at /model.

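apiVersion: v1
kind: Pod
metadata:
  name: vllm-llama-70b
spec:
  initContainers:
  # Runs to completion before the vllm container starts.
  - name: download-weights
    # Illustrative image name; any image that provides huggingface-cli works.
    image: huggingface/huggingface-cli:latest
    command:
    - huggingface-cli
    - download
    - meta-llama/Meta-Llama-3-70B
    - --local-dir
    - /model
    # Assumption: the repository also ships a second copy of the weights
    # under original/; skipping it keeps the download near 140GB.
    - --exclude
    - original/*
    env:
    # Assumption: a hypothetical Secret named hf-token with a key named token.
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token
          key: token
    volumeMounts:
    - name: model-storage
      mountPath: /model
  containers:
  - name: vllm
    # Assumption: the vllm/vllm-openai image published by the vLLM project.
    image: vllm/vllm-openai:latest
    command:
    - python3
    - -m
    - vllm.entrypoints.api_server
    - --model
    - /model
    volumeMounts:
    # Same volume as the init-container, so the downloaded files are visible here.
    - name: model-storage
      mountPath: /model
  volumes:
  - name: model-storage
    emptyDir:
      sizeLimit: 150Gi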

The sizeLimit field on emptyDir is optional but recommended for large downloads. It prevents a single pod from exhausting node disk capacity. When usage exceeds the sizeLimit, the kubelet evicts the Pod. The check runs periodically, so writes can briefly overshoot the limit before the eviction happens. The Pod status shows Reason: Evicted with a message that the emptyDir usage exceeds its limit, not OOMKilled. The OOMKilled reason applies only to memory limits, not volume capacity.
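Kubernetes quantities such as 150Gi use binary units, while the 140GB figure is decimal, so the two are not directly comparable. The following is a minimal sketch of the headroom check, assuming the model size is given in decimal gigabytes; the function names are illustrative and not part of any Kubernetes tooling.

```python
GB = 10**9    # decimal gigabyte, as used for the model size
GIB = 2**30   # binary gibibyte, as used by Kubernetes "Gi" quantities


def gb_to_gib(size_gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return size_gb * GB / GIB


def headroom_gib(download_gb: float, size_limit_gi: float) -> float:
    """Return the space left under an emptyDir sizeLimit, in GiB."""
    return size_limit_gi - gb_to_gib(download_gb)


if __name__ == "__main__":
    print(f"140GB is {gb_to_gib(140):.1f}GiB")
    print(f"Headroom under 150Gi: {headroom_gib(140, 150):.1f}GiB")
```

A negative result from headroom_gib means the download cannot fit under the limit and the Pod would be evicted partway through.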

The init-container restarts automatically on failure when the Pod’s restartPolicy is Always (the default) or OnFailure. The kubelet applies an exponential backoff between attempts, which kubectl reports as CrashLoopBackOff. The Pod does not require manual intervention to retry the download. A failing init-container keeps the Pod in the Pending phase, shown by kubectl as Init:Error or Init:CrashLoopBackOff, until the download succeeds or the Pod is deleted.

Bandwidth math and cold-start latency

The download time is bounded below by model size and network bandwidth. A 70B parameter model in FP16 precision requires 140GB of storage: 70 billion parameters × 2 bytes per parameter. This is the baseline; quantized models reduce the size but need an inference engine that supports the quantization format.
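The size arithmetic generalizes to other precisions. This sketch assumes a simple bytes-per-parameter table and ignores quantization metadata such as scales and zero points, so results for quantized formats are approximate; the table and function name are illustrative.

```python
# Approximate bytes per parameter for common weight formats.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_size_gb(num_params: float, dtype: str) -> float:
    """Estimate weight size in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("fp16", "int8", "int4"):
        print(f"70B in {dtype}: {weight_size_gb(70e9, dtype):.0f}GB")
```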

The following table shows download times across common network speeds. These numbers assume the full line rate with no protocol overhead, no caching, no CDN, and no retries, so real downloads take longer.

Network Speed	140GB Download Time	Pod Ready Delay
1Gbps	18 minutes 40 seconds	19+ minutes
10Gbps	1 minute 52 seconds	2+ minutes
25Gbps	45 seconds	1 minute
100Gbps	11 seconds	15 seconds
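Line-rate transfer time is the size in bits divided by the link speed. The sketch below is an idealized estimate; the efficiency parameter is an assumption added here to model protocol overhead and is not a measured value.

```python
def transfer_seconds(size_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Seconds to move size_gb decimal gigabytes over a link_gbps link.

    efficiency is the fraction of line rate actually achieved (1.0 = ideal).
    """
    gigabits = size_gb * 8
    return gigabits / (link_gbps * efficiency)


def format_duration(seconds: float) -> str:
    """Render a duration as minutes and seconds, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s" if minutes else f"{secs} s"


if __name__ == "__main__":
    for gbps in (1, 10, 25, 100):
        print(f"{gbps}Gbps: {format_duration(transfer_seconds(140, gbps))}")
```

Passing a lower efficiency, for example 0.8, gives a more conservative figure for planning.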

The scheduler places the Pod on a node with available GPU resources, and the kubelet on that node then runs the init-container. The Pod can become Ready only after the init-container exits successfully and the vLLM container starts and passes its readiness checks. On a 10Gbps link the Pod therefore spends about 2 minutes in the Init status. The vLLM container does not start during this time.

This latency compounds in horizontal scaling scenarios. Scaling from 1 to 4 replicas adds 3 Pods, and each one runs its own download. Counting the original replica, four Pods of a 140GB model on 4 nodes with 10Gbps links transfer 4×140GB = 560GB across the network in total. The cluster’s control plane does not coordinate these downloads. Each Pod acts independently.
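The aggregate cost of a rollout can be estimated the same way. This sketch assumes the worst case in which every download shares one uplink to the weight source; if each node has its own unconstrained path, the downloads run in parallel and the per-Pod time applies instead. The function name is illustrative.

```python
def rollout_transfer(replicas: int, size_gb: float, shared_link_gbps: float):
    """Return (total_gb, seconds) for replicas downloading over one shared link."""
    total_gb = replicas * size_gb
    seconds = total_gb * 8 / shared_link_gbps
    return total_gb, seconds


if __name__ == "__main__":
    total_gb, seconds = rollout_transfer(4, 140, 10)
    print(f"4 replicas: {total_gb:.0f}GB, {seconds / 60:.1f} minutes on a shared 10Gbps link")
```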

The huggingface-cli download command supports resuming interrupted downloads. Because the emptyDir volume survives container restarts within the same Pod, a retry after a failure at 50GB picks up from roughly 50GB rather than 0. This is critical for large models on unreliable networks. The --local-dir argument must be the same across retries for the resume logic to work.

Failure modes and disk quota

Exceeding the emptyDir sizeLimit gets the Pod evicted. The kubelet checks volume usage periodically and evicts the Pod once usage exceeds the configured limit. The Pod status shows Reason: Evicted with a message that the emptyDir usage exceeds its limit. This is distinct from OOMKilled, which applies only to memory limits.

Without an explicit sizeLimit, an emptyDir can grow until it exhausts the node’s ephemeral storage, or the Pod’s ephemeral-storage limit if one is set. A 140GB download can therefore fill the node’s disk, which prevents other Pods from starting and can trigger node-level eviction. By default the kubelet signals disk pressure when less than 10% of the node filesystem is free. On a 500GB node disk that threshold is 50GB free, so a 140GB download crosses it if more than 310GB is already in use.
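Whether a download pushes a node into disk pressure depends on what is already on the disk. This is a minimal sketch assuming a threshold of 10% free space on the node filesystem; the function name and the usage figures in the example are hypothetical.

```python
def triggers_disk_pressure(
    disk_gb: float,
    used_gb: float,
    download_gb: float,
    threshold_fraction: float = 0.10,
) -> bool:
    """Return True if free space after the download falls below the threshold."""
    free_after_gb = disk_gb - used_gb - download_gb
    return free_after_gb < disk_gb * threshold_fraction


if __name__ == "__main__":
    # 500GB disk, 140GB download, two hypothetical starting points.
    print(triggers_disk_pressure(500, used_gb=100, download_gb=140))  # plenty of room
    print(triggers_disk_pressure(500, used_gb=320, download_gb=140))  # below 10% free
```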

Network failures during download cause the init-container to exit with a non-zero code. The Pod enters Init:CrashLoopBackOff and retries with exponential backoff. The default backoff is 10 seconds, then 20, then 40, doubling up to a cap of 5 minutes. A network problem that persists across several attempts can therefore keep the Pod in a non-Ready state for 10+ minutes in backoff delays alone.
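The cumulative cost of repeated failures follows from the doubling schedule. This sketch assumes the 10-second base and 5-minute cap stated above and counts only the waiting time between attempts, not the time each failed attempt spends downloading.

```python
def backoff_delays(failures: int, base: int = 10, cap: int = 300) -> list[int]:
    """Delay in seconds before each retry: base, doubling, capped at cap."""
    return [min(base * 2**i, cap) for i in range(failures)]


if __name__ == "__main__":
    delays = backoff_delays(6)
    print(delays, f"total wait: {sum(delays)} s")
```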

The init-container must handle authentication for private or gated model sources. Hugging Face reads an access token from the HF_TOKEN environment variable. AWS S3 needs credentials such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or an IAM role attached to the Pod. These credentials should be injected via Kubernetes Secrets, not hardcoded in the Pod spec. The following command shows how to check the init-container status:

kubectl get pod vllm-llama-70b -o jsonpath='{.status.initContainerStatuses[0].state}'

This returns the current state of the init-container. If the state shows waiting with a reason: CrashLoopBackOff, the download is failing repeatedly. The logs show the specific error:

kubectl logs vllm-llama-70b -c download-weights

The logs will show HTTP 401 or 403 for authentication and authorization failures, HTTP 500 for server errors, or connection timeouts for network issues. The init-container does not differentiate between these failure modes; it simply exits with a non-zero code.
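The state object returned by the jsonpath query has one of three keys: waiting, running, or terminated. The following sketch turns that object into a one-line summary, assuming a kubectl version that prints the state as JSON; the function name and summary strings are illustrative.

```python
import json


def summarize_init_state(state_json: str) -> str:
    """Summarize an initContainerStatuses[].state object given as JSON text."""
    state = json.loads(state_json)
    if "running" in state:
        return "running"
    if "waiting" in state:
        reason = state["waiting"].get("reason", "unknown")
        return f"waiting: {reason}"
    if "terminated" in state:
        exit_code = state["terminated"].get("exitCode")
        return "completed" if exit_code == 0 else f"failed: exit code {exit_code}"
    return "unknown"


if __name__ == "__main__":
    sample = '{"waiting": {"reason": "CrashLoopBackOff"}}'
    print(summarize_init_state(sample))
```

A summary of "waiting: CrashLoopBackOff" is the cue to read the init-container logs for the underlying error.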

The emptyDir volume is deleted when the Pod is deleted. This means every new Pod must re-download the weights. The kubelet does not cache the volume across Pod lifecycles. If the Pod is evicted due to node pressure, the weights are lost. This is the primary operational cost of the pattern.

Decision frame

The question the next time a model-weight Pod stays in Init state is not ‘is the init-container broken.’ It is ‘is the download failing, or did it exceed the emptyDir sizeLimit or the node’s disk capacity.’ The CrashLoopBackOff status does not distinguish between failure causes. Check the Pod events with kubectl describe pod vllm-llama-70b for an Evicted reason that names the emptyDir limit or node disk pressure, and check the init-container logs for network and authentication errors.

The tradeoff is fixed: model weights in object storage reduce image size but add cold-start latency proportional to model size and inversely proportional to network bandwidth. If the cluster serves 100 replicas of a 140GB model, a full rollout transfers 14TB, which would saturate a shared 10Gbps link for roughly 3 hours.