KV-cache memory allocation for LLM inference is a deterministic arithmetic problem defined by model architecture, context length, batch size, and cache precision, not a heuristic guess.
The KV-cache stores the key and value attention states for every token processed during generation. In systems like vLLM, this cache is pinned to GPU VRAM and persists across the request lifecycle. Unlike model weights, which are static after loading, the KV-cache grows linearly with the sequence length and the number of concurrent requests. Misallocating this memory leads to immediate CUDA OOM errors or silent performance degradation as the system swaps to CPU memory.
Operators often treat VRAM as a single pool. In reality, the GPU memory budget splits into three distinct buckets: model weights, KV-cache, and CUDA context overhead. The weights of a 70B-parameter model in FP16 consume 140GB (70 billion parameters × 2 bytes). On a GPU with 80GB of VRAM, the weights alone exceed capacity, so the model requires quantization or tensor parallelism. On a GPU with sufficient capacity, the space left after the weights load must be partitioned between the KV-cache and runtime overhead. In vLLM, that partition is fixed when the engine starts, based on its configuration, and does not adapt to the workload afterward.
The memory arithmetic
The KV-cache size is calculated from the model's internal dimensions. The cache stores a key vector and a value vector for every token in the context window, in every layer, so the total is the per-token footprint multiplied by the layer count, the sequence length, and the batch size.
The calculation follows this structure: 2 × num_layers × num_kv_heads × head_dim × context_length × batch_size × dtype_bytes. The factor of 2 accounts for the separate key and value tensors. num_layers is the number of transformer layers. num_kv_heads is the number of query heads divided by the group size under Grouped Query Attention (GQA); without GQA it equals the number of query heads. head_dim is typically 128 for modern models. dtype_bytes is 2 for FP16 or 1 for FP8.
For a Llama-2-70B model, the architecture specifies 80 layers, 8 KV heads, and a head dimension of 128. At FP16 precision, each token per layer requires 4KB of KV-cache memory. This is derived from 2 × 8 × 128 × 2. A single 8K context sequence consumes 32MB per layer. Across 80 layers, a single sequence requires 2.5GB of VRAM.
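The arithmetic above can be checked with a short script. This is a minimal sketch rather than vLLM code: the helper name `kv_cache_bytes` and its parameter names are illustrative, and the architecture values are the ones quoted in the text.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_length, batch_size=1, dtype_bytes=2):
    """Return the KV-cache size in bytes (illustrative helper, not a vLLM API)."""
    # Factor of 2: one key vector and one value vector per token, per layer.
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes
    return per_token_per_layer * num_layers * context_length * batch_size


# Llama-2-70B values as quoted in the text: 80 layers, 8 KV heads, head_dim 128.
ARCH = {"num_layers": 80, "num_kv_heads": 8, "head_dim": 128}

per_token_per_layer = kv_cache_bytes(1, 8, 128, context_length=1)
per_layer_8k = kv_cache_bytes(1, 8, 128, context_length=8192)
per_sequence_8k = kv_cache_bytes(**ARCH, context_length=8192)

print(f"per token, per layer:    {per_token_per_layer / 1024:.0f} KiB")
print(f"8K sequence, per layer:  {per_layer_8k / 1024**2:.0f} MiB")
print(f"8K sequence, all layers: {per_sequence_8k / 1024**3:.1f} GiB")
```

The script uses binary units (1 KiB = 1024 bytes), which is how the 4KB, 32MB, and 2.5GB figures in the text are obtained.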
The following table shows the total KV-cache requirement for a 70B model across different batch sizes and context lengths at FP16.
| Batch Size | Context Length | KV-cache Memory |
|---|---|---|
| 8 | 4K | 10GB |
| 8 | 8K | 20GB |
| 12 | 8K | 30GB |
| 16 | 8K | 40GB |
A 32GB KV-cache budget supports a batch size of 12 at 8K context, since 32 / 2.5 = 12.8 and a partial sequence does not count. Increasing the batch to 16 pushes the requirement to 40GB. These figures exclude the model weights, which take 140GB for a 70B model in FP16. On an 80GB GPU, the weights must be quantized to FP8 (70GB) or the model must be sharded across multiple GPUs. The KV-cache budget is the residual memory after weights are loaded.
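The table rows and the 32GB example can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text (80 layers, 8 KV heads, head dimension 128, FP16); the function names are illustrative.

```python
GIB = 1024**3
# 2 (key and value) x 8 KV heads x 128 head_dim x 2 bytes (FP16) x 80 layers
BYTES_PER_TOKEN = 2 * 8 * 128 * 2 * 80


def kv_cache_gib(batch_size, context_length):
    """KV-cache requirement in GiB for a full batch at the given context length."""
    return batch_size * context_length * BYTES_PER_TOKEN / GIB


def max_batch_size(budget_gib, context_length):
    """Largest number of full-length sequences that fit in the budget."""
    per_sequence_bytes = context_length * BYTES_PER_TOKEN
    return int(budget_gib * GIB) // per_sequence_bytes


rows = [(8, 4096), (8, 8192), (12, 8192), (16, 8192)]
table = [(batch, ctx, kv_cache_gib(batch, ctx)) for batch, ctx in rows]
for batch, ctx, gib in table:
    print(f"batch {batch:>2} | context {ctx} | {gib:.0f} GiB")

print("max batch in 32 GiB at 8K:", max_batch_size(32, 8192))
```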
The vLLM configuration
vLLM manages this memory through the gpu_memory_utilization argument. This flag sets the fraction of total GPU memory allocated for the engine, leaving the remainder for the CUDA context and system overhead. The default value is 0.9.
When gpu_memory_utilization is set to 0.9 on an 80GB GPU, vLLM reserves 72GB. If the model weights consume 70GB (quantized), at most 2GB remains for the KV-cache, and less once activation memory is accounted for. That is below the 2.5GB a single 8K sequence needs, so this configuration cannot serve even one full-length request. The system must be tuned to balance the weights and the cache.
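The residual budget can be estimated before launching the server. The sketch below is a simplification and not vLLM's actual profiling logic: the function name and the `activation_gb` parameter are assumptions introduced for illustration, and the 2.5GB per-sequence figure comes from the arithmetic earlier in the text.

```python
import math


def residual_kv_budget_gb(total_vram_gb, gpu_memory_utilization,
                          weights_gb, activation_gb=0.0):
    """Estimate the memory left for the KV-cache (illustrative, simplified)."""
    engine_budget_gb = total_vram_gb * gpu_memory_utilization
    # activation_gb is a placeholder for peak activation memory (assumed 0 here).
    return engine_budget_gb - weights_gb - activation_gb


PER_SEQUENCE_8K_GB = 2.5  # one 8K sequence for the 70B model at FP16

budget_gb = residual_kv_budget_gb(80, 0.9, weights_gb=70)
full_sequences = max(0, math.floor(budget_gb / PER_SEQUENCE_8K_GB))

print(f"KV-cache budget: {budget_gb:.1f} GB")
print(f"full 8K sequences that fit: {full_sequences}")
```

Running the same function with a lower `weights_gb` or a larger `total_vram_gb` shows how quickly the residual grows once the weights stop dominating the budget.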
The configuration is passed as a command-line argument to the vLLM server.
python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192
The --max-model-len argument defines the maximum sequence length, prompt plus generated tokens, that a single request may reach. This is a hard limit. It does not reserve a full-length cache slot for every request, because cache blocks are assigned as each sequence grows. The KV-cache must still be large enough to hold at least one sequence of this length, otherwise the engine refuses to start, so lowering the value is a common remedy when the cache budget is tight.
vLLM uses PagedAttention to manage this memory efficiently. Instead of one contiguous region per sequence, the cache is divided into fixed-size blocks (pages) that are assigned to sequences on demand. This supports variable sequence lengths while limiting waste to the partially filled last block of each sequence. However, the total number of blocks is still limited by the gpu_memory_utilization setting.
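The effect of paging on concurrency can be sketched with block arithmetic. The block size of 16 tokens is an assumption for illustration (vLLM exposes it through --block-size), the 32GB budget is the example figure used earlier, and the function names are hypothetical.

```python
import math

GIB = 1024**3
BLOCK_SIZE_TOKENS = 16                   # assumed block size, in tokens
BYTES_PER_TOKEN = 2 * 8 * 128 * 2 * 80   # 70B model at FP16, all 80 layers


def total_blocks(cache_budget_gib):
    """Number of whole cache blocks that fit in the budget."""
    block_bytes = BLOCK_SIZE_TOKENS * BYTES_PER_TOKEN
    return int(cache_budget_gib * GIB) // block_bytes


def blocks_for(sequence_tokens):
    """Blocks needed by one sequence; only the last block can be partly empty."""
    return math.ceil(sequence_tokens / BLOCK_SIZE_TOKENS)


pool = total_blocks(32)
full_length = pool // blocks_for(8192)   # every request uses the full 8K
short = pool // blocks_for(1000)         # every request stays at 1000 tokens
wasted_slots = blocks_for(1000) * BLOCK_SIZE_TOKENS - 1000

print(f"blocks in pool: {pool}")
print(f"concurrent 8K sequences: {full_length}")
print(f"concurrent 1000-token sequences: {short}")
print(f"unused token slots per 1000-token sequence: {wasted_slots}")
```

Because blocks are assigned per sequence as needed, the same pool that holds 12 full-length sequences holds far more short ones.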
Failure modes and symptoms
The most visible failure mode is the CUDA Out Of Memory (OOM) error. Because vLLM preallocates the KV-cache, this typically occurs at startup, when the weights and activation memory leave no room for the cache, or during serving, when allocations outside the preallocated cache exceed the remaining headroom. The error appears in the vLLM logs as CUDA error: out of memory or RuntimeError: CUDA out of memory.
A subtler failure mode is request preemption. When the cache is full, vLLM may preempt running requests to free blocks for others, then resume them later by recomputing their cache or swapping it back in from CPU memory. This increases latency and reduces throughput. The system does not fail, but the quality of service degrades. Operators often mistake this for network latency or model slowness.
Monitoring memory usage requires access to GPU metrics. The NVIDIA DCGM exporter exposes metrics like DCGM_FI_DEV_FB_USED (DCGM field ID 252). This metric reports the framebuffer memory in use, in MiB.
kubectl exec -n monitoring dcgm-exporter-0 -- dcgmi dmon -e 252 -c 1
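Because vLLM preallocates its budget, DCGM_FI_DEV_FB_USED normally sits close to gpu_memory_utilization times the total VRAM. If it is consistently near the full VRAM capacity, gpu_memory_utilization is too high. The system has no room for the CUDA context or other processes on the GPU, and any allocation outside vLLM's budget can fail with an out-of-memory error.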
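A reading can be turned into a utilization fraction and a headroom figure with simple arithmetic. This sketch assumes the values are reported in MiB, as the DCGM exporter's metric description states; the function name and the sample readings are hypothetical and not taken from a real system.

```python
MIB_PER_GIB = 1024


def fb_usage(fb_used_mib, fb_total_mib):
    """Return (fraction of VRAM in use, free headroom in GiB)."""
    fraction = fb_used_mib / fb_total_mib
    headroom_gib = (fb_total_mib - fb_used_mib) / MIB_PER_GIB
    return fraction, headroom_gib


# Hypothetical readings for an 80 GiB card with gpu_memory_utilization=0.9.
fraction, headroom_gib = fb_usage(fb_used_mib=73728, fb_total_mib=81920)

print(f"framebuffer in use: {fraction:.0%}")
print(f"headroom outside the engine budget: {headroom_gib:.1f} GiB")
```

A fraction that tracks the configured gpu_memory_utilization is expected; a fraction approaching 100% means nothing is left outside the engine budget.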
Conversely, if the KV-cache is under-provisioned, the system cannot serve the requested concurrency. The vllm:num_requests_waiting metric will show requests queuing indefinitely. This indicates that the --max-num-seqs parameter is too low for the workload, or the gpu_memory_utilization leaves insufficient space for the cache.
Decision frame
The next time an LLM pod fails to start or exhibits high latency, check the memory partition before checking the network. The tradeoff is between model precision and concurrency. Quantizing the weights of a 70B model from FP16 to FP8 frees 70GB of VRAM in total, or 35GB per GPU at a tensor parallel size of 2. At 2.5GB per sequence, that 70GB supports an additional 28 concurrent 8K sequences. The choice is not about speed, but about the cost of the GPU relative to the required throughput. Allocate the weights first, then calculate the residual cache budget. If the residual budget cannot support the target batch size, increase the GPU count, quantize the weights, or reduce the context length.
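The decision procedure, weights first and then the residual cache budget against a target batch size, can be sketched as follows. The `plan` function is hypothetical. It treats two 80GB GPUs as one 160GB pool, which assumes weights and KV-cache shard evenly under tensor parallelism and ignores activation memory and per-GPU overhead.

```python
import math


def plan(total_vram_gb, gpu_memory_utilization, weights_gb,
         per_sequence_gb, target_batch):
    """Check whether the residual cache budget supports the target batch size."""
    residual_gb = total_vram_gb * gpu_memory_utilization - weights_gb
    max_batch = max(0, math.floor(residual_gb / per_sequence_gb))
    return {
        "residual_gb": residual_gb,
        "max_batch": max_batch,
        "fits": max_batch >= target_batch,
    }


# 2 x 80GB GPUs, 2.5GB per 8K sequence, target of 16 concurrent sequences.
fp16 = plan(160, 0.9, weights_gb=140, per_sequence_gb=2.5, target_batch=16)
fp8 = plan(160, 0.9, weights_gb=70, per_sequence_gb=2.5, target_batch=16)

for name, result in (("FP16", fp16), ("FP8", fp8)):
    print(f"{name}: residual {result['residual_gb']:.0f} GB, "
          f"max batch {result['max_batch']}, fits target: {result['fits']}")
```

When `fits` is false, the remaining levers are the ones named above: more GPUs, lower-precision weights, or a shorter context length.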