What it takes to stream LLM responses through Kubernetes ingress

Streaming LLM responses through Kubernetes ingress fails silently unless the ingress controller disables response buffering and extends read timeouts.

Large language models generate text token by token, sending data to the client incrementally over Server-Sent Events or chunked transfer encoding. The Kubernetes ingress controller sits between the client and the inference backend, such as vLLM or TGI. By default, ingress controllers buffer the entire response before forwarding it to the client to optimize throughput and protect against slow upstream connections. This buffering behavior breaks the streaming contract, causing the client to wait for the full generation before receiving the first token.

Platform engineers often assume the backend is responsible for streaming configuration. In reality, the network path between the client and the backend introduces a new layer of state management. The ingress controller must be configured to pass through data as it arrives, rather than waiting for a complete payload. This requires specific annotations for NGINX-based controllers and protocol options for Envoy-based proxies. Without these settings, the latency of an LLM generation becomes the sum of the generation time plus the network round-trip time, negating the interactive value of streaming.

The NGINX Ingress Controller buffering mechanism

The NGINX Ingress Controller buffers upstream responses by default to handle slow clients and upstreams independently. When a pod sends a response, NGINX writes the data into memory or disk buffers before forwarding it to the client connection. The default buffer size is often 4KB or 8KB, depending on the controller version and configuration. For LLM workloads, this buffer fills up quickly if the backend sends data in small chunks, but the controller may still hold the data until the buffer is full or the upstream connection closes.

To disable this behavior, the ingress resource must include the nginx.ingress.kubernetes.io/proxy-buffering annotation set to "off". This forces NGINX to stream data directly from the upstream socket to the client socket without intermediate storage. Additionally, the nginx.ingress.kubernetes.io/proxy-read-timeout annotation must be set to a high value, such as "3600", to prevent the connection from timing out during long generations. The default timeout is often 60 seconds, which is insufficient for models generating thousands of tokens.

The following YAML demonstrates the required configuration for an Ingress resource pointing to a vLLM service. The annotations ensure that tokens are forwarded immediately and the connection remains open.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-streaming
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /v1/chat/completions
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000

Without the proxy-buffering: "off" annotation, the client observes a delay equal to the time required to generate the full response. The proxy-read-timeout annotation prevents the ingress controller from closing the connection prematurely if the backend takes longer than the default 60 seconds to complete the generation. This configuration aligns the network layer with the token-by-token production rate of the model.

The Envoy Proxy HTTP/2 path

Envoy Proxy handles HTTP/2 streaming differently than NGINX. Envoy relies on http2_protocol_options to manage the behavior of HTTP/2 connections, including chunked transfer encoding. By default, Envoy may buffer responses to optimize for throughput, similar to NGINX. To enable streaming, the configuration must allow chunked length and ensure that timeouts do not interrupt the stream.

In a Gateway API implementation, the EnvoyProxy CRD or the HTTPRoute spec can configure these options. The allow_chunked_length option in http2_protocol_options permits Envoy to forward data without waiting for the content length to be known. This is critical for LLMs where the total token count is unknown at the start of the request. The read timeout must also be extended, typically via the timeout field in the HTTPRoute or the EnvoyProxy configuration.

The following table compares the critical configuration fields for NGINX and Envoy to enable streaming.

Component	Configuration Field	Required Value	Purpose
NGINX Ingress	`nginx.ingress.kubernetes.io/proxy-buffering`	`"off"`	Disables response buffering
NGINX Ingress	`nginx.ingress.kubernetes.io/proxy-read-timeout`	`"3600"`	Prevents timeout during generation
Envoy Proxy	`http2_protocol_options.allow_chunked_length`	`true`	Allows chunked transfer encoding
Envoy Proxy	`timeout`	`"3600s"`	Prevents connection close

Envoy configurations are often more verbose than NGINX annotations because they are defined in the control plane’s CRDs rather than the Ingress resource itself. For example, a Cilium EnvoyProxy configuration or a Gloo Proxy resource would include these settings. The key distinction is that Envoy does not have a simple annotation switch; the configuration must be applied at the listener or route level. This requires coordination between the platform team managing the gateway and the application team deploying the model.

Failure modes and symptoms

The primary symptom of a misconfigured ingress is a client that waits for the entire response before displaying any tokens. When a user sends a request to a chat endpoint, the UI shows a loading state for the duration of the generation. Once the generation completes, the full text appears instantly. This behavior is indistinguishable from a slow backend on the surface, but the latency profile reveals the cause.

To diagnose this, use curl with the -N flag to disable output buffering on the client side. Run the following command against the ingress endpoint while monitoring the time between the first and last token.

curl -N http://ingress.example.com/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama-3", "prompt": "Hello", "stream": true}'

If the ingress is buffering, the curl command returns no data until the server closes the connection. The total time matches the backend generation time. If the ingress is configured correctly, curl outputs tokens as they arrive, and the total time is slightly longer than the backend generation time due to network latency.

Another failure mode is the 504 Gateway Timeout error. This occurs when the proxy-read-timeout is too low. The backend continues generating tokens, but the ingress controller closes the connection because the read timeout has expired. The client receives a partial response or an error message. This is common with default NGINX timeouts of 60 seconds when generating long contexts. The event logs for the ingress controller pods will show upstream timed out or connection reset by peer errors.

Decision frame

The next time a streaming LLM endpoint appears slow, the question is not whether the backend is under-provisioned, but whether the ingress controller is buffering the response. The tradeoff is between buffering for throughput and streaming for latency. Disabling buffering increases memory pressure on the ingress controller for large responses but enables real-time token delivery. The configuration must be explicit: proxy-buffering: "off" for NGINX and allow_chunked_length for Envoy. If the tokens do not arrive incrementally, check the ingress configuration before scaling the model.