KServe introduces an InferenceService CRD that manages model loading and traffic routing, while a plain Deployment manages only container replicas.
The standard Kubernetes Deployment controller ensures a specified number of Pod replicas are running and healthy. It does not understand the concept of a machine learning model. It does not know how to route traffic between a new model version and an old one, nor does it natively handle the latency of loading a multi-gigabyte model into GPU memory. KServe sits on top of Kubernetes and Knative Serving to abstract these concerns. It defines a higher-level abstraction where the user specifies the model artifact and the serving protocol, and the system handles the underlying infrastructure lifecycle.
This abstraction creates a distinct operational boundary. A Deployment describes interchangeable container replicas and knows nothing about what runs inside them. An InferenceService describes a model-serving endpoint, including which model artifact is loaded and which revision receives traffic. The InferenceService spec defines logical components: a required predictor and an optional transformer and explainer. KServe translates these logical roles into the necessary Kubernetes resources, including Knative Services and underlying Deployments. This translation layer is where both the value proposition and the operational cost reside.
The InferenceService manifest
The InferenceService resource defines the inference endpoint. The spec separates the model configuration from the infrastructure configuration. A minimal predictor configuration specifies the model name, the storage URI, and the framework.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sklearn-iris
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"

This manifest triggers the creation of a Knative Service when KServe runs in its Knative-based serverless mode. A Knative Service is different from a Kubernetes Service: it manages traffic routing and revisioning rather than exposing a stable virtual IP. When the InferenceService is applied, KServe reconciles the spec and generates a Knative Service for the predictor. The Knative Service creates a Configuration and a Route. The Configuration creates a Revision, which is an immutable snapshot of the serving configuration, and the Route decides which Revisions receive traffic. The Revision creates the underlying Kubernetes Deployment and Service objects that run the actual container.
The relationship is hierarchical. The InferenceService owns the Knative Service. The Knative Service owns the Configuration and the Route. The Configuration owns its Revisions. Each Revision owns a Deployment. This chain of ownership allows KServe to manage traffic splits and rollbacks without modifying the underlying Pod spec directly. Traffic is shifted by changing the percentages the Route assigns to each Revision, not by changing replica counts in the Deployment.
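A traffic split can be sketched with the `canaryTrafficPercent` field of the predictor spec. The manifest below is a minimal illustration, not a recommended rollout policy: it assumes the `sklearn-iris` service from earlier is already deployed, and the second model path is an assumed location for a new model version.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    # Route 10% of requests to the newest Revision. The remaining 90%
    # stays on the previously rolled-out Revision.
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      # Assumed path for the new model version (illustrative only).
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
```

Changing `storageUri` produces a new Revision, and the percentage controls how much traffic it receives. Raising the value to 100 promotes the new Revision. Setting it back to 0 returns all traffic to the previous one without touching any Deployment directly.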
Scaling and lifecycle mechanics
Knative Serving implements scale-to-zero logic that differs from the Kubernetes Horizontal Pod Autoscaler (HPA). HPA typically scales on CPU or memory utilization or on custom metrics. The Knative autoscaler scales on in-flight request concurrency by default, with requests per second as an alternative metric. When a Revision has received no traffic for a full observation window, Knative scales it down to zero replicas.
A plain Deployment with spec.replicas: 0 remains at zero until a user or controller changes the spec. HPA accepts minReplicas: 0 only when the HPAScaleToZero feature gate is enabled and at least one Object or External metric is configured, and it has no mechanism to hold an incoming request while a Pod starts. Knative scales to zero once the Revision has been idle for the stable window (60 seconds by default) plus a short grace period, unless a minimum scale of one or more is configured.
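The scale-to-zero knobs live in two places. The fragment below is a sketch under stated assumptions: the first document shows the cluster-wide Knative autoscaler settings with their default values, and the second opts a single InferenceService into scale-to-zero. The ConfigMap is shown for reference, not as something to apply verbatim, since a real cluster's `config-autoscaler` usually carries additional keys.

```yaml
# Cluster-wide Knative autoscaler settings (values shown are the defaults).
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"
  stable-window: "60s"               # idle observation window before scaling down
  scale-to-zero-grace-period: "30s"  # upper bound on how long the last Pod is kept
---
# Per-endpoint opt-in: allow this predictor to drop to zero replicas.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

Setting `minReplicas: 1` on the predictor disables scale-to-zero for that endpoint while leaving the cluster-wide behavior unchanged.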
The following table compares the scaling behavior of a standard Deployment versus a KServe-managed InferenceService.
| Feature | Kubernetes Deployment | KServe InferenceService |
|---|---|---|
| Scaling Trigger | CPU, memory, or custom metrics (via HPA) | Request concurrency (default) or requests per second |
| Scale-to-Zero | Manual, or HPA minReplicas: 0 behind a feature gate | Automatic after an idle window |
| Traffic Routing | Load balancing across Pods via Service IP | Percentage-based splitting across Knative Revisions |
| Model Loading | On container start (cold start) | On Pod start after Revision scale-up (cold start) |
| Resource Overhead | Low (built-in Deployment controller) | Higher (KServe and Knative controllers, plus a queue-proxy sidecar per Pod) |
The overhead comes from the additional controllers. KServe runs a controller manager that watches InferenceService objects. Knative runs controllers that watch Service, Configuration, and Revision objects. For a cluster with hundreds of inference endpoints, this control plane load is measurable. However, the resource savings on idle GPU time often outweigh the control plane cost.
Failure modes and cold starts
The most significant failure mode for KServe is the GPU cold start. When a Knative Revision scales from zero to one, the container starts and the model loads into memory. For a large language model or a deep learning model, this initialization is not instantaneous.
On GPU nodes, the model weights must be loaded into VRAM. This process involves memory allocation, data transfer from disk, and framework initialization. A 70B parameter model in FP16 requires approximately 140GB of VRAM for the weights alone, before any KV cache or activation memory. Loading this data can take 30 to 90 seconds depending on the storage backend and GPU bandwidth, and longer if the artifact must first be downloaded from remote storage. During this time, the Pod is not ready to serve traffic.
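The figures above can be checked with a short calculation. FP16 stores each parameter in 2 bytes, and the quoted 30 to 90 second range implies a sustained end-to-end throughput that the storage path has to deliver. This is a back-of-the-envelope sketch that ignores framework initialization time:

```latex
\begin{align*}
\text{weights} &= 70 \times 10^{9}\ \text{params} \times 2\ \tfrac{\text{bytes}}{\text{param}} = 140\ \text{GB} \\[4pt]
\text{throughput for a 30 s load} &= \frac{140\ \text{GB}}{30\ \text{s}} \approx 4.7\ \text{GB/s} \\[4pt]
\text{throughput for a 90 s load} &= \frac{140\ \text{GB}}{90\ \text{s}} \approx 1.6\ \text{GB/s}
\end{align*}
```

A storage backend that sustains less than about 1.6 GB/s will therefore exceed the 90 second upper bound for a model of this size.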
The Pod must pass its readiness probe before it receives traffic. When a Revision is scaled to zero, the Knative activator holds incoming requests and forwards them once a Pod becomes ready, so the request is not rejected outright. The client still waits for the entire cold start. If model loading takes longer than the client timeout or the request timeout configured for the Revision, the first request after a scale-up fails. Clients should set timeouts that cover the cold start and retry with exponential backoff.
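On the server side, one way to reduce first-request failures is to give requests enough time to outlast the cold start. The sketch below uses the predictor's `timeout` field, which is expressed in seconds. The value of 300 is an illustrative assumption, not a recommendation, and the cluster's Knative configuration may cap the maximum allowed timeout.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    # Seconds a request may wait before it is timed out. Chosen here to
    # exceed the expected cold start (assumed value).
    timeout: 300
    # Scale-to-zero stays enabled, so the first request pays the cold start.
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

The client-side timeout must be at least as long as this value, otherwise the client gives up while the request is still being held for the starting Pod.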
Another failure mode involves resource contention. When multiple InferenceServices scale up simultaneously on a shared node pool, they compete for GPUs, which Kubernetes allocates as whole devices. If a node has 4×A100 GPUs and two 70B models each request all four, the second Pod stays Pending with a FailedScheduling event that reports Insufficient nvidia.com/gpu. KServe does not manage node-level resource quotas across different InferenceService objects. The cluster autoscaler or node pool configuration must handle the capacity planning.
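The GPU request that drives this scheduling behavior is declared on the predictor like any other container resource. The manifest below is a hypothetical sketch: the service name, model format, and storage URI are placeholders invented for illustration, and the GPU count assumes the model is sharded across all four devices on the node.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-70b                  # hypothetical name
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 1               # cap replicas so one endpoint cannot claim a second node
    model:
      modelFormat:
        name: huggingface        # assumed format; depends on the installed runtimes
      storageUri: "pvc://models/llm-70b"   # hypothetical PVC name and path
      resources:
        limits:
          nvidia.com/gpu: "4"    # whole GPUs; the scheduler cannot split a device
```

Two services with this spec cannot run on the same four-GPU node at once. The scheduler places the first Pod and leaves the second Pending until capacity is freed or added.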
Knative also enforces a maximum concurrency per Pod. If the concurrency limit is set too low, the system scales up more Pods than necessary. If it is set too high, the model server may become unresponsive under load. Replica bounds are set in the InferenceService spec under spec.predictor.minReplicas and maxReplicas, which KServe maps to Knative's minimum and maximum scale. The hard per-Pod limit is spec.predictor.containerConcurrency, which KServe passes to the Knative Service, while scaleMetric and scaleTarget set the soft target the autoscaler aims for.
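The interaction between the soft target and the hard limit can be sketched as follows. All numeric values are illustrative assumptions and should be replaced with figures from load testing the actual model server.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1             # keep one warm replica
    maxReplicas: 4             # upper bound on scale-out
    scaleMetric: concurrency   # scale on in-flight requests
    scaleTarget: 4             # soft target: desired in-flight requests per Pod
    containerConcurrency: 8    # hard limit: requests beyond this are queued, not delivered
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```

Keeping the soft target below the hard limit gives the autoscaler room to add Pods before any single Pod reaches the point where requests start to queue.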
Decision frame
The choice between KServe and a plain Deployment is not about features but about operational overhead versus resource efficiency. A plain Deployment is sufficient when the inference endpoint is always hot and the model is small. It involves fewer controllers, and it has no scale-from-zero latency because a replica is always running.
The tradeoff is in the idle cost. If the endpoint serves traffic intermittently, a Deployment with replicas: 1 burns GPU resources 24/7. KServe scales to zero, saving the cost of the GPU while the endpoint is idle. The operator must decide whether the 30 to 90 second cold start penalty is acceptable for the target use case. For batch processing or internal tools, the wait is tolerable. For real-time user-facing APIs, the latency spike is a user experience failure.
The decision is whether the cluster can absorb the control plane complexity of Knative to save on GPU idle time. If the platform team cannot manage the additional CRDs and scaling logic, the operational risk of KServe outweighs the cost savings. If the GPU budget is the primary constraint and the user experience tolerates latency, KServe is the mechanism that enforces the scale-to-zero policy.