NVIDIA’s GPU sharing modes expose a critical distinction: only Multi-Instance GPU provides hardware memory isolation for Kubernetes workloads.

The nvidia-device-plugin supports three distinct mechanisms that let multiple pods claim a single physical card. Time-slicing lets processes from different pods take turns on the GPU, each in its own CUDA context. Multi-Process Service (MPS) allows concurrent kernel execution but draws on one shared pool of device memory. Multi-Instance GPU (MIG) partitions the physical card into separate hardware instances with dedicated memory and compute units. Time-sliced and MPS replicas are advertised as nvidia.com/gpu, and so are MIG instances under the plugin's single strategy (under the mixed strategy they appear as nvidia.com/mig-<profile>), so the resource name alone does not reveal the isolation level. Only MIG enforces hard boundaries between tenants.

Platform engineers often select a sharing mode based on utilization targets rather than isolation requirements. This creates a risk: a noisy neighbor in a time-sliced pod can exhaust device memory and cause out-of-memory failures in another pod on the same card. Understanding the memory architecture of each mode is necessary to defend against resource contention failures.

The mechanism of GPU partitioning

Time-slicing is opt-in: by default the nvidia-device-plugin gives each pod exclusive access to a GPU. When the plugin's sharing configuration sets a replica count for nvidia.com/gpu, each physical card is advertised that many times. If Pod A requests 1 GPU and Pod B requests 1 GPU, both can then schedule onto a single physical card. The plugin only hands out replicas; it does not track memory or compute usage. The GPU driver switches between the processes' CUDA contexts, so their kernels run in turn rather than concurrently.

MPS operates at the process level. It allows multiple processes to submit kernels to the GPU concurrently. An MPS control daemon, not the nvidia-container-runtime, brokers GPU access for the client processes. In Kubernetes, recent versions of the nvidia-device-plugin enable MPS through the plugin's sharing configuration, which again advertises replicas of nvidia.com/gpu. All clients draw on the same physical device memory with no hardware partition between them, meaning one process can allocate memory that interferes with another.

MIG is hardware-level partitioning available on data-center GPUs such as the A100 and H100. The physical card is sliced into instances. Each instance has its own memory, scheduler, and compute units. The nvidia-mig-manager or mig-parted component configures these slices. Under the device plugin's mixed strategy, the pod requests a specific instance profile, such as nvidia.com/mig-1g.5gb, rather than a generic GPU.

The following table compares the isolation guarantees and resource exposure for each mode.

| Mode | Memory Isolation | Compute Isolation | Resource Name | Failure Symptom |
| --- | --- | --- | --- | --- |
| Time-Slicing | None | None | nvidia.com/gpu | CUDA out-of-memory errors, context switch latency |
| MPS | None | Partial | nvidia.com/gpu | Faults spreading between clients, debug difficulty |
| MIG | Hardware | Hardware | nvidia.com/mig-<profile> | Profile mismatch, fragmentation |

A pod request under time-slicing looks like any ordinary GPU request.

resources:
  limits:
    nvidia.com/gpu: 1

A MIG pod request specifies the instance profile.

resources:
  limits:
    nvidia.com/mig-1g.5gb: 1

The scheduler sees these as different resource types. A node with an A100 40GB card can expose up to seven 1g.5gb instances or one full nvidia.com/gpu. The nvidia-device-plugin advertises allocatable capacity based on the MIG instances that have actually been created on the node.
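The capacity arithmetic can be sketched in a few lines. This is an illustrative model, not the plugin's code: it assumes an A100 40GB with 7 compute slices and 8 memory slices of 5GB each, the mixed MIG strategy for resource naming, and a hypothetical `time_slicing_replicas` parameter standing in for a configured replica count. It ignores MIG placement rules, which can be stricter than the slice counts suggest.

```python
# Illustrative model only; slice counts assume an A100 40GB.
A100_40GB = {"compute_slices": 7, "memory_slices": 8}  # memory slices are 5 GB each

# (compute slices, memory slices) consumed by one instance of each profile.
PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}


def max_instances(profile, card=A100_40GB):
    """Upper bound on instances of one profile, ignoring placement rules."""
    compute, memory = PROFILES[profile]
    return min(card["compute_slices"] // compute, card["memory_slices"] // memory)


def advertised_resources(mig_layout=None, time_slicing_replicas=1, physical_gpus=1):
    """Resources a node would advertise (hypothetical helper, mixed strategy assumed)."""
    if mig_layout:
        return {f"nvidia.com/mig-{p}": n for p, n in mig_layout.items()}
    return {"nvidia.com/gpu": physical_gpus * time_slicing_replicas}


print(max_instances("1g.5gb"))                        # 7
print(advertised_resources({"1g.5gb": 7}))            # {'nvidia.com/mig-1g.5gb': 7}
print(advertised_resources(time_slicing_replicas=4))  # {'nvidia.com/gpu': 4}
```

The two return shapes are the point: a MIG node and a time-sliced node advertise different resource names, so a pod written for one will not match the other.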

Memory math and failure modes

Memory isolation is the primary differentiator. In time-slicing, the driver enforces no memory limits between processes. If Pod A allocates 30GB on a 40GB card, only 10GB is left for Pod B, and any allocation beyond that fails with a CUDA out-of-memory error even though Pod B was scheduled normally. The kubelet does not see this as a resource violation because the pod requested nvidia.com/gpu, which the node advertised. The failure happens inside the driver, outside Kubernetes resource accounting.
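A toy model makes the arithmetic concrete. The class below is a hypothetical stand-in for a time-sliced card, not a driver API: it keeps a single pool with no per-pod limit, which is the behavior described above.

```python
class SharedGpuMemory:
    """Toy model of a time-sliced card: one memory pool, no per-pod limit."""

    def __init__(self, total_gb):
        self.total_gb = total_gb
        self.used_gb = {}  # pod name -> GB currently held

    def free_gb(self):
        return self.total_gb - sum(self.used_gb.values())

    def allocate(self, pod, gb):
        # The only check is against what is left on the whole card.
        if gb > self.free_gb():
            raise MemoryError(
                f"{pod}: out of memory ({gb} GB requested, {self.free_gb()} GB free)"
            )
        self.used_gb[pod] = self.used_gb.get(pod, 0) + gb


card = SharedGpuMemory(total_gb=40)
card.allocate("pod-a", 30)  # succeeds: nothing stops one pod taking 30 of 40 GB

try:
    card.allocate("pod-b", 12)  # fails: only 10 GB remain
    pod_b_ok = True
except MemoryError as err:
    pod_b_ok = False
    print(err)  # pod-b: out of memory (12 GB requested, 10 GB free)
```

Nothing in this model corresponds to a Kubernetes object, which mirrors the real situation: both pods hold a valid nvidia.com/gpu allocation while one of them fails.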

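MPS clients share the same physical device memory. On Volta and later GPUs each client gets its own GPU address space, so clients cannot simply overwrite each other's allocations, but they still share one memory pool and one fault domain: a fatal GPU fault triggered by one client can take down the other clients on the same GPU. This is acceptable for trusted workloads on the same node. It is dangerous for multi-tenant clusters. The nvidia-container-runtime injects the GPU into the container, but it does not sandbox what happens on the device.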

MIG enforces memory limits at the hardware level. If an instance is configured for 5GB, the driver will not allow allocation beyond 5GB. If the workload exceeds this, the process crashes, but the other instances on the card remain unaffected. This is the only mode that prevents a noisy neighbor from taking down the card.

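Fragmentation is the failure mode for MIG. An A100 40GB card supports only specific profiles, each consuming a fixed number of compute and memory slices. If six 1g.5gb instances are active, 10GB of memory remains, but only one of the card's seven compute slices is free, and a 2g.10gb instance needs two. The remaining capacity can host one more 1g.5gb instance and nothing larger. The mig-parted configuration determines which profiles are available. If the configuration is static, the cluster cannot dynamically adapt to workload changes.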
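The fragmentation case can be checked with simple slice accounting. This sketch assumes an A100 40GB with 7 compute slices and 8 memory slices of 5GB each; the helper names are hypothetical, and real hardware adds placement constraints that this model leaves out, so it is optimistic rather than exact.

```python
COMPUTE_SLICES = 7
MEMORY_SLICES = 8  # 5 GB each on an A100 40GB

# (compute slices, memory slices) per instance.
PROFILES = {"1g.5gb": (1, 1), "2g.10gb": (2, 2), "3g.20gb": (3, 4)}


def remaining(active):
    """Free (compute, memory) slices given a dict of profile -> instance count."""
    used_compute = sum(PROFILES[p][0] * n for p, n in active.items())
    used_memory = sum(PROFILES[p][1] * n for p, n in active.items())
    return COMPUTE_SLICES - used_compute, MEMORY_SLICES - used_memory


def can_fit(profile, active):
    """True if both slice types are available; placement rules are ignored."""
    free_compute, free_memory = remaining(active)
    need_compute, need_memory = PROFILES[profile]
    return free_compute >= need_compute and free_memory >= need_memory


active = {"1g.5gb": 6}
print(remaining(active))           # (1, 2): one compute slice, 10 GB of memory
print(can_fit("2g.10gb", active))  # False: memory is there, compute is not
print(can_fit("1g.5gb", active))   # True
```

The memory is present but stranded: the limiting resource is the compute slice count, which a view of free gigabytes alone would never reveal.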

GPU metrics reach Prometheus through NVIDIA's DCGM Exporter, not through the nvidia-device-plugin itself. The DCGM_FI_DEV_GPU_UTIL metric shows utilization. For MIG, metrics are exposed per instance. A failure in one instance does not impact the metrics of others. For time-slicing, a spike in one pod masks the utilization of others on the same card.

The decision frame

The choice between these modes is a tradeoff between utilization density and isolation safety. Time-slicing maximizes density but accepts the risk of OOM kills. MPS improves throughput for trusted workloads but removes memory boundaries. MIG guarantees isolation but introduces fragmentation constraints.

The question the next time a GPU pod stays Pending is not “is the node full.” It is “did the pod request a partition the node actually exposes.” If the node has 40GB of memory available but the card is partitioned into only 1g.5gb instances and advertised under the mixed strategy, a pod requesting nvidia.com/gpu will not schedule. Read the nvidia.com/mig-<profile> allocation on the node, not the total GPU count. Fragmentation is the silent killer of MIG clusters.
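That check can be sketched as a comparison between a pod's limits and a node's allocatable resources. The function and the dictionaries are hypothetical; in practice the values would be read from the node's allocatable block and the pod spec, and the real scheduler weighs much more than this.

```python
def unsatisfied_requests(pod_limits, node_allocatable, node_in_use=None):
    """Return the extended resources a node cannot supply for this pod."""
    in_use = node_in_use or {}
    missing = {}
    for resource, wanted in pod_limits.items():
        # A resource the node never advertises counts as zero free.
        free = node_allocatable.get(resource, 0) - in_use.get(resource, 0)
        if wanted > free:
            missing[resource] = {"wanted": wanted, "free": free}
    return missing


# Node partitioned into 1g.5gb instances only (mixed strategy assumed).
node_allocatable = {"nvidia.com/mig-1g.5gb": 7}

print(unsatisfied_requests({"nvidia.com/gpu": 1}, node_allocatable))
# {'nvidia.com/gpu': {'wanted': 1, 'free': 0}}

print(unsatisfied_requests({"nvidia.com/mig-1g.5gb": 1}, node_allocatable))
# {}
```

The first pod stays Pending on a card with plenty of free memory because the resource name it asks for is not advertised at all.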