cluster N
Scheduler internals, GPU partitioning, autoscaling, node lifecycle — the layer that decides where a workload lands.
How priority and preemption interact with GPU pods
PriorityClass values decide which pods survive resource contention: when a higher-priority pod cannot be scheduled, the scheduler may preempt lower-priority pods, so a short inference request can evict a long-running training job.
Why GPU nodes need taints, even on a single-tenant cluster
The Kubernetes scheduler treats GPU nodes as ordinary capacity, so CPU-only pods can land on them unless a taint keeps them off, tying up CPU and memory that GPU workloads need.
How node-feature-discovery actually labels GPU nodes
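Node Feature Discovery on its own only reports hardware features such as the presence of an NVIDIA PCI device; the GPU model label (nvidia.com/gpu.product) is added by NVIDIA's GPU Feature Discovery, with spaces in the model name replaced by hyphens, and pod node selectors must match that label value exactly to schedule workloads.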
Why topology-aware placement matters for NCCL, and how to express it
Distributed training all-reduce throughput depends on whether pods land in the same rack, switch, or NVLink domain. podAffinity and topologySpreadConstraints control this, but only if the topologyKey names a label that is actually present on the nodes.
Eviction signals interrupt training checkpoints
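Node-pressure eviction does not honor a pod's terminationGracePeriodSeconds: soft thresholds allow at most the kubelet's eviction-max-pod-grace-period between SIGTERM and SIGKILL, and hard thresholds kill with no grace period, so a checkpoint write in progress can be cut off.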
Why GPU resource quotas behave differently than CPU quotas
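Extended resources such as nvidia.com/gpu cannot be overcommitted, so ResourceQuota accepts only the requests. prefix for them (requests.nvidia.com/gpu). A pod that would exceed the quota is rejected at admission, after any LimitRange defaults have been injected, rather than left Pending for the scheduler.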
What gang scheduling actually guarantees with Volcano
Volcano's PodGroup CRD gates pod placement until a minimum number of replicas fit, preventing distributed training jobs from deadlocking on partial allocation.
MIG, MPS, and time-slicing — three ways to share a GPU, only one of them is isolation
Of NVIDIA's three GPU sharing modes, only Multi-Instance GPU (MIG) partitions memory and compute in hardware; MPS and time-slicing let Kubernetes workloads share one GPU without that isolation.
How the default scheduler scores nodes for GPU pods
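The default scheduler first filters out nodes that cannot run a GPU pod, then ranks the remaining nodes by a weighted sum of score plugins such as NodeResourcesFit and InterPodAffinity. Custom pod annotations have no effect on the ranking unless a scheduler plugin reads them.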
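
As a rough illustration of the filter-then-score idea, here is a minimal Python sketch. It is not the scheduler's code. The Node fields, the two scoring functions, and the weights are hypothetical stand-ins chosen for the example. The only behavior taken from the real scheduler is the shape: infeasible nodes are dropped, each plugin yields a 0 to 100 score per node, and the scores are combined as a weighted sum.

```python
from dataclasses import dataclass

MAX_SCORE = 100  # per-plugin scores are normalized to 0..100 before weighting


@dataclass(frozen=True)
class Node:
    # Hypothetical, simplified view of a node for this sketch.
    name: str
    free_gpus: int
    total_gpus: int
    peers_in_zone: int  # pods of the same job already running in this node's zone


def fits(node, gpus_requested):
    """Filter step: a node is feasible only if it has enough free GPUs."""
    return node.free_gpus >= gpus_requested


def resource_score(node, gpus_requested):
    """Bin-packing score (assumed strategy): fuller nodes score higher."""
    used_after = node.total_gpus - node.free_gpus + gpus_requested
    return MAX_SCORE * used_after // node.total_gpus


def affinity_score(node, max_peers):
    """Affinity score: more co-located peers score higher, scaled to 0..100."""
    if max_peers == 0:
        return 0
    return MAX_SCORE * node.peers_in_zone // max_peers


def pick_node(nodes, gpus_requested, weights):
    """Return (best_node_name, totals) or None if no node is feasible."""
    feasible = [n for n in nodes if fits(n, gpus_requested)]
    if not feasible:
        return None
    max_peers = max(n.peers_in_zone for n in feasible)
    totals = {
        n.name: weights["resources"] * resource_score(n, gpus_requested)
        + weights["affinity"] * affinity_score(n, max_peers)
        for n in feasible
    }
    best = max(totals, key=totals.get)
    return best, totals


# Example inputs; the weights are illustrative, not the scheduler's defaults.
NODES = [
    Node("a", free_gpus=8, total_gpus=8, peers_in_zone=0),
    Node("b", free_gpus=2, total_gpus=8, peers_in_zone=1),
    Node("c", free_gpus=1, total_gpus=8, peers_in_zone=3),
]
WEIGHTS = {"resources": 1, "affinity": 2}

print(pick_node(NODES, gpus_requested=2, weights=WEIGHTS))
```

In this example node c is filtered out before scoring even though it has the most peers, which is why affinity preferences alone cannot pull a pod onto a node that lacks free GPUs.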