cluster N

Scheduler internals, GPU partitioning, autoscaling, node lifecycle — the layer that decides where a workload lands.

cluster·May 13, 2026

How priority and preemption interact with GPU pods

PriorityClass values determine which pods survive resource contention; preemption can evict long-running training jobs for short inference requests.

cluster·May 13, 2026

Why GPU nodes need taints, even on a single-tenant cluster

The Kubernetes scheduler will place CPU-only pods on GPU nodes unless taints keep them off, wasting expensive hardware capacity.

cluster·May 13, 2026

How node-feature-discovery actually labels GPU nodes

Node Feature Discovery labels nodes with the exact GPU model string returned by the driver, and pod selectors must match that string exactly to schedule workloads.

cluster·May 13, 2026

Why topology-aware placement matters for NCCL, and how to express it

Distributed training all-reduce throughput depends on whether pods land in the same rack, switch, or NVLink domain. PodAffinity and TopologySpreadConstraints control this, but only if the topology key matches a label that is actually present on the nodes.

cluster·May 13, 2026

Eviction signals interrupt training checkpoints

Node-pressure eviction sends SIGTERM to training pods and follows it with SIGKILL once the grace period runs out, which can cut off a checkpoint write in progress.

cluster·May 13, 2026

Why GPU resource quotas behave differently than CPU quotas

ResourceQuota is enforced at admission, independently of what the scheduler later sees, so GPU pods can be rejected for exceeding quota when LimitRange injects default values.

cluster·May 13, 2026

What gang scheduling actually guarantees with Volcano

Volcano's PodGroup CRD gates pod placement until a minimum number of replicas fit, preventing distributed training jobs from deadlocking on partial allocation.

cluster·May 13, 2026

MIG, MPS, and time-slicing — three ways to share a GPU, only one of them is isolation

NVIDIA's GPU sharing modes expose a critical distinction: only Multi-Instance GPU provides hardware memory isolation for Kubernetes workloads.

cluster·May 13, 2026

How the default scheduler scores nodes for GPU pods

The default scheduler assigns GPU pods using a weighted scoring system where NodeResourcesFit and InterPodAffinity outweigh custom pod annotations.