Network policy isolation for multi-tenant AI workloads

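A NetworkPolicy isolates the pods it selects only in the directions listed in its policyTypes. When policyTypes is omitted it defaults to Ingress, plus Egress only if the policy contains egress rules, so a policy with no allow rules blocks all ingress but leaves egress open.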

Kubernetes NetworkPolicy is a specification for traffic flow between pods, namespaces, and external endpoints. It is a built-in, namespaced API resource in the networking.k8s.io/v1 group, not a CRD, and it declares which connections are permitted rather than which are denied. By default, a cluster allows all traffic between all pods. A NetworkPolicy changes this only for the pods it selects, and only for the directions it covers. If no policy selects a pod, that pod remains open to all ingress and egress.

This distinction creates a security gap for multi-tenant clusters. Platform teams often apply a policy to a specific team’s namespace to restrict access and assume the cluster is locked down. It is not: NetworkPolicy objects are namespaced, so every tenant namespace needs its own default-deny policy, and the CNI must enforce it. AI workloads add complexity because they require high-bandwidth connections for model training and low-latency connections for inference, both of which must remain open while lateral movement between tenants is restricted.

Enforcing default deny

The Kubernetes spec does not define a default-deny state, and NetworkPolicy has no deny rules. To achieve zero-trust isolation, each namespace needs a policy that selects all of its pods and allows nothing, which isolates them for the listed policy types. This policy acts as a baseline. Policies are additive, so any further policy for a specific pod only needs to allow the traffic that the baseline blocks.

The following YAML creates a default-deny policy for ingress. It selects all pods in the namespace using an empty podSelector.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress

This policy denies all incoming traffic to every pod in tenant-a. It does not touch egress: Kubernetes leaves egress open even when ingress is denied, so denying it requires a separate policy. Once egress is denied, the destinations AI workloads depend on, such as a model registry, object storage for checkpoints, or a shared control plane, must be allowed explicitly.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Egress

Applying both policies creates a zero-trust zone, but only if the CNI plugin enforces them. The API server stores NetworkPolicy objects; no core Kubernetes component enforces them, so the CNI must translate them into iptables rules or eBPF programs. If the CNI does not support NetworkPolicy, the YAML will apply without error but traffic will flow unimpeded.

Allowing essential egress

AI workloads fail silently when DNS is blocked. A pod cannot resolve service names without access to the cluster DNS service. This service typically runs in the kube-system namespace as CoreDNS or kube-dns. A default-deny egress policy will block DNS queries unless explicitly allowed.

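The following policy allows egress to the DNS service. It combines a namespaceSelector for the kube-system namespace and a podSelector for the DNS pods in a single peer, so both must match. Listing them as two separate peers would instead allow egress to every pod in kube-system and to any pod in tenant-a labelled k8s-app: kube-dns.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    # One peer, two selectors: the namespace AND the pod label must both match.
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns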
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

This configuration is critical for training and inference pods. Without it, pods cannot resolve endpoints for model registries or internal APIs. The policy must be applied to the tenant namespace alongside the default-deny policy.
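DNS is only the first entry on a tenant's allow-list. As a minimal sketch of allowing one more destination, the shell function below prints a policy that permits TCP egress to a single CIDR, for example an object storage endpoint used for checkpoints. The function name render_allow_egress_cidr, the policy name allow-egress-storage, and the 203.0.113.0/24 range (a documentation placeholder) are assumptions made for illustration, not values from a real cluster.

```shell
# Print a NetworkPolicy that allows TCP egress from every pod in a namespace
# to one CIDR and port. render_allow_egress_cidr is a hypothetical helper.
render_allow_egress_cidr() {
  ns="$1"; cidr="$2"; port="$3"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-storage
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: ${cidr}
    ports:
    - protocol: TCP
      port: ${port}
EOF
}

# Example values only: 203.0.113.0/24 is a placeholder range, not a real endpoint.
render_allow_egress_cidr tenant-a 203.0.113.0/24 443
```

Piping the output to kubectl apply -f - would create the policy. An ipBlock matches IP addresses only, so an endpoint whose addresses change often may need a different mechanism than a fixed CIDR.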

Different CNI implementations handle NetworkPolicy enforcement differently. The table below compares the enforcement mechanisms of CNIs commonly used in AI clusters, with kube-proxy included because it is often mistaken for an enforcement point.

Component	Enforcement Mechanism	Enforces NetworkPolicy	Notes
Cilium	eBPF	Yes	High performance; supports L7 policies
Calico	iptables (default) or eBPF	Yes	Widely supported; robust policy engine
AWS VPC CNI	eBPF (node agent)	Only when enabled	Needs a recent plugin version with the network policy feature turned on
kube-proxy (not a CNI)	None	No	Implements Service routing only; does not enforce NetworkPolicy

Cilium and Calico are the primary choices for enforcing these policies in production. AWS VPC CNI enforces NetworkPolicy only when its network policy feature is available and enabled; it does not map NetworkPolicy to AWS Security Groups, and older versions need a separate policy engine such as Calico. kube-proxy never enforces NetworkPolicy.

Verifying isolation

Testing isolation requires a pod that attempts to connect to a target that should be blocked. The kubectl run command can launch a temporary pod to verify connectivity.

To test ingress isolation, run a pod in tenant-b and attempt to connect to a service in tenant-a.

kubectl run test-pod --image=busybox --rm -it --namespace=tenant-b -- sh

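Inside the pod, request the service in tenant-a. The busybox image ships wget rather than curl. Use the namespace-qualified name, because a bare service name resolves only within the pod's own namespace, and set a timeout so a blocked request gives up after a few seconds.

wget -q --spider -T 5 http://tenant-a-service.tenant-a:8080

If the NetworkPolicy is enforced, the connection will time out or be refused. If the connection succeeds, the policy is not enforced. This test must be run from a pod outside the target namespace.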

To test egress isolation, run a pod in tenant-a and attempt to connect to an external IP.

kubectl run test-pod --image=busybox --rm -it --namespace=tenant-a -- sh

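Inside the pod, request an external address.

wget -q --spider -T 5 http://8.8.8.8

If default-deny egress is enforced, the request will time out. If the namespace has an egress allow-list, any address outside that list should still fail. Run the same command once before applying the policy to confirm that the target is reachable at all; otherwise a timeout proves nothing.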
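The two manual checks above can be wrapped in a small helper for repeated use. The sketch below assumes kubectl is already configured for the cluster and that kubectl run, when attached with --restart=Never, exits with the container's status, which is worth confirming on the kubectl version in use. The function name probe and the pod name prefix np-probe are hypothetical.

```shell
# probe NAMESPACE URL
# Launch a throwaway busybox pod in NAMESPACE, request URL once with a
# 5 second timeout, and report whether the request succeeded.
# Assumption: kubectl run propagates the container's exit status.
probe() {
  ns="$1"; url="$2"
  if kubectl run "np-probe-$$" --image=busybox --restart=Never --rm -i \
       --namespace="$ns" -- wget -q --spider -T 5 "$url" >/dev/null 2>&1; then
    echo "REACHABLE $ns -> $url"
  else
    echo "FAILED $ns -> $url"
  fi
}

# Usage against a live cluster (not executed here):
#   probe tenant-b http://tenant-a-service.tenant-a:8080   # ingress check
#   probe tenant-a http://8.8.8.8                          # egress check
```

A FAILED result only means the request did not succeed. It cannot tell a dropped connection from a DNS failure or an image pull problem, so compare it against a run made before the policy was applied.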

Failure modes

The most common failure mode is DNS resolution breaking after applying default-deny egress. The cluster DNS service runs in kube-system, which is outside the tenant namespace. A policy that denies egress without allowing DNS traffic to kube-system will cause every pod in the tenant to fail to resolve service names. This typically manifests as name-resolution timeouts or host-not-found errors in application logs, not as an obvious policy violation. The fix is to add the allow-dns policy shown above.

Another failure mode is assuming egress is blocked by default. Kubernetes allows all egress from a pod until a policy with the Egress policy type selects it. A team might apply an ingress-only default-deny policy and assume they are secure. An attacker or compromised pod in the tenant namespace can still exfiltrate data to external endpoints. This is not a configuration error in the policy but a misunderstanding of the Kubernetes default state.

CNI enforcement gaps are the third failure mode. If the cluster uses a CNI that does not support NetworkPolicy, the policies will apply without error but will not block traffic. The kubectl get networkpolicy command will list the policy exactly as it would on an enforcing cluster, but kubectl run tests will still show connectivity. The operator must verify that the CNI supports NetworkPolicy before relying on it for isolation.
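A quick first check is to see which network agents are actually running. The sketch below reads DaemonSet names, in the form printed by kubectl get daemonsets -o name, and flags the agents discussed in this section. It assumes the DaemonSet names used by default installs (cilium, calico-node, aws-node); classify_cni is a hypothetical helper, and a connectivity test remains the only real proof of enforcement.

```shell
# classify_cni: read DaemonSet names on stdin (one per line, for example
# "daemonset.apps/cilium") and print a hint for each known network agent.
# Names are assumed to match default installs; custom installs may differ.
classify_cni() {
  while read -r name; do
    case "${name##*/}" in
      cilium)      echo "cilium: supports NetworkPolicy" ;;
      calico-node) echo "calico: supports NetworkPolicy" ;;
      aws-node)    echo "aws-vpc-cni: enforcement depends on version and settings" ;;
    esac
  done
}

# Usage against a live cluster (not executed here):
#   kubectl get daemonsets --all-namespaces -o name | classify_cni
# Offline demonstration with sample names:
printf '%s\n' daemonset.apps/kube-proxy daemonset.apps/calico-node | classify_cni
```

No output means none of the known agents was found, which is a strong signal to run the connectivity tests before trusting any policy.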

Decision frame

The tradeoff is between zero-trust isolation and the operational cost of maintaining explicit allow-lists for every service.

When a tenant next reports connectivity issues, the question is not ‘is the policy too strict?’ but ‘does the policy allow DNS and the specific egress the workload requires?’ Default-deny egress breaks everything until the allow-list is complete. If the cluster has 20 tenants, the operational cost of maintaining 20 DNS policies and 20 egress allow-lists is high. The decision is whether to accept that cost for the security benefit of preventing lateral movement, or to rely on network segmentation at the node level instead.
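One way to contain that cost is to template the per-tenant baseline instead of hand-writing it. The following is a sketch under stated assumptions: each tenant maps to exactly one namespace, and the cluster DNS pods carry the k8s-app: kube-dns label in kube-system. The helper name render_tenant_baseline and the policy name default-deny-all are hypothetical.

```shell
# Print the baseline policies for one tenant namespace as a multi-document
# YAML stream: deny all ingress and egress, then re-allow DNS.
# render_tenant_baseline is a hypothetical helper, not a kubectl feature.
render_tenant_baseline() {
  ns="$1"
  cat <<EOF
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
}

# Render the baseline for each tenant; the namespace list is an example.
for tenant in tenant-a tenant-b; do
  render_tenant_baseline "$tenant"
done
```

Appending | kubectl apply -f - to the loop would apply every rendered policy. The workload-specific egress allow-lists still have to be written per tenant; only the shared baseline is generated here.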