The Problem — Why This Matters
When you scale a Deployment to multiple replicas, Kubernetes decides where each pod lands. Without any constraints, the scheduler may place all replicas on the same node — repeatedly.
⚖️ Poor Resource Distribution
One node gets overloaded while others sit idle. CPU and memory are wasted across the cluster.
❌ Lower Fault Tolerance
If all pods are on one node, a node failure means zero replicas survive. Your app goes down entirely.
💥 Node Failure = Full Outage
The whole point of multiple replicas is HA. Co-locating them completely defeats that purpose.
Core Concepts
Understanding these four building blocks makes the YAML configuration obvious rather than mysterious.
🔑 topologyKey
The node label used to define "groups." Use kubernetes.io/hostname to spread per node, or topology.kubernetes.io/zone to spread across availability zones.
📐 maxSkew
Maximum allowed difference in pod count between the most-loaded and least-loaded group. maxSkew: 1 enforces near-perfect balance.
🏷️ labelSelector
Tells Kubernetes which pods to count when calculating distribution. Only pods matching these labels are considered.
🛡️ whenUnsatisfiable
What to do when the constraint can't be satisfied. Either ScheduleAnyway (soft) or DoNotSchedule (strict/hard).
Think of maxSkew as the "allowed imbalance budget." If maxSkew is 1, no node can have more than 1 extra pod compared to the least loaded node. If it's 3, a difference of up to 3 is tolerated before Kubernetes starts routing around it.
Supported topologyKey Values
| topologyKey | Spreads Across | Use When |
|---|---|---|
| kubernetes.io/hostname | Individual nodes | ✓ Most Common |
| topology.kubernetes.io/zone | AZ / datacenter zones | Cloud HA |
| topology.kubernetes.io/region | Cloud regions | Multi-region |
| Custom label | Any node group you define | Advanced |
Prerequisites
- Kubernetes 1.19+: topologySpreadConstraints became GA (stable) in v1.19
- Multiple nodes: spreading has no effect on a single-node cluster
- Pod labels set correctly: the labelSelector must match your pod labels exactly
- kubectl access, or Helm if using the chart-based approach
Without enough eligible nodes, topologySpreadConstraints is effectively a no-op or, with DoNotSchedule, may leave pods Pending. Always test on a multi-node setup.
Direct YAML Deployment
Add topologySpreadConstraints directly inside your Deployment manifest under spec → template → spec.
Place topologySpreadConstraints under spec.template.spec, at the same level as containers. It does not belong in the top-level spec of the Deployment; it goes in the pod spec inside the template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        spread-group: "app" # ← required for labelSelector
    spec:
      # ──────────────────────────────────────────
      # ADD THIS BLOCK under spec (pod spec level)
      # ──────────────────────────────────────────
      topologySpreadConstraints:
        - maxSkew: 3
          topologyKey: "kubernetes.io/hostname"
          whenUnsatisfiable: "ScheduleAnyway"
          labelSelector:
            matchLabels:
              spread-group: "app"
      containers:
        - name: my-app
          image: my-app:latest
Multi-Constraint Example (Node + Zone)
You can stack multiple constraints to spread across both zones and individual nodes simultaneously:
topologySpreadConstraints:
  # Constraint 1: spread across availability zones
  - maxSkew: 1
    topologyKey: "topology.kubernetes.io/zone"
    whenUnsatisfiable: "DoNotSchedule"
    labelSelector:
      matchLabels:
        spread-group: "app"
  # Constraint 2: also spread across individual nodes
  - maxSkew: 2
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: "ScheduleAnyway"
    labelSelector:
      matchLabels:
        spread-group: "app"
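To make the stacking behaviour concrete, here is a minimal Python sketch of how a hard zone constraint and a soft hostname constraint narrow down candidate nodes. The node names, zone names, and pod counts are hypothetical, and the logic is a simplification of the scheduler, not its actual implementation: only DoNotSchedule constraints filter nodes here.

```python
from collections import Counter

ZONE = "topology.kubernetes.io/zone"
HOST = "kubernetes.io/hostname"

# Hypothetical cluster: node name -> node labels (illustrative values only).
NODES = {
    "node-a1": {HOST: "node-a1", ZONE: "zone-a"},
    "node-a2": {HOST: "node-a2", ZONE: "zone-a"},
    "node-b1": {HOST: "node-b1", ZONE: "zone-b"},
}
# Hypothetical number of matching pods already running on each node.
PODS_ON_NODE = {"node-a1": 2, "node-a2": 1, "node-b1": 2}

CONSTRAINTS = [
    {"maxSkew": 1, "topologyKey": ZONE, "whenUnsatisfiable": "DoNotSchedule"},
    {"maxSkew": 2, "topologyKey": HOST, "whenUnsatisfiable": "ScheduleAnyway"},
]


def domain_counts(topology_key):
    """Sum matching pods per topology domain (zone or hostname)."""
    counts = Counter()
    for node, labels in NODES.items():
        counts[labels[topology_key]] += PODS_ON_NODE.get(node, 0)
    return counts


def skew_if_placed(node, topology_key):
    """Skew of the node's domain if one more pod landed there."""
    counts = domain_counts(topology_key)
    return counts[NODES[node][topology_key]] + 1 - min(counts.values())


def passes_hard_constraints(node):
    """A node must satisfy every DoNotSchedule constraint to stay eligible."""
    return all(
        skew_if_placed(node, c["topologyKey"]) <= c["maxSkew"]
        for c in CONSTRAINTS
        if c["whenUnsatisfiable"] == "DoNotSchedule"
    )


eligible = [node for node in NODES if passes_hard_constraints(node)]
print(eligible)  # ['node-b1']
```

With zone-a holding 3 pods and zone-b holding 2, only the zone-b node keeps the zone skew within 1. The soft hostname constraint would then only influence ranking among whatever nodes remain.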
Helm Chart Setup
For Helm-managed deployments, parameterize the constraints via values.yaml so they can be tuned per environment without touching the template.
Update templates/deployment.yaml
Replace hardcoded values with Helm template variables referencing .Values.podSpread.
spec:
  template:
    metadata:
      labels:
        spread-group: "{{ .Values.podSpread.label }}"
    spec:
      topologySpreadConstraints:
        - maxSkew: {{ .Values.podSpread.maxSkew }}
          topologyKey: {{ .Values.podSpread.topologyKey }}
          whenUnsatisfiable: {{ .Values.podSpread.whenUnsatisfiable }}
          labelSelector:
            matchLabels:
              spread-group: "{{ .Values.podSpread.label }}"
Define Defaults in values.yaml
These are the default values. Override them per environment using -f values-prod.yaml or --set flags.
podSpread:
  maxSkew: 3
  topologyKey: "kubernetes.io/hostname"
  whenUnsatisfiable: "ScheduleAnyway"
  label: "app"
Environment-Specific Overrides
Use a values-prod.yaml to enforce stricter spreading in production without changing the base chart.
# Production: stricter HA requirements
podSpread:
  maxSkew: 1
  topologyKey: "kubernetes.io/hostname"
  whenUnsatisfiable: "DoNotSchedule"
  label: "app"

# Deploy with production values:
helm upgrade --install my-app ./chart \
  -f values.yaml \
  -f values-prod.yaml
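When several -f files are passed, later files take precedence over earlier ones, and nested maps are merged key by key. The Python sketch below approximates that precedence for the podSpread block. The merge_values helper is hypothetical and is not Helm's implementation; for instance, it does not model Helm's handling of null values.

```python
def merge_values(base, override):
    """Recursively merge two values maps; keys in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = value
    return merged


# Defaults, mirroring values.yaml.
base = {
    "podSpread": {
        "maxSkew": 3,
        "topologyKey": "kubernetes.io/hostname",
        "whenUnsatisfiable": "ScheduleAnyway",
        "label": "app",
    }
}
# A production override that only lists the keys it changes.
prod = {"podSpread": {"maxSkew": 1, "whenUnsatisfiable": "DoNotSchedule"}}

effective = merge_values(base, prod)
print(effective["podSpread"])
```

The override file only needs the keys that differ; topologyKey and label fall through from the defaults.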
Required: Add Labels to Your Pods
This is mandatory and the most commonly missed step. The labelSelector in your constraint only works if your pods actually have the matching label. Without it, Kubernetes has no idea which pods to count when calculating spread.
# This label MUST exist on every pod you want to spread
metadata:
  labels:
    app: my-app
    spread-group: "app" # ← this is what labelSelector matches
If the pod labels don't match labelSelector.matchLabels, the constraint is effectively ignored. Kubernetes won't error; it will just not spread. Always double-check the label key and value are identical.
Using a dedicated label such as spread-group: app instead of just app: my-app gives you flexibility. If you later want to spread multiple different deployments as a single group (or exclude some pods), you can do so by controlling who gets this label.
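A small Python sketch of matchLabels semantics shows why a missing or misspelled label silently drops a pod from the count. The pod names and the matches helper are hypothetical; the rule it models (every key in matchLabels must be present with an identical value) is how equality-based label selection works.

```python
def matches(pod_labels, match_labels):
    """True only if every selector key exists on the pod with the same value."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


selector = {"spread-group": "app"}

# Hypothetical pods for illustration.
pods = [
    {"name": "my-app-1", "labels": {"app": "my-app", "spread-group": "app"}},
    {"name": "my-app-2", "labels": {"app": "my-app"}},  # label missing
    {"name": "my-app-3", "labels": {"app": "my-app", "spread-group": "App"}},  # case differs
]

counted = [pod["name"] for pod in pods if matches(pod["labels"], selector)]
print(counted)  # ['my-app-1']
```

Only the first pod is counted toward the spread; the other two are invisible to the constraint even though they belong to the same application.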
Understanding maxSkew
maxSkew is the core numeric parameter that controls how uneven the distribution can be. It represents the maximum allowed difference in pod count between any two topology domains (nodes).
[Figure: example with 3 nodes and maxSkew: 3, showing an allowed distribution]
[Figure: example with 3 nodes exceeding maxSkew: 3, not allowed in strict mode]
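As a worked illustration of the definition, this Python sketch computes the skew of a few pod distributions across 3 nodes and checks them against maxSkew: 3. The distributions are made-up examples, not the ones from the original figures.

```python
def skew(pod_counts):
    """Difference between the most and least loaded topology domain."""
    return max(pod_counts) - min(pod_counts)


def allowed(pod_counts, max_skew):
    """True if the distribution stays within the maxSkew budget."""
    return skew(pod_counts) <= max_skew


# Pods per node on a hypothetical 3-node cluster.
print(allowed([3, 2, 1], max_skew=3))  # True: skew is 2
print(allowed([4, 1, 1], max_skew=3))  # True: skew is 3
print(allowed([5, 1, 0], max_skew=3))  # False: skew is 5
```

The same [4, 1, 1] distribution would be rejected under maxSkew: 1 or maxSkew: 2, which is why lower values give tighter balance.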
maxSkew Quick Reference
| maxSkew Value | Behaviour | Best For |
|---|---|---|
| 1 | Near-perfect balance. Strict. | Production HA |
| 2 | Moderate imbalance allowed. | Staging |
| 3 | Flexible. Default recommendation. | Most Deployments |
| 5+ | Very relaxed. Near no-op on small clusters. | Dev / Low Priority |
whenUnsatisfiable — Soft vs Strict
This field controls what Kubernetes does when it cannot satisfy the spread constraint — either because there aren't enough nodes, or because all valid nodes would violate maxSkew.
ScheduleAnyway (Soft)
- Scheduler tries its best to balance
- If balancing is impossible, the pod is scheduled anyway
- No pods stay Pending because of this constraint
- Imbalance may occur under pressure
- Ideal for non-critical workloads
DoNotSchedule (Strict)
- Constraint is a hard requirement
- Pod stays Pending if placing it would violate the constraint
- Enforces maxSkew at scheduling time (already-running pods are not rebalanced)
- Risk: pods get stuck if the cluster is unbalanced
- Suited to strict HA production apps
If you use DoNotSchedule but don't have enough nodes (or nodes are full/tainted), your pods will stay in Pending state indefinitely. Always ensure you have enough eligible nodes when using strict mode.
topologySpreadConstraints:
  - maxSkew: 1 # strict balance
    topologyKey: "kubernetes.io/hostname"
    whenUnsatisfiable: "DoNotSchedule" # hard constraint
    labelSelector:
      matchLabels:
        spread-group: "app"
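The difference between the two modes can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the scheduler's code: the place function, node names, and counts are hypothetical, and the real scheduler weighs many other factors when ranking nodes.

```python
def place(counts, candidates, max_skew, when_unsatisfiable):
    """Return the chosen node name, or None if the pod would stay Pending.

    counts:     matching pods per node, including nodes that cannot take the pod
    candidates: nodes that could otherwise run the pod (capacity, taints, ...)
    """
    if not candidates:
        return None
    floor = min(counts.values())
    skews = {node: counts[node] + 1 - floor for node in candidates}
    within_budget = [node for node in candidates if skews[node] <= max_skew]
    if within_budget:
        return min(within_budget, key=lambda node: skews[node])
    if when_unsatisfiable == "ScheduleAnyway":
        # Soft mode: accept the least-bad node instead of leaving the pod Pending.
        return min(candidates, key=lambda node: skews[node])
    return None  # DoNotSchedule: the pod stays Pending


# node-2 is empty but has no free capacity, so only node-1 is a candidate.
counts = {"node-1": 1, "node-2": 0}
print(place(counts, ["node-1"], 1, "DoNotSchedule"))   # None
print(place(counts, ["node-1"], 1, "ScheduleAnyway"))  # node-1
```

Placing on node-1 would give a skew of 2, which exceeds maxSkew: 1. Strict mode leaves the pod Pending, while soft mode schedules it and accepts the imbalance.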
Cross-Namespace Behavior
A common point of confusion: what happens when the same label exists in multiple namespaces?
topologySpreadConstraints only counts pods within the same namespace as the pod being scheduled. Pods in other namespaces are invisible to the calculation, even if they share the same labels.
Namespace A
Pods with spread-group: app are spread across nodes within Namespace A only.
Namespace B
Pods with the same label spread independently within Namespace B only. Unaware of Namespace A.
Cross-NS
Kubernetes does not balance pods across namespaces. This is not supported natively.
The topologySpreadConstraints API does not define a namespaceSelector field, so cross-namespace counting cannot be configured on the constraint itself. Pod affinity and anti-affinity terms do accept namespaces and namespaceSelector if you need cross-namespace awareness.
# If you deploy app with same labels in 2 namespaces:
Namespace A: pods on Node1(3), Node2(1), Node3(2) → skew 2, within maxSkew: 3 ✓
Namespace B: pods on Node1(4), Node2(2), Node3(1) → skew 3, within maxSkew: 3 ✓ (within B)
# Node1 might end up with 3+4 = 7 pods total
# Neither namespace's constraint prevents this
# ❌ topologySpreadConstraints does NOT account for this
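The per-namespace accounting can be reproduced with a short Python sketch. The namespaces, node names, and counts are hypothetical; the point is only that each namespace can stay within its own budget while the combined load on a node does not.

```python
from collections import Counter

MAX_SKEW = 3

# Hypothetical matching-pod counts per node, tracked separately per namespace.
PLACEMENTS = {
    "namespace-a": {"node-1": 3, "node-2": 1, "node-3": 2},
    "namespace-b": {"node-1": 4, "node-2": 2, "node-3": 1},
}


def skew(counts):
    """Difference between the most and least loaded node."""
    return max(counts.values()) - min(counts.values())


# Each namespace is evaluated on its own, as the scheduler does.
per_namespace_skew = {ns: skew(counts) for ns, counts in PLACEMENTS.items()}

# Combined view across namespaces, which no constraint looks at.
total_per_node = Counter()
for counts in PLACEMENTS.values():
    total_per_node.update(counts)

print(per_namespace_skew)     # {'namespace-a': 2, 'namespace-b': 3}
print(dict(total_per_node))   # {'node-1': 7, 'node-2': 3, 'node-3': 3}
print(skew(total_per_node))   # 4, above MAX_SKEW even though each namespace passes
```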
How the Scheduler Works Internally
When a new pod needs to be scheduled, the Kubernetes scheduler runs through this logic for each topologySpreadConstraint:
Find Matching Pods
The scheduler searches for all existing pods that match the labelSelector in the same namespace. These are the pods it will count for distribution.
Group by topologyKey
Pods are grouped by the value of the node label specified in topologyKey. Each unique value (hostname, zone, etc.) forms one "topology domain."
Calculate Skew Per Domain
For each candidate node, the scheduler calculates what the skew would be if the new pod were placed there: current count + 1 − min count.
# After placing pod on candidate node:
skew = (count_in_candidate_domain + 1) - min_count_across_all_domains
# If skew > maxSkew → node is ineligible (DoNotSchedule) or ranked lower (ScheduleAnyway)
Place on the Least Loaded Eligible Node
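Among all nodes that don't violate maxSkew, the scheduler favors those with fewer matching pods. Other scoring factors also influence the final choice, so treat this as a strong bias toward balance rather than a guarantee of the single least-loaded node.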
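The four steps above can be sketched end to end in Python. Everything here is illustrative: the cluster state, namespace names, and the rank_nodes helper are assumptions, and the ranking ignores the other scoring factors a real scheduler applies.

```python
from collections import Counter

HOSTNAME = "kubernetes.io/hostname"

# Hypothetical cluster state for illustration only.
NODES = {
    "node-1": {HOSTNAME: "node-1"},
    "node-2": {HOSTNAME: "node-2"},
    "node-3": {HOSTNAME: "node-3"},
}
PODS = [
    {"namespace": "prod", "labels": {"spread-group": "app"}, "node": "node-1"},
    {"namespace": "prod", "labels": {"spread-group": "app"}, "node": "node-1"},
    {"namespace": "prod", "labels": {"spread-group": "app"}, "node": "node-2"},
    {"namespace": "other", "labels": {"spread-group": "app"}, "node": "node-3"},
]


def rank_nodes(namespace, match_labels, topology_key, max_skew):
    """Return eligible node names, lowest resulting skew first."""
    # Step 1: find matching pods in the same namespace.
    matching = [
        pod for pod in PODS
        if pod["namespace"] == namespace
        and all(pod["labels"].get(k) == v for k, v in match_labels.items())
    ]
    # Step 2: group by topologyKey, keeping empty domains at zero.
    counts = Counter({labels[topology_key]: 0 for labels in NODES.values()})
    for pod in matching:
        counts[NODES[pod["node"]][topology_key]] += 1
    # Step 3: skew if the new pod were placed on each candidate node.
    floor = min(counts.values())
    skews = {
        name: counts[labels[topology_key]] + 1 - floor
        for name, labels in NODES.items()
    }
    # Step 4: drop nodes over budget, prefer the least loaded of the rest.
    eligible = [name for name in NODES if skews[name] <= max_skew]
    return sorted(eligible, key=skews.get)


print(rank_nodes("prod", {"spread-group": "app"}, HOSTNAME, max_skew=1))  # ['node-3']
print(rank_nodes("prod", {"spread-group": "app"}, HOSTNAME, max_skew=2))  # ['node-3', 'node-2']
```

The pod in the "other" namespace carries the same label but is not counted, so node-3 is treated as empty for the "prod" workload.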
Pro Tips & Combinations
- Use maxSkew: 1 for strict HA setups where per-node pod counts may differ by at most one
- Use DoNotSchedule for production-critical apps so the scheduler never ignores the constraint
- Combine with podAntiAffinity for even stronger node separation (belt + suspenders approach)
- Ensure enough eligible nodes have capacity for their share of replicas when using DoNotSchedule
- Test your spread with kubectl get pods -o wide to verify node distribution after deploy
- For multi-AZ clusters, stack two constraints: one for hostname, one for zone
Combining with podAntiAffinity
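For stronger node separation, combine topologySpreadConstraints with a podAntiAffinity rule. The spread constraint handles even distribution; the preferred anti-affinity rule below adds a soft push to keep replicas off the same node. Switch to requiredDuringSchedulingIgnoredDuringExecution if you need a hard barrier.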
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: "kubernetes.io/hostname"
      whenUnsatisfiable: "DoNotSchedule"
      labelSelector:
        matchLabels:
          spread-group: "app"
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: "kubernetes.io/hostname"
Quick Verification Command
# See which node each pod landed on:
kubectl get pods -l app=my-app -o wide
# Count pods per node:
kubectl get pods -l spread-group=app \
-o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
| sort | uniq -c | sort -rn
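If you prefer to post-process the output in a script, the same per-node count can be derived from the JSON form of the pod list. The sketch below assumes input shaped like the output of kubectl get pods -o json (an items array whose entries carry spec.nodeName). The sample data and the pods_per_node helper are hypothetical.

```python
import json
from collections import Counter


def pods_per_node(pod_list_json):
    """Count pods per node from a JSON pod list; Pending pods have no nodeName."""
    items = json.loads(pod_list_json)["items"]
    return Counter(item["spec"].get("nodeName", "<unscheduled>") for item in items)


# Hypothetical sample standing in for real kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "my-app-1"}, "spec": {"nodeName": "node-1"}},
        {"metadata": {"name": "my-app-2"}, "spec": {"nodeName": "node-1"}},
        {"metadata": {"name": "my-app-3"}, "spec": {"nodeName": "node-2"}},
        {"metadata": {"name": "my-app-4"}, "spec": {}},  # still Pending
    ]
})

for node, count in pods_per_node(sample).most_common():
    print(count, node)
```

Counting unscheduled pods separately makes Pending replicas under DoNotSchedule easy to spot.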
Final Outcome
With topologySpreadConstraints properly configured, pod placement across your cluster is more predictable and resilient.
Key Takeaways
- Label your pods: a label matching labelSelector (here spread-group) is mandatory for the constraint to function
- Start with ScheduleAnyway, then tighten to DoNotSchedule once the cluster is stable
- maxSkew: 1 for production, maxSkew: 3 for flexible dev/staging environments
- Namespace-scoped: cross-namespace spreading is not supported natively
- Stack constraints: combine hostname + zone constraints for multi-AZ resilience
- Verify with kubectl: always confirm actual distribution after deploying
For multi-AZ clusters, add a constraint with topologyKey: topology.kubernetes.io/zone to spread across availability zones. This is the foundation of truly resilient multi-AZ Kubernetes deployments.