Why the Descheduler Exists
The Kubernetes scheduler makes placement decisions at pod creation time. Once a pod is placed, the scheduler never revisits that decision. Over time this causes a well-known cluster drift problem.
📈 Node Drift
Nodes added later to the cluster stay underutilised: the pods already running on older nodes are never moved, so only newly created pods can land on the new capacity.
🔥 Memory Pressure
One node silently fills up with pods while others are idle. OOMKill events and throttling start appearing on the busy node.
💀 Constraint Violations
After node taints, affinity rules, or topology constraints are added, existing pods may violate them — but they're never moved.
🔁 Restart Loops
CrashLooping pods stay on a bad node forever. The descheduler can evict them so they land on a healthier node.
Pair the Descheduler with topologySpreadConstraints. Spread constraints prevent bad placement at creation time. The Descheduler fixes bad placement that already happened. Together they cover both new and already-running pods. See the topologySpreadConstraints guide for the other half of this setup.
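As a rough sketch of the scheduling-time half, a spread constraint in a Deployment's pod template might look like the following. The Deployment name, label, replica count, and image are placeholders for illustration and are not taken from this guide.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # at most 1 pod of difference between nodes
          topologyKey: kubernetes.io/hostname   # spread across individual nodes
          whenUnsatisfiable: ScheduleAnyway     # soft rule; DoNotSchedule makes it a hard rule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: nginx:1.27                     # placeholder image
```

The constraint only influences where new pods go. Pods that were already placed before it existed are what the RemovePodsViolatingTopologySpreadConstraint plugin is meant to catch.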
Core Concepts
🔁 Eviction
The descheduler doesn't move pods directly. It evicts them (graceful delete), and the scheduler re-places them on a better node.
📋 Profiles
A named set of plugins that define what to check and evict. You can have multiple profiles with different policies.
🔌 Plugins
Individual rules — e.g. LowNodeUtilization, RemoveDuplicates. Each plugin checks one specific condition and evicts pods that violate it.
🛡️ DefaultEvictor
A mandatory plugin that guards which pods are evictable — it prevents evicting system-critical pods, DaemonSets, or pods with local storage.
Two Plugin Types
| Type | What It Does | Example Plugins |
|---|---|---|
| balance | Evicts to redistribute pods more evenly across nodes | LowNodeUtilization, RemoveDuplicates |
| deschedule | Evicts pods violating constraints or cluster rules | RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity |
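To make the relationship between profiles, plugins, and the two plugin types concrete, here is a minimal policy sketch. The profile name is hypothetical; the plugin names are the ones listed in the table above. Each plugin is declared under pluginConfig and then switched on under the extension point that matches its type.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: example-profile          # hypothetical profile name
    pluginConfig:
      - name: "DefaultEvictor"     # guards which pods may be evicted
      - name: "RemoveDuplicates"
      - name: "RemovePodsViolatingNodeTaints"
    plugins:
      balance:                     # plugins that redistribute pods
        enabled:
          - "RemoveDuplicates"
      deschedule:                  # plugins that evict constraint violators
        enabled:
          - "RemovePodsViolatingNodeTaints"
```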
Prerequisites
- Kubernetes 1.19+ — Descheduler v0.30+ requires at least Kubernetes 1.26 for full feature support
- Helm 3 — used for installation in this guide; see note below for other methods
- kubectl — configured with cluster admin access to edit ConfigMaps in kube-system
- Multiple worker nodes — rebalancing has no effect on single-node clusters
- PodDisruptionBudgets set — recommended for production so evictions are graceful
Note: for other installation methods, see kubernetes-sigs.github.io/descheduler.
Install via Helm
The simplest and most reliable install method is via the official Descheduler Helm chart. This also makes configuration upgrades straightforward later.
Add the Descheduler Helm Repository
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update
Install into kube-system Namespace
Installing in kube-system is conventional since the descheduler is a cluster-level component. You can install it in any namespace — just keep the -n flag consistent in all future commands.
helm install my-release \
--namespace kube-system \
descheduler/descheduler
kube-system is used here because the descheduler needs cluster-wide visibility. You can use a dedicated namespace like descheduler if your org prefers separation — just ensure the ServiceAccount has the correct ClusterRole bindings.
Verify the Installation
# Check descheduler pod is Running:
kubectl get pods -n kube-system | grep descheduler
# See all resources created by the Helm release:
kubectl get all -n kube-system -l app.kubernetes.io/name=descheduler
# Confirm ConfigMap was created:
kubectl get configmap -n kube-system | grep descheduler
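You should see a pod named my-release-descheduler-XXXXX with status Running, and a ConfigMap named my-release-descheduler. This guide assumes the descheduler runs as a Deployment (polling mode, running every 5 minutes). Some chart versions default to a CronJob instead; if kubectl get deployment -n kube-system shows no descheduler Deployment, set kind: Deployment in your Helm values.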
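If your chart version does not create a Deployment on its own, a values override along these lines should select Deployment mode. The key names kind and deschedulingInterval reflect my understanding of the chart's values.yaml; confirm them with helm show values descheduler/descheduler before relying on this. The file name is arbitrary.

```yaml
# values.yaml (hypothetical file name)
kind: Deployment             # run continuously instead of as a scheduled job
deschedulingInterval: 5m     # time between descheduling cycles in Deployment mode
```

It can be applied with `helm upgrade --install my-release descheduler/descheduler --namespace kube-system -f values.yaml`.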
View the Default ConfigMap
After installation, a ConfigMap is created with the default policy. This is what runs out of the box — before you make any changes.
kubectl -n kube-system get configmap my-release-descheduler -o yaml
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
    - name: default
      pluginConfig:
      - args:
          podProtections:
            defaultDisabled:
            - PodsWithLocalStorage
            extraEnabled:
            - PodsWithPVC
        name: DefaultEvictor
      - name: RemoveDuplicates
      - args:
          includingInitContainers: true
          podRestartThreshold: 100
        name: RemovePodsHavingTooManyRestarts
      - args:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
        name: RemovePodsViolatingNodeAffinity
      - name: RemovePodsViolatingNodeTaints
      - name: RemovePodsViolatingInterPodAntiAffinity
      - name: RemovePodsViolatingTopologySpreadConstraint
      - args:
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
        name: LowNodeUtilization
      plugins:
        balance:
          enabled:
          - RemoveDuplicates
          - RemovePodsViolatingTopologySpreadConstraint
          - LowNodeUtilization
        deschedule:
          enabled:
          - RemovePodsHavingTooManyRestarts
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
kind: ConfigMap
metadata:
  name: my-release-descheduler
  namespace: kube-system
What the Default Config Does
The default config is conservative by design. It uses very low thresholds (20%) to trigger eviction, meaning a node has to be nearly empty before pods are moved away from a busy one. For most real clusters with memory pressure, this is insufficient.
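DefaultEvictor
Allows eviction of pods with local storage (the default PodsWithLocalStorage protection is listed under defaultDisabled) and protects pods with PVCs (the extra PodsWithPVC protection is listed under extraEnabled). This is the safety gate for all other plugins.
RemoveDuplicates
Evicts duplicate pods (same owner) that land on the same node: it leaves one and evicts the rest so they reschedule elsewhere.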
TooManyRestarts
Evicts pods that have restarted more than 100 times. Helps crashlooping pods escape a bad node.
NodeAffinity
Evicts pods that no longer satisfy their requiredDuringSchedulingIgnoredDuringExecution node affinity rules (e.g. a node label was removed).
NodeTaints
Evicts pods running on nodes they should no longer be on after new taints were added post-scheduling.
InterPodAntiAffinity
Evicts pods co-located with pods they have anti-affinity rules against — fixes violations that arose after pod placement.
TopologySpreadConstraint
Evicts pods that violate existing topologySpreadConstraints — the runtime enforcement partner to your spread config.
LowNodeUtilization
Evicts pods from overutilised nodes (above targetThresholds) so the scheduler can place them on underutilised ones (below thresholds). The key plugin for memory pressure.
With thresholds of 20 for cpu, memory, and pods and targetThresholds of 50, eviction only triggers if one node is under 20% on all three resources AND another is over 50% on at least one. In a busy cluster where most nodes run at 60–90%, this almost never fires. You need to raise the thresholds to match your actual usage.
Apply the Custom Configuration
Edit the ConfigMap directly with kubectl. The key change is raising the thresholds to match real-world node utilisation. These values are percentages of each node's allocatable resources as Kubernetes accounts for them, not VM-level CPU/RAM percentages.
thresholds.cpu: 90 means "a node counts as underutilised on CPU if the CPU requested by its pods is less than 90% of allocatable CPU." By default the plugin works from pod resource requests, not the raw VM CPU reading. kubectl top nodes shows live usage, so also check the Allocated resources section of kubectl describe node before tuning these values.
kubectl -n kube-system edit configmap my-release-descheduler
Replace the entire policy.yaml value with the following custom configuration:
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    # Exclude control-plane / master nodes from descheduling.
    # This selector assumes the role labels carry the value "true". On clusters where
    # the label value is empty it will not exclude those nodes, so check first with
    # kubectl get nodes --show-labels.
    nodeSelector: "node-role.kubernetes.io/master!=true,node-role.kubernetes.io/control-plane!=true"
    profiles:
      - name: default
        pluginConfig:
          - name: "DefaultEvictor"
            args:
              nodeFit: true  # only evict if a better node actually exists
          - name: "RemovePodsViolatingInterPodAntiAffinity"
          - name: "RemoveDuplicates"
          - name: "LowNodeUtilization"
            args:
              thresholds:
                "cpu" : 90     # node < 90% CPU usage → underutilised (evict TO here)
                "memory" : 80  # node < 80% memory → underutilised
                "pods" : 40    # node < 40% pod count → underutilised
              targetThresholds:
                "cpu" : 90     # node > 90% CPU usage → overutilised (evict FROM here)
                "memory" : 85  # node > 85% memory → overutilised
                "pods" : 50    # node > 50% pod count → overutilised
        plugins:
          balance:
            enabled:
              - "LowNodeUtilization"
              - "RemoveDuplicates"
          deschedule:
            enabled:
              - "RemovePodsViolatingInterPodAntiAffinity"
kind: ConfigMap
metadata:
  name: my-release-descheduler
  namespace: kube-system
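One caveat with editing the ConfigMap by hand: it belongs to the Helm release, so a later helm upgrade can overwrite manual changes. A possible alternative is to keep the policy in a values file. The sketch below assumes the chart exposes the policy under the deschedulerPolicy and deschedulerPolicyAPIVersion keys, which is my understanding of its values.yaml; verify with helm show values descheduler/descheduler. Only part of the custom policy is repeated here to keep it short.

```yaml
# values.yaml (hypothetical file name)
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"
deschedulerPolicy:
  profiles:
    - name: default
      pluginConfig:
        - name: "DefaultEvictor"
          args:
            nodeFit: true
        - name: "LowNodeUtilization"
          args:
            thresholds:
              cpu: 90
              memory: 80
              pods: 40
            targetThresholds:
              cpu: 90
              memory: 85
              pods: 50
      plugins:
        balance:
          enabled:
            - "LowNodeUtilization"
```

Applying it with `helm upgrade my-release descheduler/descheduler --namespace kube-system -f values.yaml` keeps the policy under version control alongside the rest of the release configuration.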
Default vs Custom — Side-by-Side
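Here's exactly what changed between the two configs and why each change matters.
| Setting | Default | Custom |
|---|---|---|
| nodeSelector | Not set | Excludes master / control-plane nodes |
| DefaultEvictor args | podProtections (PodsWithLocalStorage disabled, PodsWithPVC enabled) | nodeFit: true |
| thresholds (cpu / memory / pods) | 20 / 20 / 20 | 90 / 80 / 40 |
| targetThresholds (cpu / memory / pods) | 50 / 50 / 50 | 90 / 85 / 50 |
| balance plugins | RemoveDuplicates, RemovePodsViolatingTopologySpreadConstraint, LowNodeUtilization | LowNodeUtilization, RemoveDuplicates |
| deschedule plugins | RemovePodsHavingTooManyRestarts, RemovePodsViolatingNodeTaints, RemovePodsViolatingNodeAffinity, RemovePodsViolatingInterPodAntiAffinity | RemovePodsViolatingInterPodAntiAffinity |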
Why These Changes Improve Production Stability
🎯 Realistic Thresholds
Default 20/50 barely ever fires on busy clusters. 90/85 matches actual node usage — eviction triggers when nodes are genuinely overloaded.
🛡️ nodeFit: true
Without this, a pod can be evicted and then fail to reschedule (Pending). nodeFit: true ensures a valid destination exists before eviction happens.
🚫 Control-Plane Excluded
The nodeSelector prevents the descheduler from touching control-plane nodes — critical in clusters where masters also run workloads.
🔌 Focused Plugin Set
Fewer plugins means fewer unexpected evictions. Start minimal and add plugins back one by one once you understand the eviction behaviour on your cluster.
Understanding LowNodeUtilization Thresholds
LowNodeUtilization uses two threshold bands. A node is underutilised if it is below all thresholds values. A node is overutilised if it exceeds any targetThresholds value. Pods are evicted from overutilised nodes and land on underutilised ones.
cpu: 90 does not mean 90% of the host machine's CPU. It means 90% of the node's allocatable CPU as seen by Kubernetes, which by default is computed from pod requests rather than live usage. Baseline with kubectl describe node (Allocated resources) alongside kubectl top nodes before setting these values.
Custom Threshold Visualisation
| Resource | Threshold | TargetThreshold | Meaning |
|---|---|---|---|
| CPU | 90% | 90% | Evict pods from nodes using >90% CPU to nodes using <90% |
| Memory | 80% | 85% | Evict pods from nodes using >85% memory to nodes using <80% |
| Pods | 40% | 50% | Evict pods from nodes above 50% pod capacity to nodes below 40% |
Run kubectl top nodes and note the average CPU% and Memory% across your worker nodes. Set targetThresholds about 5–10 percentage points above your average. Set thresholds about 10–15 percentage points below your average. This creates a healthy eviction band without thrashing.
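As a worked example of that rule of thumb, suppose the worker nodes average roughly 60% CPU and 70% memory. These averages are made up for illustration. Taking the upper end of each range gives the following LowNodeUtilization arguments.

```yaml
# Illustrative values only, assuming averages of 60% CPU and 70% memory.
- name: "LowNodeUtilization"
  args:
    thresholds:            # average minus about 15 points
      "cpu": 45            # 60 - 15
      "memory": 55         # 70 - 15
    targetThresholds:      # average plus about 10 points
      "cpu": 70            # 60 + 10
      "memory": 80         # 70 + 10
```

Both maps list the same resources here, which keeps each lower bound paired with an upper bound. The resulting band (45 to 70 for CPU, 55 to 80 for memory) is where nodes count as appropriately utilised and are left alone.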
Rollout Restart to Apply Config Changes
After editing the ConfigMap, the running descheduler pod does not automatically pick up the new config. You must restart the Deployment to reload the ConfigMap.
kubectl rollout restart deployment/my-release-descheduler \
-n kube-system
# Watch the rollout complete:
kubectl rollout status deployment/my-release-descheduler \
-n kube-system
Watch the Descheduler Logs
The descheduler logs are your window into what's actually happening. You'll see node utilisation readings, how many pods match each plugin's criteria, and explicit lines for every pod that gets evicted.
kubectl -n kube-system logs deployment/my-release-descheduler -f
What to Look For in the Logs
# Descheduler prints node stats each cycle:
I Node "node-1" is over-utilized with usage: map[cpu:91 memory:88 pods:52]
I Node "node-2" is under-utilized with usage: map[cpu:42 memory:35 pods:18]
I Node "node-3" is appropriately utilized
# A pod getting evicted from the overloaded node:
I Evicted pod: "my-app-6d8f9b7c4-xk2pq" in namespace "production" from node "node-1"
# Summary at end of each cycle:
I Number of evicted pods: 3
# If no action needed:
I No pods to evict on node "node-1", skipping
If nothing is being evicted, run kubectl top nodes and compare the output against your thresholds and targetThresholds values. The gap may be too wide.
Consider using ScheduleAnyway in your topology constraints while you tune.
Full Plugin Reference
| Plugin | Type | Default | What It Does |
|---|---|---|---|
| LowNodeUtilization | balance | ✓ ON | Evicts pods from over-used nodes to under-used ones based on CPU/memory/pod thresholds |
| RemoveDuplicates | balance | ✓ ON | Spreads pods from the same ReplicaSet/Deployment that landed on the same node |
| RemovePodsViolatingTopologySpreadConstraint | balance | ✓ ON | Enforces topologySpreadConstraints at runtime — evicts violating pods |
| RemovePodsViolatingInterPodAntiAffinity | deschedule | ✓ ON | Evicts pods co-located with pods they have anti-affinity rules against |
| RemovePodsViolatingNodeTaints | deschedule | ✓ ON | Evicts pods running on nodes they'd now be blocked from if scheduled fresh |
| RemovePodsViolatingNodeAffinity | deschedule | ✓ ON | Evicts pods that violate their required node affinity rules (post-schedule) |
| RemovePodsHavingTooManyRestarts | deschedule | ✓ ON | Evicts crashlooping pods (restart count > threshold) to let them reschedule on a better node |
| PodLifeTime | deschedule | OFF | Evicts pods older than a configured age — useful for refresh-based workloads |
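PodLifeTime is the only plugin in the table that is off by default. A minimal sketch of switching it on could look like this; the 24-hour limit is an arbitrary example value, and the fragment would sit inside the policy.yaml of the ConfigMap next to the other plugins.

```yaml
profiles:
  - name: default
    pluginConfig:
      - name: "DefaultEvictor"
      - name: "PodLifeTime"
        args:
          maxPodLifeTimeSeconds: 86400   # evict pods older than 24 hours (example value)
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"
```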
Final Outcome
With the descheduler running alongside topologySpreadConstraints, your cluster now has both proactive and reactive pod placement — the strongest possible combination for node balance.
⚖️ What You've Built
🔌 Active Plugins
Key Takeaways
- Descheduler ≠ Scheduler — it evicts, the scheduler re-places. They work as a team.
- Always restart after ConfigMap edits — the pod reads config only at startup
- Thresholds are pod utilisation, not VM CPU/RAM — baseline with kubectl top nodes
- Start with nodeFit: true — prevents eviction storms on small clusters
- Pair with topologySpreadConstraints — they solve different halves of the same problem
- Watch logs after each config change — look for eviction count and node utilisation lines
For production, define a PodDisruptionBudget for each Deployment. It sets the minimum number of replicas that must stay up, and the descheduler's evictions respect it, so evictions stay graceful and safe.
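A minimal PodDisruptionBudget might look like the sketch below. The name, namespace, label, and minAvailable value are hypothetical and should match your own Deployment.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb          # hypothetical name
  namespace: production     # hypothetical namespace
spec:
  minAvailable: 2           # never let evictions drop the app below 2 ready pods
  selector:
    matchLabels:
      app: my-app           # must match the Deployment's pod labels
```

With this in place, an eviction that would leave fewer than two ready my-app pods is refused, and the descheduler moves on rather than forcing the pod out.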