☸️ Kubernetes Cluster Health Series

Kubernetes Descheduler
Rebalance Overloaded Nodes Automatically

When a node fills up fast, the scheduler alone isn't enough. The Descheduler periodically evicts pods from poorly balanced nodes so the scheduler can re-place them, working hand-in-hand with topologySpreadConstraints.

Tags: Descheduler · LowNodeUtilization · Helm Install · ConfigMap · Pod Eviction · kube-system · Cluster Rebalancing

How the Descheduler Works

The default Kubernetes scheduler places pods once and never revisits that decision. Over time, nodes drift — some get overloaded while others sit idle. The Descheduler runs on a loop, detects these imbalances, and evicts pods so the scheduler can re-place them more optimally. Pair it with topologySpreadConstraints for complete placement control.

📦 Node Overloaded: memory / CPU fills up over time
🔍 Descheduler: detects imbalance, evicts pods
☸️ Scheduler: re-places pods on better nodes
⚖️ Balanced: even load across all nodes
🧠
SECTION 01

Why the Descheduler Exists

The Kubernetes scheduler makes placement decisions at pod creation time. Once placed, it never revisits. This causes a well-known cluster drift problem over time.

📈 Node Drift

Nodes added later to the cluster stay underutilised: existing pods were already placed on the older nodes and are never moved, so only newly created pods can land on the new capacity.

🔥 Memory Pressure

One node silently fills up with pods while others are idle. OOMKill events and throttling start appearing on the busy node.

💀 Constraint Violations

After node taints, affinity rules, or topology constraints are added, existing pods may violate them — but they're never moved.

🔁 Restart Loops

CrashLooping pods stay on a bad node forever. The descheduler can evict them so they land on a healthier node.

🔗
Use with topologySpreadConstraints: The Descheduler is the runtime complement to topologySpreadConstraints. Spread constraints prevent bad placement at creation time. The Descheduler fixes bad placement that already happened. Together they give you complete, continuous cluster balance. See the topologySpreadConstraints guide for the other half of this setup.
💡
SECTION 02

Core Concepts

🔁 Eviction

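The descheduler doesn't move pods directly. It evicts them through the Kubernetes Eviction API (a graceful delete that honours PodDisruptionBudgets). The pod's controller then creates a replacement, and the scheduler places that replacement on a better node.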

📋 Profiles

A named set of plugins that define what to check and evict. You can have multiple profiles with different policies.

🔌 Plugins

Individual rules — e.g. LowNodeUtilization, RemoveDuplicates. Each plugin checks one specific condition and evicts pods that violate it.

🛡️ DefaultEvictor

A mandatory plugin that guards which pods are evictable — it prevents evicting system-critical pods, DaemonSets, or pods with local storage.

Two Plugin Types

Type | What It Does | Example Plugins
balance | Evicts to redistribute pods more evenly across nodes | LowNodeUtilization, RemoveDuplicates
deschedule | Evicts pods violating constraints or cluster rules | RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity
📋
SECTION 03

Prerequisites

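  • Kubernetes: a version supported by your Descheduler release. Descheduler minor versions track Kubernetes minor versions (for example, v0.30 is built against Kubernetes 1.30), so check the compatibility matrix in the project README before installing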
  • Helm 3 — used for installation in this guide; see note below for other methods
  • kubectl — configured with cluster admin access to edit ConfigMaps in kube-system
  • Multiple worker nodes — rebalancing has no effect on single-node clusters
  • PodDisruptionBudgets set — recommended for production so evictions are graceful
🔗
Other Installation Methods: Helm is the simplest method and is used throughout this guide. If you prefer to install via raw manifests, Kustomize, or as a CronJob, the official docs cover all options at kubernetes-sigs.github.io/descheduler.

PHASE 01

Install via Helm

The simplest and most reliable install method is via the official Descheduler Helm chart. This also makes configuration upgrades straightforward later.

01

Add the Descheduler Helm Repository

Shell
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update
02

Install into kube-system Namespace

Installing in kube-system is conventional since the descheduler is a cluster-level component. You can install it in any namespace — just keep the -n flag consistent in all future commands.

Shell
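# kind=Deployment runs the descheduler as a long-lived pod, which the
# rollout and log commands later in this guide rely on.
# The chart's own default is kind=CronJob.
helm install my-release \
  --namespace kube-system \
  --set kind=Deployment \
  descheduler/descheduler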
💡
Namespace choice: kube-system is used here by convention for cluster-level components. The descheduler's cluster-wide visibility comes from its ClusterRole, not from the namespace, so you can use a dedicated namespace like descheduler if your org prefers separation. Just ensure the ServiceAccount has the correct ClusterRole bindings.
🔍
PHASE 01 — Step 3

Verify the Installation

kubectl — Verify Pods & Resources
# Check descheduler pod is Running:
kubectl get pods -n kube-system | grep descheduler

# See all resources created by the Helm release:
kubectl get all -n kube-system -l app.kubernetes.io/name=descheduler

# Confirm ConfigMap was created:
kubectl get configmap -n kube-system | grep descheduler
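Expected output: You should see a pod named my-release-descheduler-XXXXX with status Running, and a ConfigMap named my-release-descheduler. This guide assumes the descheduler runs as a Deployment (Helm value kind=Deployment), which keeps one pod running and starts a descheduling cycle every 5 minutes by default (deschedulingInterval). If the chart was installed with kind=CronJob, you will instead see a CronJob and short-lived pods that exit after each run.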
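Because the rest of this guide depends on whether the release created a Deployment or a CronJob, it can help to list both kinds in one call. The helper below is a minimal sketch: the function name descheduler_workloads is made up for illustration, and it assumes the chart applies the app.kubernetes.io/name=descheduler label used in the commands above.

```shell
#!/usr/bin/env bash
# descheduler_workloads [NAMESPACE]
# Lists any Deployment or CronJob carrying the descheduler label in the
# given namespace (kube-system if omitted). Hypothetical helper name.
descheduler_workloads() {
  local namespace=${1:-kube-system}
  kubectl get deployment,cronjob \
    -n "$namespace" \
    -l app.kubernetes.io/name=descheduler \
    -o name
}

# Usage (needs a configured kubectl):
#   descheduler_workloads kube-system
```

A line starting with deployment.apps/ means the rollout restart and log commands in this guide apply as written; a line starting with cronjob.batch/ means each run is a separate short-lived pod.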

📄
PHASE 02 — Step 1

View the Default ConfigMap

After installation, a ConfigMap is created with the default policy. This is what runs out of the box — before you make any changes.

Shell — Open Default Config
kubectl -n kube-system get configmap my-release-descheduler -o yaml
Default ConfigMap — Full Content
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
    - name: default
      pluginConfig:
      - args:
          podProtections:
            defaultDisabled:
            - PodsWithLocalStorage
            extraEnabled:
            - PodsWithPVC
        name: DefaultEvictor
      - name: RemoveDuplicates
      - args:
          includingInitContainers: true
          podRestartThreshold: 100
        name: RemovePodsHavingTooManyRestarts
      - args:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
        name: RemovePodsViolatingNodeAffinity
      - name: RemovePodsViolatingNodeTaints
      - name: RemovePodsViolatingInterPodAntiAffinity
      - name: RemovePodsViolatingTopologySpreadConstraint
      - args:
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
        name: LowNodeUtilization
      plugins:
        balance:
          enabled:
          - RemoveDuplicates
          - RemovePodsViolatingTopologySpreadConstraint
          - LowNodeUtilization
        deschedule:
          enabled:
          - RemovePodsHavingTooManyRestarts
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
kind: ConfigMap
metadata:
  name: my-release-descheduler
  namespace: kube-system
🔍
PHASE 02 — Step 2

What the Default Config Does

The default config is conservative by design. LowNodeUtilization only acts when at least one node is below 20% on CPU, memory, and pod count at the same time, so a node has to be nearly empty before pods are moved away from a busy one. For most real clusters under memory pressure, this is insufficient.

DefaultEvictor
Guard

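Decides which pods the other plugins may evict. In this config the built-in protection for pods with local storage is switched off (defaultDisabled: PodsWithLocalStorage), so those pods can be evicted, while pods with PVCs get extra protection (extraEnabled: PodsWithPVC) and are left alone. This is the safety gate for all other plugins.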

RemoveDuplicates
Balance

Evicts duplicate pods (same owner) that land on the same node — leaves one, evicts the rest so they reschedule elsewhere.

RemovePodsHavingTooManyRestarts
Deschedule

Evicts pods whose containers (init containers included) have restarted 100 times or more in total. Helps crashlooping pods escape a bad node.

RemovePodsViolatingNodeAffinity
Deschedule

Evicts pods that no longer satisfy their requiredDuringSchedulingIgnoredDuringExecution node affinity rules (e.g. a node label was removed).

RemovePodsViolatingNodeTaints
Deschedule

Evicts pods running on nodes they should no longer be on after new taints were added post-scheduling.

RemovePodsViolatingInterPodAntiAffinity
Deschedule

Evicts pods co-located with pods they have anti-affinity rules against — fixes violations that arose after pod placement.

RemovePodsViolatingTopologySpreadConstraint
Balance

Evicts pods that violate existing topologySpreadConstraints — the runtime enforcement partner to your spread config.

LowNodeUtilization
Balance

Evicts pods from overutilised nodes (above targetThresholds) so they land on underutilised ones (below thresholds). The key plugin for memory pressure.

⚠️
Default thresholds are too conservative for most clusters: With thresholds of 20 for cpu, memory, and pods and targetThresholds of 50, eviction only triggers if at least one node is under 20% on all three resources AND another node is over 50% on any of them. In a busy cluster where most nodes run at 60–90%, no node counts as underutilised, so this almost never fires. You need to raise the thresholds to match your actual usage.
✏️
PHASE 02 — Step 3

Apply the Custom Configuration

Edit the ConfigMap directly with kubectl. The key change is raising the thresholds to match real-world node utilisation. By default these values are percentages of a node's allocatable capacity taken up by pod resource requests, not VM-level CPU/RAM readings.

💡
Important: These are pod request percentages, not VM metrics. thresholds.cpu: 90 means "a node can only count as underutilised if the CPU requests of its pods add up to less than 90% of allocatable CPU" (memory and pods must also be below their thresholds). By default the Descheduler looks at what pods request, not at live usage, so kubectl top nodes can show a different picture. Tune these against the "Allocated resources" section of kubectl describe node for each worker.
Shell — Open ConfigMap for Editing
kubectl -n kube-system edit configmap my-release-descheduler

Replace the entire policy.yaml value with the following custom configuration:

Custom ConfigMap — Production-Tuned Config
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"

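    # Exclude control-plane / master nodes from descheduling.
    # "!key" matches only nodes that lack the label entirely. kubeadm sets
    # these labels with an empty value, which a "!=true" test would not exclude.
    nodeSelector: "!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane"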

    profiles:
      - name: default
        pluginConfig:
        - name: "DefaultEvictor"
          args:
            nodeFit: true    # skip eviction if the pod would not fit on any other node
        - name: "RemovePodsViolatingInterPodAntiAffinity"
        - name: "RemoveDuplicates"
        - name: "LowNodeUtilization"
          args:
            thresholds:             # underutilised = below ALL of these (evict TO here)
              "cpu"    : 90   # requested CPU    < 90% of allocatable
              "memory" : 80   # requested memory < 80% of allocatable
              "pods"   : 40   # pod count        < 40% of pod capacity
            targetThresholds:       # overutilised = above ANY of these (evict FROM here)
              "cpu"    : 90   # requested CPU    > 90% of allocatable
              "memory" : 85   # requested memory > 85% of allocatable
              "pods"   : 50   # pod count        > 50% of pod capacity

        plugins:
          balance:
            enabled:
              - "LowNodeUtilization"
              - "RemoveDuplicates"
          deschedule:
            enabled:
              - "RemovePodsViolatingInterPodAntiAffinity"
kind: ConfigMap
metadata:
  name: my-release-descheduler
  namespace: kube-system
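
Before applying a policy like this, a quick sanity check on each threshold pair can catch typos. The sketch below assumes two rules: every value lies between 0 and 100, and a thresholds value never exceeds its targetThresholds counterpart. The function name check_band is hypothetical, and the Descheduler's own validation may differ in detail.

```shell
#!/usr/bin/env bash
# check_band NAME THRESHOLD TARGET_THRESHOLD
# Prints "ok" plus the band width, or an error and a non-zero status.
# Assumed rules: 0 <= THRESHOLD <= TARGET_THRESHOLD <= 100.
check_band() {
  local name=$1 low=$2 high=$3
  if (( low < 0 || high > 100 )); then
    echo "$name: values must be between 0 and 100"
    return 1
  fi
  if (( low > high )); then
    echo "$name: thresholds ($low) exceeds targetThresholds ($high)"
    return 1
  fi
  echo "$name: ok (band width $(( high - low )))"
}

# Values from the custom config above:
check_band cpu    90 90   # cpu: ok (band width 0)
check_band memory 80 85   # memory: ok (band width 5)
check_band pods   40 50   # pods: ok (band width 10)
```

A band width of 0, as with cpu here, leaves no neutral zone for that resource, so widening it is one thing to try if evictions turn out to be too frequent.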
⚖️
PHASE 02 — Step 4

Default vs Custom — Side-by-Side

Here's exactly what changed between the two configs and why each change matters.

⛔ Default Config
nodeSelector not set — control-plane nodes included
nodeFit not set — may evict with nowhere to go
thresholds cpu/mem/pods → 20 / 20 / 20 (very conservative)
targetThresholds → 50 / 50 / 50
Many plugins enabled — TooManyRestarts, NodeAffinity, NodeTaints, TopologySpread, plus balance plugins
PodsWithLocalStorage protection disabled (pods with local storage can be evicted); PodsWithPVC protection enabled
✅ Custom Config
nodeSelector excludes master + control-plane nodes explicitly
nodeFit: true — only evicts if a suitable destination node exists
thresholds cpu/mem/pods → 90 / 80 / 40 (matches real-world usage)
targetThresholds → 90 / 85 / 50
Focused plugin set — only LowNodeUtilization, RemoveDuplicates, InterPodAntiAffinity
Simpler, lower risk — fewer surprise evictions in production

Why These Changes Improve Production Stability

🎯 Realistic Thresholds

Default 20/50 barely ever fires on busy clusters. 90/85 matches actual node usage — eviction triggers when nodes are genuinely overloaded.

🛡️ nodeFit: true

Without this, a pod can be evicted and then fail to reschedule (Pending). nodeFit: true ensures a valid destination exists before eviction happens.

🚫 Control-Plane Excluded

The nodeSelector prevents the descheduler from touching control-plane nodes — critical in clusters where masters also run workloads.

🔌 Focused Plugin Set

Fewer plugins means fewer unexpected evictions. Start minimal and add plugins back one by one once you understand the eviction behaviour on your cluster.

📊
PHASE 02 — Step 5

Understanding LowNodeUtilization Thresholds

LowNodeUtilization uses two threshold bands. A node is underutilised if it is below all thresholds values. A node is overutilised if it exceeds any targetThresholds value. Nodes in between are left alone. Pods are evicted from overutilised nodes so that the scheduler can place them on underutilised ones; if no node is underutilised, the plugin does nothing.
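
The all/any rule is easy to misread, so here is a small sketch of it in shell. It is a simplified model of the rule as described in this section, not the Descheduler's actual code; the function name classify_node and the sample percentages are made up, and the threshold values mirror the custom config in this guide.

```shell
#!/usr/bin/env bash
# classify_node CPU MEM PODS
# Arguments are a node's utilisation percentages as whole numbers.
LOW_CPU=90;  LOW_MEM=80;  LOW_PODS=40     # thresholds
HIGH_CPU=90; HIGH_MEM=85; HIGH_PODS=50    # targetThresholds

classify_node() {
  local cpu=$1 mem=$2 pods=$3
  if (( cpu < LOW_CPU && mem < LOW_MEM && pods < LOW_PODS )); then
    # Below ALL three thresholds.
    echo "underutilised"
  elif (( cpu > HIGH_CPU || mem > HIGH_MEM || pods > HIGH_PODS )); then
    # Above ANY single targetThreshold.
    echo "overutilised"
  else
    echo "appropriately utilised"
  fi
}

classify_node 91 88 52   # overutilised
classify_node 42 35 18   # underutilised
classify_node 60 70 45   # appropriately utilised
```

The third node shows why the pods threshold matters: its CPU and memory are well below their thresholds, yet a pod count of 45% keeps it out of the underutilised group.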

⚠️
These are NOT VM resource percentages: cpu: 90 does not mean 90% of the host machine's CPU. It means 90% of the node's allocatable CPU as seen by Kubernetes, measured by default from the resource requests of the pods on the node rather than from live usage. Always baseline with the "Allocated resources" section of kubectl describe node before setting these values.

Custom Threshold Visualisation

[Figure: threshold bands per resource, with nodes underutilised below the threshold and overutilised above the targetThreshold. CPU: 90% / 90%. Memory: 80% / 85%. Pods: 40% / 50%.]
Resource | Threshold | TargetThreshold | Meaning
CPU | 90% | 90% | Evict pods from nodes with >90% of CPU requested, towards nodes with <90%
Memory | 80% | 85% | Evict pods from nodes with >85% of memory requested, towards nodes with <80%
Pods | 40% | 50% | Evict pods from nodes above 50% pod capacity, towards nodes below 40%
💡
How to baseline your thresholds: Run kubectl describe node on each worker and note the CPU and memory request percentages under "Allocated resources". Set targetThresholds about 5–10% above the average. Set thresholds about 10–15% below the average. This creates a healthy eviction band without thrashing.
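
Since LowNodeUtilization by default compares pod requests with allocatable capacity, the percentage it works with can be reproduced by hand from the figures in the "Allocated resources" section of kubectl describe node. A minimal sketch with made-up numbers follows; the function name request_pct is hypothetical.

```shell
#!/usr/bin/env bash
# request_pct REQUESTED ALLOCATABLE
# Prints REQUESTED / ALLOCATABLE as a whole-number percentage (rounded down).
# Both arguments must use the same unit, e.g. CPU millicores or memory MiB.
request_pct() {
  local requested=$1 allocatable=$2
  echo $(( requested * 100 / allocatable ))
}

# Hypothetical node: 3900m allocatable CPU, pods request 3600m in total.
request_pct 3600 3900    # prints 92
# Hypothetical node: 15000 MiB allocatable memory, pods request 9000 MiB.
request_pct 9000 15000   # prints 60
```

With the custom config in this guide, the first node would exceed the 90% CPU targetThreshold even if kubectl top nodes showed it mostly idle.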

🔄
PHASE 03 — Step 1

Rollout Restart to Apply Config Changes

After editing the ConfigMap, the running descheduler pod does not automatically pick up the new config. When the descheduler runs as a Deployment, you must restart it to reload the ConfigMap.

kubectl — Restart the Descheduler
kubectl rollout restart deployment/my-release-descheduler \
  -n kube-system

# Watch the rollout complete:
kubectl rollout status deployment/my-release-descheduler \
  -n kube-system
⚠️
Always restart after ConfigMap changes: In Deployment mode, editing the ConfigMap alone has no effect. The descheduler reads the policy only at startup. A rollout restart creates a new pod that loads the updated config from scratch.
📋
PHASE 03 — Step 2

Watch the Descheduler Logs

The descheduler logs are your window into what's actually happening. You'll see node utilisation readings, how many pods match each plugin's criteria, and explicit lines for every pod that gets evicted.

kubectl — Tail Descheduler Logs
kubectl -n kube-system logs deployment/my-release-descheduler -f

What to Look For in the Logs

Example Log Output, Annotated (simplified; exact wording and format vary by Descheduler version)
# Descheduler prints node stats each cycle:
I Node "node-1" is over-utilized with usage: map[cpu:91 memory:88 pods:52]
I Node "node-2" is under-utilized with usage: map[cpu:42 memory:35 pods:18]
I Node "node-3" is appropriately utilized

# A pod getting evicted from the overloaded node:
I Evicted pod: "my-app-6d8f9b7c4-xk2pq" in namespace "production" from node "node-1"

# Summary at end of each cycle:
I Number of evicted pods: 3

# If no action needed:
I No pods to evict on node "node-1", skipping
💡
Seeing 0 evictions? This is often fine: it means your thresholds are not being breached. If you expect evictions but see none, compare the "Allocated resources" percentages from kubectl describe node against your thresholds and targetThresholds values. Most often no node is below every thresholds value, so nothing counts as underutilised.
⚠️
Seeing too many evictions? If pods are being evicted and rescheduled repeatedly (thrashing), your thresholds and targetThresholds bands are too close together. Increase the gap between them, or temporarily set whenUnsatisfiable: ScheduleAnyway in your topologySpreadConstraints while you tune.
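
To put a number on "too many", the eviction lines can be counted from a saved or piped log. The filter below is a sketch that assumes each eviction produces a line containing the text "evicted pod", as in the sample output above; real log wording depends on the Descheduler version, so adjust the pattern to what your logs show. The function name count_evictions is hypothetical.

```shell
#!/usr/bin/env bash
# count_evictions: reads log lines on stdin and prints how many of them
# report an individual pod eviction. The per-cycle summary line
# ("Number of evicted pods") is excluded from the count.
count_evictions() {
  grep -i 'evicted pod' | grep -vic 'number of evicted pods'
}

# Sample input copied from the annotated log output in this guide.
sample_logs='I Node "node-1" is over-utilized with usage: map[cpu:91 memory:88 pods:52]
I Evicted pod: "my-app-6d8f9b7c4-xk2pq" in namespace "production" from node "node-1"
I Number of evicted pods: 3'

printf '%s\n' "$sample_logs" | count_evictions   # prints 1
```

Against a live cluster, the same filter can be fed from kubectl -n kube-system logs deployment/my-release-descheduler --since=1h.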

📚
REFERENCE

Full Plugin Reference

Plugin | Type | Default | What It Does
LowNodeUtilization | balance | ✓ ON | Evicts pods from over-used nodes to under-used ones based on CPU/memory/pod thresholds
RemoveDuplicates | balance | ✓ ON | Spreads pods from the same ReplicaSet/Deployment that landed on the same node
RemovePodsViolatingTopologySpreadConstraint | balance | ✓ ON | Enforces topologySpreadConstraints at runtime by evicting violating pods
RemovePodsViolatingInterPodAntiAffinity | deschedule | ✓ ON | Evicts pods co-located with pods they have anti-affinity rules against
RemovePodsViolatingNodeTaints | deschedule | ✓ ON | Evicts pods running on nodes they'd now be blocked from if scheduled fresh
RemovePodsViolatingNodeAffinity | deschedule | ✓ ON | Evicts pods that violate their required node affinity rules (post-schedule)
RemovePodsHavingTooManyRestarts | deschedule | ✓ ON | Evicts crashlooping pods (restart count at or above the threshold) to let them reschedule on a better node
PodLifeTime | deschedule | OFF | Evicts pods older than a configured age; useful for refresh-based workloads
SECTION 08

Final Outcome

With the descheduler running alongside topologySpreadConstraints, your cluster now has both proactive and reactive pod placement — the strongest possible combination for node balance.

⚖️ What You've Built

🔄 Continuous automatic rebalancing
🔥 Nodes with heavily requested memory are relieved, lowering OOMKill risk
🚫 Control-plane nodes excluded from eviction
🛡️ nodeFit prevents orphaned Pending pods
📋 ConfigMap-driven tuning: edit the policy and restart, no Helm redeploy needed

🔌 Active Plugins

LowNodeUtilization
RemoveDuplicates
RemovePodsViolatingInterPodAntiAffinity
DefaultEvictor (nodeFit: true)
🔲 Others available when needed

Key Takeaways

  • Descheduler ≠ Scheduler — it evicts, the scheduler re-places. They work as a team.
  • Always restart after ConfigMap edits — the pod reads config only at startup
  • Thresholds are pod request percentages, not VM CPU/RAM: baseline with the "Allocated resources" section of kubectl describe node
  • Start with nodeFit: true — prevents eviction storms on small clusters
  • Pair with topologySpreadConstraints — they solve different halves of the same problem
  • Watch logs after each config change — look for eviction count and node utilisation lines
🔜
Next Step: Add PodDisruptionBudgets. To ensure evictions don't take down your service, define a PodDisruptionBudget for each Deployment. The descheduler evicts through the Eviction API, which refuses any eviction that would drop a workload below its budget, so evictions stay graceful and safe.
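
As a starting point, the snippet below writes a minimal PodDisruptionBudget manifest to a file for review. The names my-app and my-app-pdb, the production namespace, the app label, and the minAvailable value are hypothetical placeholders; replace them with the labels and replica counts of your own Deployment.

```shell
#!/usr/bin/env bash
# Writes a PodDisruptionBudget manifest to my-app-pdb.yaml.
# All names, labels and numbers here are placeholders for illustration.
cat > my-app-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
  minAvailable: 2          # refuse evictions that would leave fewer than 2 healthy pods
  selector:
    matchLabels:
      app: my-app          # must match the pod labels of the Deployment
EOF

# Apply it once reviewed:
#   kubectl apply -f my-app-pdb.yaml
```

A budget only helps when the Deployment runs more replicas than minAvailable; with exactly two replicas and minAvailable: 2, no pod of that workload could ever be evicted.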