Kubernetes Descheduler Setup | Darin DevOps Blogs

🧠

SECTION 01

Why the Descheduler Exists

The Kubernetes scheduler makes placement decisions only at pod creation time. Once a pod is placed, the scheduler never revisits that decision, so the cluster drifts out of balance as nodes, workloads, and constraints change.

📈 Node Drift

Nodes added later to the cluster are underutilised — the scheduler filled older nodes first and new pods may still prefer them.

🔥 Memory Pressure

One node silently fills up with pods while others are idle. OOMKill events and throttling start appearing on the busy node.

💀 Constraint Violations

After node taints, affinity rules, or topology constraints are added, existing pods may violate them — but they're never moved.

🔁 Restart Loops

CrashLooping pods stay on a bad node forever. The descheduler can evict them so they land on a healthier node.

🔗

Use with topologySpreadConstraints

The Descheduler is the runtime complement to topologySpreadConstraints. Spread constraints prevent bad placement at creation time. The Descheduler fixes bad placement that already happened. Together they keep the cluster balanced both at scheduling time and afterwards. See the topologySpreadConstraints guide for the other half of this setup.

💡

SECTION 02

Core Concepts

🔁 Eviction

The descheduler doesn't move pods directly. It evicts them (graceful delete), and the scheduler re-places them on a better node.

📋 Profiles

A named set of plugins that define what to check and evict. You can have multiple profiles with different policies.

🔌 Plugins

Individual rules — e.g. LowNodeUtilization, RemoveDuplicates. Each plugin checks one specific condition and evicts pods that violate it.

🛡️ DefaultEvictor

A mandatory plugin that decides which pods are evictable. By default it prevents evicting system-critical pods, DaemonSet pods, and pods with local storage.

Two Plugin Types

Type	What It Does	Example Plugins
balance	Evicts to redistribute pods more evenly across nodes	LowNodeUtilization, RemoveDuplicates
deschedule	Evicts pods violating constraints or cluster rules	RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity

📋

SECTION 03

Prerequisites

Kubernetes 1.19+: Descheduler v0.30+ requires at least Kubernetes 1.26 for full feature support; check the compatibility matrix in the project README for your exact version pair
Helm 3: used for installation in this guide; see note below for other methods
kubectl: configured with cluster admin access to edit ConfigMaps in kube-system
Multiple worker nodes: rebalancing has no effect on single-node clusters
PodDisruptionBudgets set: recommended for production so evictions never take a workload below its minimum available replicas
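The tool prerequisites above can be checked with a short pre-flight script. This is only a sketch: require_cmd and preflight are hypothetical helper names, and the script only confirms that kubectl and helm are on PATH, not that they are the right versions.

```shell
# Pre-flight sketch: confirm the CLI tools used in this guide are installed.
# require_cmd and preflight are hypothetical helper names.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check every tool and return non-zero if any of them is absent.
preflight() {
  status=0
  for tool in kubectl helm; do
    require_cmd "$tool" || status=1
  done
  return "$status"
}
```

Run `preflight` first, then `kubectl get nodes` to confirm that more than one worker node is Ready.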

🔗

Other Installation Methods

Helm is the simplest method and is used throughout this guide. If you prefer to install via raw manifests, Kustomize, or as a CronJob, the official docs cover all options at kubernetes-sigs.github.io/descheduler.

⛵

PHASE 01

Install via Helm

The simplest and most reliable install method is via the official Descheduler Helm chart. This also makes configuration upgrades straightforward later.

01

Add the Descheduler Helm Repository

Shell

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update

02

Install into kube-system Namespace

Installing in kube-system is conventional since the descheduler is a cluster-level component. You can install it in any namespace — just keep the -n flag consistent in all future commands.

Shell

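# kind=Deployment: later steps restart a Deployment; the chart's default kind is CronJob
helm install my-release \
  --namespace kube-system \
  --set kind=Deployment \
  descheduler/descheduler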

💡

Namespace choice

kube-system is used here by convention. The descheduler's cluster-wide visibility comes from its ClusterRole and ClusterRoleBinding, not from the namespace, so a dedicated namespace such as descheduler works too if your org prefers separation. Just ensure the ServiceAccount keeps the correct ClusterRole bindings.

🔍

PHASE 01 — Step 3

Verify the Installation

kubectl — Verify Pods & Resources

# Check descheduler pod is Running:
kubectl get pods -n kube-system | grep descheduler

# See all resources created by the Helm release:
kubectl get all -n kube-system -l app.kubernetes.io/name=descheduler

# Confirm ConfigMap was created:
kubectl get configmap -n kube-system | grep descheduler

✅

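Expected output

You should see a pod named my-release-descheduler-XXXXX with status Running, and a ConfigMap named my-release-descheduler. This guide assumes the descheduler runs as a Deployment (Helm value kind=Deployment), which re-runs the policy on a fixed interval (deschedulingInterval, 5 minutes by default). If the release was installed with the chart's CronJob kind instead, you will see a CronJob and short-lived job pods rather than one long-running pod.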
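The restart and log commands later in this guide address a Deployment, so it can be worth confirming which workload kind the release created. A minimal sketch, assuming the chart's standard app.kubernetes.io/name label; descheduler_workloads is a hypothetical helper name, and the KUBECTL variable exists only so the binary can be swapped or stubbed.

```shell
# KUBECTL can be overridden, e.g. to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# descheduler_workloads [NAMESPACE]
# Lists Deployments and CronJobs carrying the descheduler label.
descheduler_workloads() {
  "$KUBECTL" get deployment,cronjob \
    --namespace "${1:-kube-system}" \
    -l app.kubernetes.io/name=descheduler
}
```

If only a CronJob is listed, the `kubectl rollout restart deployment/...` step will not find anything to restart.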

📄

PHASE 02 — Step 1

View the Default ConfigMap

After installation, a ConfigMap is created with the default policy. This is what runs out of the box — before you make any changes.

Shell — Open Default Config

kubectl -n kube-system get configmap my-release-descheduler -o yaml

Default ConfigMap — Full Content

apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
    - name: default
      pluginConfig:
      - args:
          podProtections:
            defaultDisabled:
            - PodsWithLocalStorage
            extraEnabled:
            - PodsWithPVC
        name: DefaultEvictor
      - name: RemoveDuplicates
      - args:
          includingInitContainers: true
          podRestartThreshold: 100
        name: RemovePodsHavingTooManyRestarts
      - args:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
        name: RemovePodsViolatingNodeAffinity
      - name: RemovePodsViolatingNodeTaints
      - name: RemovePodsViolatingInterPodAntiAffinity
      - name: RemovePodsViolatingTopologySpreadConstraint
      - args:
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
        name: LowNodeUtilization
      plugins:
        balance:
          enabled:
          - RemoveDuplicates
          - RemovePodsViolatingTopologySpreadConstraint
          - LowNodeUtilization
        deschedule:
          enabled:
          - RemovePodsHavingTooManyRestarts
          - RemovePodsViolatingNodeTaints
          - RemovePodsViolatingNodeAffinity
          - RemovePodsViolatingInterPodAntiAffinity
kind: ConfigMap
metadata:
  name: my-release-descheduler
  namespace: kube-system
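To read just the embedded policy without the ConfigMap wrapper, a jsonpath query is one option. This is a sketch under the naming used in this guide: show_policy is a hypothetical helper name, and the defaults assume the my-release release in kube-system.

```shell
# KUBECTL can be overridden, e.g. to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# show_policy [CONFIGMAP] [NAMESPACE]
# Prints only the policy.yaml key; the dot in the key name must be escaped.
show_policy() {
  "$KUBECTL" get configmap "${1:-my-release-descheduler}" \
    --namespace "${2:-kube-system}" \
    -o jsonpath='{.data.policy\.yaml}'
}
```

Redirecting the output (`show_policy > policy.yaml`) gives you a local copy to diff against later edits.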

🔍

PHASE 02 — Step 2

What the Default Config Does

The default config is conservative by design. Its thresholds are very low (20%): a node counts as underutilised only when it is below 20% on CPU, memory, and pod count, so it has to be nearly empty before pods are moved away from a busy node. For most real clusters with memory pressure, this is insufficient.

DefaultEvictor

Guard

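This is the safety gate for all other plugins. In the default config the PodsWithLocalStorage protection is switched off (defaultDisabled), so pods with local storage can be evicted, and the PodsWithPVC protection is switched on (extraEnabled), so pods with PVCs are not evicted.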

RemoveDuplicates

Balance

Evicts duplicate pods (same owner) that land on the same node — leaves one, evicts the rest so they reschedule elsewhere.

RemovePodsHavingTooManyRestarts

Deschedule

Evicts pods that have restarted more than 100 times. Helps crashlooping pods escape a bad node.

RemovePodsViolatingNodeAffinity

Deschedule

Evicts pods that no longer satisfy their requiredDuringSchedulingIgnoredDuringExecution node affinity rules (e.g. the node label was removed).

RemovePodsViolatingNodeTaints

Deschedule

Evicts pods running on nodes they should no longer be on after new taints were added post-scheduling.

RemovePodsViolatingInterPodAntiAffinity

Deschedule

Evicts pods co-located with pods they have anti-affinity rules against — fixes violations that arose after pod placement.

RemovePodsViolatingTopologySpreadConstraint

Balance

Evicts pods that violate existing topologySpreadConstraints — the runtime enforcement partner to your spread config.

LowNodeUtilization

Balance

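Evicts pods from overutilised nodes (above any targetThresholds value) so the scheduler can place them on underutilised ones (below all thresholds values). The key plugin for memory pressure, with the caveat that by default it measures pod requests rather than live usage.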

⚠️

Default thresholds are too conservative for most clusters

With thresholds of 20 for cpu/memory/pods and targetThresholds of 50, eviction only triggers if one node is under 20% on all three resources AND another node is over 50% on at least one. In a busy cluster where most nodes run at 60–90%, no node ever counts as underutilised, so this almost never fires. You need to raise the thresholds to match your actual usage.

✏️

PHASE 02 — Step 3

Apply the Custom Configuration

Edit the ConfigMap directly with kubectl. The key change is raising the thresholds to match real-world node utilisation. These values are percentages of each node's allocatable resources claimed by pod requests, not VM-level CPU/RAM percentages.

💡

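Important: these are pod request percentages, not VM metrics

thresholds.cpu: 90 means "a node can only count as underutilised if the CPU requests of its pods add up to less than 90% of allocatable CPU" (it must also be below the memory and pods thresholds). By default LowNodeUtilization looks at what pods request, not at live usage or the raw VM CPU reading, so kubectl top nodes (which reports live usage) can disagree with what the descheduler sees. Tune these after checking the "Allocated resources" section of kubectl describe node for each worker.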

Shell — Open ConfigMap for Editing

kubectl -n kube-system edit configmap my-release-descheduler

Replace the entire policy.yaml value with the following custom configuration:

Custom ConfigMap — Production-Tuned Config

apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"

    # Exclude control-plane / master nodes from descheduling.
    # "!key" selects nodes WITHOUT the label. The role labels normally carry
    # an empty value, so a "!=true" comparison would not exclude those nodes.
    nodeSelector: "!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane"

    profiles:
      - name: default
        pluginConfig:
        - name: "DefaultEvictor"
          args:
            nodeFit: true    # only evict if the pod fits on another node
        - name: "RemovePodsViolatingInterPodAntiAffinity"
        - name: "RemoveDuplicates"
        - name: "LowNodeUtilization"
          args:
            thresholds:
              cpu: 90      # node < 90% CPU requested → can be underutilised (evict TO here)
              memory: 80   # node < 80% memory requested → can be underutilised
              pods: 40     # node < 40% pod count → can be underutilised
            targetThresholds:
              cpu: 90      # node > 90% CPU requested → overutilised (evict FROM here)
              memory: 85   # node > 85% memory requested → overutilised
              pods: 50     # node > 50% pod count → overutilised

        plugins:
          balance:
            enabled:
              - "LowNodeUtilization"
              - "RemoveDuplicates"
          deschedule:
            enabled:
              - "RemovePodsViolatingInterPodAntiAffinity"
kind: ConfigMap
metadata:
  name: my-release-descheduler
  namespace: kube-system
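If you would rather keep the policy in a local file than paste it into an interactive editor, the ConfigMap can be regenerated from that file and applied in one pipeline. This is a sketch, not the only way to do it: apply_policy is a hypothetical helper name, the defaults assume the my-release release in kube-system, and kubectl may print a warning about a missing last-applied annotation the first time.

```shell
# KUBECTL can be overridden, e.g. to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# apply_policy FILE [NAMESPACE] [CONFIGMAP]
# Renders a ConfigMap with FILE stored under the policy.yaml key, then applies it.
apply_policy() {
  policy_file="$1"
  namespace="${2:-kube-system}"
  configmap="${3:-my-release-descheduler}"

  # Refuse to continue when the policy file does not exist.
  if [ ! -f "$policy_file" ]; then
    echo "policy file not found: $policy_file" >&2
    return 1
  fi

  # --dry-run=client renders the manifest locally; apply sends it to the cluster.
  "$KUBECTL" create configmap "$configmap" \
    --namespace "$namespace" \
    --from-file=policy.yaml="$policy_file" \
    --dry-run=client -o yaml |
    "$KUBECTL" apply -f -
}
```

Here the file holds only the DeschedulerPolicy document (everything under `policy.yaml: |`), for example `apply_policy ./policy.yaml`.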

⚖️

PHASE 02 — Step 4

Default vs Custom — Side-by-Side

Here's exactly what changed between the two configs and why each change matters.

⛔ Default Config

nodeSelector not set — control-plane nodes included

nodeFit not set — may evict with nowhere to go

thresholds cpu/mem/pods → 20 / 20 / 20 (very conservative)

targetThresholds → 50 / 50 / 50

Many plugins enabled — TooManyRestarts, NodeAffinity, NodeTaints, TopologySpread, plus balance plugins

PodsWithLocalStorage protection disabled (those pods can be evicted); PodsWithPVC protection enabled

✅ Custom Config

nodeSelector excludes master + control-plane nodes explicitly

nodeFit: true — only evicts if a suitable destination node exists

thresholds cpu/mem/pods → 90 / 80 / 40 (matches real-world usage)

targetThresholds → 90 / 85 / 50

Focused plugin set — only LowNodeUtilization, RemoveDuplicates, InterPodAntiAffinity

Simpler, lower risk — fewer surprise evictions in production

Why These Changes Improve Production Stability

🎯 Realistic Thresholds

Default 20/50 barely ever fires on busy clusters. 90/85 matches actual node usage — eviction triggers when nodes are genuinely overloaded.

🛡️ nodeFit: true

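Without this, a pod can be evicted and then fail to reschedule (Pending). With nodeFit: true the descheduler first checks that the pod fits on at least one other node (node selector, tolerations, node affinity, and requested resources). It is a best-effort check; the scheduler still makes the final placement decision.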

🚫 Control-Plane Excluded

The nodeSelector prevents the descheduler from touching control-plane nodes — critical in clusters where masters also run workloads.

🔌 Focused Plugin Set

Fewer plugins means fewer unexpected evictions. Start minimal and add plugins back one by one once you understand the eviction behaviour on your cluster.

📊

PHASE 02 — Step 5

Understanding LowNodeUtilization Thresholds

LowNodeUtilization uses two threshold bands. A node is underutilised if it is below all thresholds values. A node is overutilised if it exceeds any targetThresholds value. Pods are evicted from overutilised nodes and land on underutilised ones.
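The two-band rule can be sketched as a small shell function. This is an illustration of the rule as described above, not the descheduler's own code: classify_node is a hypothetical name, the inputs are whole-number percentages, the limits are copied from the custom config in this guide, and the handling of values exactly on a boundary is an assumption.

```shell
# Limits copied from the custom config in this guide.
LOW_CPU=90;  LOW_MEM=80;  LOW_PODS=40    # thresholds
HIGH_CPU=90; HIGH_MEM=85; HIGH_PODS=50   # targetThresholds

# classify_node CPU MEM PODS   (whole-number percentages)
classify_node() {
  cpu="$1"; mem="$2"; pods="$3"
  # Overutilised: ANY resource exceeds its targetThresholds value.
  if [ "$cpu" -gt "$HIGH_CPU" ] || [ "$mem" -gt "$HIGH_MEM" ] || [ "$pods" -gt "$HIGH_PODS" ]; then
    echo "overutilised"
  # Underutilised: ALL resources are below their thresholds value.
  elif [ "$cpu" -lt "$LOW_CPU" ] && [ "$mem" -lt "$LOW_MEM" ] && [ "$pods" -lt "$LOW_PODS" ]; then
    echo "underutilised"
  else
    echo "appropriately utilised"
  fi
}
```

For example, `classify_node 50 50 45` prints "appropriately utilised": CPU and memory are low, but the pod count is not below 40%, so the node does not qualify as a destination.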

⚠️

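These are NOT VM resource percentages

cpu: 90 does not mean 90% of the host machine's CPU. It means 90% of the node's allocatable CPU as seen by Kubernetes, computed by default from pod requests rather than live usage. Always baseline with the "Allocated resources" section of kubectl describe node before setting these values; kubectl top nodes shows live usage, which can differ from what the descheduler sees.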

Custom Threshold Visualisation

CPU: threshold 90% (underutilised below), targetThreshold 90% (overutilised above)
Memory: threshold 80% (underutilised below), targetThreshold 85% (overutilised above)
Pods: threshold 40% (underutilised below), targetThreshold 50% (overutilised above)

Resource	Threshold	TargetThreshold	Meaning
CPU	90%	90%	Above 90% CPU marks a node overutilised; a node must be below 90% to count as underutilised
Memory	80%	85%	Above 85% memory marks a node overutilised; a node must be below 80% to count as underutilised
Pods	40%	50%	Above 50% pod capacity marks a node overutilised; a node must be below 40% to count as underutilised

💡

How to baseline your thresholds

Run kubectl describe nodes and note the requested CPU% and memory% under "Allocated resources" for each worker node, then average them. Set targetThresholds about 5–10% above your average. Set thresholds about 10–15% below your average. This leaves an eviction band wide enough to avoid thrashing.
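The baselining rule of thumb is simple arithmetic, sketched here for one resource at a time. suggest_band is a hypothetical helper name; the default offsets of 10 below and 10 above are just one pick from the ranges given above, and results are clamped to 0..100.

```shell
# suggest_band AVG [BELOW] [ABOVE]
# AVG is the average percentage for one resource across worker nodes.
suggest_band() {
  avg="$1"; below="${2:-10}"; above="${3:-10}"
  low=$((avg - below))
  high=$((avg + above))
  # Clamp to the valid percentage range.
  if [ "$low" -lt 0 ]; then low=0; fi
  if [ "$high" -gt 100 ]; then high=100; fi
  echo "thresholds=$low targetThresholds=$high"
}
```

With an average of 70%, `suggest_band 70` prints `thresholds=60 targetThresholds=80`. Treat the output as a starting point and adjust after watching a few descheduling cycles.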

🔄

PHASE 03 — Step 1

Rollout Restart to Apply Config Changes

After editing the ConfigMap, the running descheduler pod does not automatically pick up the new config. You must restart the Deployment to reload the ConfigMap.

kubectl — Restart the Descheduler

kubectl rollout restart deployment/my-release-descheduler \
  -n kube-system

# Watch the rollout complete:
kubectl rollout status deployment/my-release-descheduler \
  -n kube-system

⚠️

Always restart after ConfigMap changes

Editing the ConfigMap alone has no effect. The descheduler reads the policy only at startup. A rollout restart creates a new pod that loads the updated config from scratch.

📋

PHASE 03 — Step 2

Watch the Descheduler Logs

The descheduler logs are your window into what's actually happening. You'll see node utilisation readings, how many pods match each plugin's criteria, and explicit lines for every pod that gets evicted.

kubectl — Tail Descheduler Logs

kubectl -n kube-system logs deployment/my-release-descheduler -f

What to Look For in the Logs

Example Log Output (annotated and simplified; real lines carry klog timestamps and source locations)

# Descheduler prints node stats each cycle:
I Node "node-1" is over-utilized with usage: map[cpu:91 memory:88 pods:52]
I Node "node-2" is under-utilized with usage: map[cpu:42 memory:35 pods:18]
I Node "node-3" is appropriately utilized

# A pod getting evicted from the overloaded node:
I Evicted pod: "my-app-6d8f9b7c4-xk2pq" in namespace "production" from node "node-1"

# Summary at end of each cycle:
I Number of evicted pods: 3

# If no action needed:
I No pods to evict on node "node-1", skipping
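For a quick tally of evictions, the log stream can be piped through a small filter. This sketch matches the "Evicted pod" wording from the example lines above, which is an assumption: the exact log text varies between descheduler versions, so check a real log line before relying on it. count_evictions is a hypothetical helper name.

```shell
# count_evictions: reads log text on stdin, prints how many lines report an eviction.
count_evictions() {
  # grep -c prints the count but exits non-zero when it is 0, so mask the status.
  grep -c 'Evicted pod' || true
}
```

Typical use would be `kubectl -n kube-system logs deployment/my-release-descheduler --since=1h | count_evictions`.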

💡

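Seeing 0 evictions?

This is often fine: it means your thresholds are not being breached. Eviction needs at least one underutilised node and one overutilised node at the same time. If you expect evictions but see none, compare the requested percentages under "Allocated resources" in kubectl describe nodes against your thresholds and targetThresholds values. The gap may be too wide.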

⚠️

Seeing too many evictions?

If pods are being evicted and rescheduled repeatedly (thrashing), your thresholds and targetThresholds bands are too close together. Increase the gap between them, or temporarily set whenUnsatisfiable: ScheduleAnyway in your topology spread constraints while you tune.

📚

REFERENCE

Full Plugin Reference

Plugin	Type	Default	What It Does
LowNodeUtilization	balance	✓ ON	Evicts pods from over-used nodes to under-used ones based on CPU/memory/pod thresholds
RemoveDuplicates	balance	✓ ON	Spreads pods from the same ReplicaSet/Deployment that landed on the same node
RemovePodsViolatingTopologySpreadConstraint	balance	✓ ON	Enforces topologySpreadConstraints at runtime — evicts violating pods
RemovePodsViolatingInterPodAntiAffinity	deschedule	✓ ON	Evicts pods co-located with pods they have anti-affinity rules against
RemovePodsViolatingNodeTaints	deschedule	✓ ON	Evicts pods running on nodes they'd now be blocked from if scheduled fresh
RemovePodsViolatingNodeAffinity	deschedule	✓ ON	Evicts pods that violate their required node affinity rules (post-schedule)
RemovePodsHavingTooManyRestarts	deschedule	✓ ON	Evicts crashlooping pods (restart count > threshold) to let them reschedule on a better node
PodLifeTime	deschedule	OFF	Evicts pods older than a configured age — useful for refresh-based workloads

✅

SECTION 08

Final Outcome

With the descheduler running alongside topologySpreadConstraints, your cluster now has both proactive and reactive pod placement: constraints shape placement at scheduling time, and the descheduler corrects drift afterwards.

⚖️ What You've Built

🔄 Continuous automatic rebalancing

🔥 Memory pressure handled before OOMKill

🚫 Control-plane nodes excluded from eviction

🛡️ nodeFit prevents orphaned Pending pods

📋 ConfigMap-driven tuning: edit the policy and restart the Deployment, no Helm redeploy needed

🔌 Active Plugins

✅ LowNodeUtilization

✅ RemoveDuplicates

✅ RemovePodsViolatingInterPodAntiAffinity

✅ DefaultEvictor (nodeFit: true)

🔲 Others available when needed

Key Takeaways

Descheduler ≠ Scheduler — it evicts, the scheduler re-places. They work as a team.
Always restart after ConfigMap edits — the pod reads config only at startup
Thresholds are percentages of allocatable resources requested by pods, not VM CPU/RAM readings; baseline with kubectl describe nodes
Start with nodeFit: true — prevents eviction storms on small clusters
Pair with topologySpreadConstraints — they solve different halves of the same problem
Watch logs after each config change — look for eviction count and node utilisation lines

🔜

Next Step: Add PodDisruptionBudgets

To ensure evictions don't take down your service, define a PodDisruptionBudget for each Deployment. The descheduler evicts through the Kubernetes Eviction API, which refuses any eviction that would violate the budget, so the minimum number of replicas you declare stays up while pods are moved.
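A PodDisruptionBudget can be created straight from the command line. The sketch below wraps `kubectl create poddisruptionbudget`; create_pdb is a hypothetical helper name, and the PDB name, label selector, replica count, and namespace in the usage line are placeholders to replace with your own.

```shell
# KUBECTL can be overridden, e.g. to point at a wrapper script.
KUBECTL="${KUBECTL:-kubectl}"

# create_pdb NAME SELECTOR MIN_AVAILABLE [NAMESPACE]
# Creates a budget that keeps at least MIN_AVAILABLE matching pods running.
create_pdb() {
  "$KUBECTL" create poddisruptionbudget "$1" \
    --selector="$2" \
    --min-available="$3" \
    --namespace "${4:-default}"
}
```

For example, `create_pdb my-app-pdb app=my-app 2 production` keeps at least two pods labelled app=my-app available during evictions.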

Kubernetes Descheduler —
Rebalance Overloaded Nodes Automatically
