Now that you've laid a baseline of namespace isolation, quotas, network policies, and Pod Security Admission (if not, see Post 1 of the series), the next layer is reliability. In this post I walk you through the key Kubernetes primitives that help your workloads survive disruptions and evolve safely: probes, PodDisruptionBudgets, topology spread constraints, and rollout strategies. Buckle up for some fun; I hope you enjoy the ride.
Executive Summary
- Use liveness, readiness, and startup probes to let Kubernetes detect and recover from unhealthy application states.
- A PodDisruptionBudget (PDB) ensures voluntary disruptions (e.g. node drain, rolling upgrades) don’t violate your availability SLO.
- TopologySpreadConstraints force pods to be balanced across failure domains (zones, nodes) to reduce blast radius.
- Carefully configure rollout strategies (surge, maxUnavailable) in your Deployment to control downtime vs speed.
- Together, these tools let you design reliability from the start—preventing cascading failures rather than firefighting.
Prereqs
- You already have a Kubernetes cluster with `kubectl` access (as assumed in Post 1).
- You have an existing Deployment (or create one) that you can modify.
- You have at least two nodes (ideally in different zones or failure domains) to test spread constraints.
- You can cordon/drain nodes (`kubectl drain`) to simulate disruption.
Concepts
A. Probes: Liveness / Readiness / Startup
- Definition: Probes are periodic checks (HTTP, TCP, or exec) that Kubernetes runs against containers to detect their health or readiness. Without them, a stuck process can stay "Running" indefinitely, and traffic may be routed to unhealthy pods.
- Best practices:
  - Always include a readiness probe so Service endpoints only include truly ready pods.
  - Use a startup probe for apps with long initialization (so liveness doesn't kill them prematurely).
  - Be conservative: gentle probe intervals and timeouts avoid false negatives under GC or background load.
  - Test locally to find thresholds under load.
Commands / YAML snippets:
```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 2
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 2
startupProbe:
  httpGet:
    path: /ready
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
```
Before → After:
- Before: Pod is “Running” perpetually, even if app crashes internally; Service routes traffic to a dead process.
- After: Kubernetes restarts the pod automatically (liveness), and the pod is only added to load via readiness when healthy.
When to use: Always for production; startup probes if your app has long boot phases.
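To sanity-check your probe settings, watch restart counts and probe-failure events; a minimal check (the pod name is a placeholder):

```bash
# A climbing RESTARTS count usually means the liveness probe is firing
kubectl get pods -w

# Probe failures appear as "Unhealthy" events on the pod
kubectl describe pod <pod-name>
```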
B. PodDisruptionBudget (PDB)
- Definition: A PDB is a policy that defines the minimum number (or fraction) of pods that must remain available during voluntary disruptions. It ensures your system doesn't accidentally violate availability during upgrades, node drains, or autoscaling events.
- Best practices:
  - Use `minAvailable` when you want a floor on availability, or `maxUnavailable` for a cap on disruption.
  - Don't set them too tight (you might block your own updates). Leave wiggle room.
  - Align PDBs with your rolling update strategy (surge / unavailable) to avoid deadlocks.
  - Monitor PDB status (`kubectl get pdb`) to detect stuck updates.
Commands / YAML snippet:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```
Before → After:
- Before: During a node drain, all pods could be disrupted and cause downtime.
- After: The eviction API refuses any eviction that would leave fewer than 2 pods available, so disruptions proceed one pod at a time.
When to use: For any service with more than one replica; optional for batch jobs but still beneficial.
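If you'd rather cap disruption than set an availability floor, here's a sketch of the `maxUnavailable` form mentioned above (illustrative values, same selector):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1   # at most one pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: my-app
```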
C. TopologySpreadConstraints
- Definition: A declaration in the pod spec that controls how Kubernetes spreads pods across failure domains (nodes, zones) to enforce balance. This avoids overconcentration: if one zone or node goes down, you don't lose your entire workload.
- Best practices:
  - Use well-known node labels, e.g. `topology.kubernetes.io/zone` or `kubernetes.io/hostname`.
  - `maxSkew: 1` is a typical starting point (a difference of at most 1 pod across domains).
  - `whenUnsatisfiable`: use `DoNotSchedule` for strict spreading or `ScheduleAnyway` for softer enforcement.
  - Use the same spread constraints on all revisions of your Deployment to maintain consistency.
Commands / YAML snippet:
```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app
```
Before → After:
- Before: All pods may land in one zone or on one node (e.g. the fastest node or the one with the most free resources).
- After: Pods evenly distributed across zones/nodes, reducing risk if one fails.
When to use: For replicated workloads in HA setups, especially multi-zone or multi-node clusters.
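Before relying on spread constraints, confirm your nodes actually carry the topology labels; the `-L` flag adds label columns to the output:

```bash
kubectl get nodes -L topology.kubernetes.io/zone -L kubernetes.io/hostname
```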
D. Rollout Strategies (Surge / maxUnavailable)
- Definition: In a Deployment's `strategy.rollingUpdate`, `maxSurge` is how many extra pods can be created during an upgrade; `maxUnavailable` is how many pods are allowed to drop at once. Together they control the trade-off between speed and availability during upgrades.
- Best practices:
  - Use `maxUnavailable: 0` and `maxSurge: 1` (or more) for zero-downtime rollouts (if resources allow).
  - For batch or low-priority workloads, allow some unavailability for a faster rollout (e.g. 20–30%).
  - Always test with your PDB + spread settings to ensure the upgrade doesn't stall.
Commands / YAML snippet:
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
Before → After:
- Before: The defaults (25% surge / 25% unavailable) may drop too many pods at once, violating your SLO during high load.
- After: You control how many can be updated and how many remain live.
When to use: Always set explicitly rather than relying on defaults; vary per workload type.
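If a rollout stalls or a new version misbehaves, the standard escape hatches are pause, resume, and undo (the deployment name is a placeholder):

```bash
# Pause a rollout mid-flight to investigate
kubectl rollout pause deployment/<name>

# Resume it, or roll back to the previous revision
kubectl rollout resume deployment/<name>
kubectl rollout undo deployment/<name>
```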
Mini-Lab
Let’s walk through building an app that has probes, PDB, and topology spread constraints. Then simulate node disruption and see your reliability in action.
Step 1: Namespace + sample deployment
```bash
kubectl create ns reliability-demo
kubectl config set-context --current --namespace=reliability-demo
```
Deploy a simple app (e.g. HTTP echo) with 3 replicas:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo
  labels: { app: echo }
spec:
  replicas: 3
  selector:
    matchLabels: { app: echo }
  template:
    metadata:
      labels: { app: echo }
    spec:
      containers:
        - name: app
          image: hashicorp/http-echo:0.2.3
          args:
            - "-text=hello"
          ports:
            - containerPort: 5678
```

```bash
kubectl apply -f echo.yaml
kubectl get pods -o wide
```
Step 2: Add probes, PDB, and topology constraints
Edit the above spec (conveniently, http-echo echoes its text on every path, so /health and /live should both return 200):
```yaml
# inside spec.template.spec.containers[0]:
readinessProbe:
  httpGet:
    path: /health
    port: 5678
  initialDelaySeconds: 3
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /live
    port: 5678
  initialDelaySeconds: 10
  periodSeconds: 10

# at the Deployment's spec level:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

# inside spec.template.spec (the pod spec, not the Deployment spec):
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: echo }
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: echo }
```
Create a PDB:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: echo-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: echo
```
Apply both:
```bash
kubectl apply -f modified-echo.yaml
kubectl apply -f pdb-echo.yaml
```
Check status:
```bash
kubectl get pods -o wide
kubectl get pdb
kubectl describe pdb echo-pdb
```
Step 3: Simulate node drain and verify behavior
Pick one node:
```bash
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl cordon $NODE
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data
```

(Don't add `--disable-eviction` here: that flag deletes pods directly instead of going through the eviction API, which bypasses the very PDB you're trying to test.)
Watch pods:
```bash
kubectl get pods -o wide -w
```
Expectations:
- One pod evicted (PDB ensures minAvailable 2 remain).
- New pods rescheduled onto other nodes (respecting topology constraints).
- No downtime for readiness endpoints if requests route to healthy pods.
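In a second terminal you can watch the PDB arithmetic live; with 3 replicas and `minAvailable: 2`, the allowed-disruptions count should read 1 when all pods are healthy and drop to 0 while the evicted pod reschedules:

```bash
kubectl get pdb echo-pdb -w
```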
Restore node:
```bash
kubectl uncordon $NODE
```
Bonus: Trigger a rollout:
```bash
kubectl set image deployment/echo app=hashicorp/http-echo:0.3.0
kubectl rollout status deployment/echo
```
Observe that upgrades are safe, respecting availability and spread.
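You can also inspect the revision history the rollout created (revision numbers and change causes will vary):

```bash
kubectl rollout history deployment/echo
```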
Cheat Sheet Table
| Action | Command / YAML | Purpose / Note |
|---|---|---|
| Add readiness probe | see snippet above | Ensures pod is only traffic-ready when healthy |
| Add liveness probe | see snippet above | Restarts stuck pods |
| Add startup probe | similar pattern as above | Prevents liveness kill during slow init |
| Create PDB | `kubectl apply -f pdb.yaml` | Enforce minimal availability |
| Check PDB | `kubectl get pdb` / `kubectl describe pdb` | Confirm eviction limits |
| Set rollout strategy | `strategy.rollingUpdate` | Control surge / downtime |
| Drain a node | `kubectl drain <node> --ignore-daemonsets` | Simulate voluntary disruption |
| Check rollout | `kubectl rollout status deployment/<name>` | Wait for safe update |
| View pod distribution | `kubectl get pods -o wide` | Inspect zone/node distribution |
Pitfalls & Gotchas
- Misconfigured probes kill healthy pods / cause false positives. Probe timeouts that are too aggressive lead to unnecessary restarts. Tune after load testing.
- PDB deadlocks your upgrades. If your PDB demands `minAvailable = replicas` and you set `maxUnavailable: 0` and `maxSurge: 0`, your rollout cannot make progress. Always allow some headroom (e.g. `maxSurge: 1`) or loosen the PDB.
- Skew violations during rolling updates. Spread constraints sometimes misbehave during updates: the scheduler considers old and new pods together when balancing, so you may see temporary skew. Use `ScheduleAnyway` as a fallback or trigger rescheduling.
- Asymmetric zones or resource imbalance. If one zone has less capacity, strict constraints may block scheduling. Use a softer `whenUnsatisfiable: ScheduleAnyway` or allow more skew.
- Spread constraints don't rebalance after scale-down. When pods are removed, the remaining pods may end up unevenly distributed. Use the Descheduler or manual intervention to rebalance.
- A startup probe that is missing or too permissive weakens liveness protection. Without one, a slow boot may trigger liveness failures; with one that is too lenient, you delay detecting broken pods.
Wrap-up & Bridge to Post 3
With probes, PDBs, topology spread, and rollout control, you now have a robust reliability foundation. Your services will survive node drains, upgrades, and zone outages while satisfying availability SLOs.
In Post 3, we’ll build on this: version upgrades, canary/blue-green deployments, cluster upgrades, and rollback strategies, so you can evolve your system safely under load.
Diagram 1 — Rolling update timeline
Diagram 2 — Zone spread visualization
Drop a comment if you learned something new and share your thoughts. Thank You.