A comprehensive tale of migrating a production AWS Kubernetes cluster with 6,000+ resources, 46 CRDs, 7 SSL certificates, and 12 namespaces, all with zero downtime
Introduction: The Challenge Ahead
Upgrading a production-grade Kubernetes cluster is never a walk in the park—especially when it spans multiple environments, critical workloads, and tight deadlines.
So when it was time to migrate a client's 3-4 year old Amazon EKS cluster from v1.26 to v1.33, I knew it wouldn't just be a version bump—it would be a battlefield.
This cluster wasn't just any cluster—it was a complex ecosystem running critical healthcare applications with:
- 46 Custom Resource Definitions (CRDs) across multiple systems
- 7 production domains with SSL certificates
- Critical data in PostgreSQL databases
- Zero downtime tolerance for production services
- Complex networking with Istio service mesh
- Monitoring stack with Prometheus and Grafana
This is the story of how we successfully migrated this beast using a hybrid approach, the challenges we faced, and the lessons we learned along the way.
Chapter 1: The Reconnaissance Phase
Mapping the Battlefield
Before diving into the migration, we needed to understand exactly what we were dealing with. There was no GitOps and no manifest files; all we had was AWS access, Lens, and an outdated cluster that needed to be upgraded.
Kubernetes enforces a strict version skew policy, especially when you’re using managed services like Elastic Kubernetes Service (EKS).
The control plane must always be at most one minor version ahead of the kubelets (worker nodes), and all supporting tools—kubeadm, kubelet, kubectl, and add-ons—must respect the same version skew policy.
So what does this mean?
- If your control plane is running v1.33, your worker nodes can only be on v1.32 or v1.33. Nothing lower.
- And no, you can’t jump straight from v1.26 to v1.33. You must upgrade sequentially: v1.26 → v1.27 → v1.28 → ... → v1.33
Each upgrade step? A potential minefield of broken dependencies, deprecated APIs, and mysterious behavior.
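In practice, each hop on EKS means bumping the control plane first, then rolling the node groups and add-ons to match. A minimal sketch with the AWS CLI, assuming placeholder cluster and node group names:
# One hop at a time: bump the control plane, wait, then roll the nodes (placeholder names)
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.27
aws eks wait cluster-active --name my-cluster

aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-nodes
aws eks wait nodegroup-active --cluster-name my-cluster --nodegroup-name my-nodes
# ...then repeat for 1.28, 1.29, and so on up to 1.33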
💀 The Aging Cluster
The cluster I inherited was running Kubernetes v1.26—with some workloads and CRDs that hadn’t been touched in about 4 years.
It was ancient. It was fragile. And it was about to get a rude awakening.
🧪 First Attempt: The “By-the-Book” Upgrade
I tried to play nice.
The goal: upgrade the cluster manually, step by step, from v1.26 all the way to v1.33.
But the moment I moved from v1.26 → v1.27, the floodgates opened:
- Pods crashing from all directions
- Incompatible controllers acting out
- Deprecation warnings lighting up the logs like Christmas trees
Let’s just say—manual upgrades were off the table.
🛠️ Second Attempt: The Manifest Extraction Strategy
Time to pivot.
The new plan?
Spin up a fresh EKS cluster running v1.33, then lift-and-shift resources from the old cluster.
Step 1: Extract All Resources
From the old cluster I ran:
kubectl get all --all-namespaces -o yaml > all-resources.yaml
Then I backed up other critical components:
- ConfigMaps
- Secrets
- PVCs
- Ingresses
- CRDs
- RBAC
- ServiceAccounts
kubectl get configmaps,secrets,persistentvolumeclaims,ingresses,customresourcedefinitions,roles,rolebindings,clusterroles,clusterrolebindings,serviceaccounts --all-namespaces -o yaml > extras.yaml
Step 2: Apply to the New Cluster
Switched context:
kubectl config use-context <cluster-arn>
And then:
kubectl apply -f all-resources.yaml -f extras.yaml
Boom—in one swoop, everything started deploying into the new cluster.
For a moment, I thought:
“Wow… that was easy. Too easy.”
🚨 Reality Check: The Spaghetti Hit the Fan
After 8 hours of hopeful waiting, the nightmare unfolded:
- CrashLoopBackOff
- ImagePullBackOff
- Pending Pods
- Service Not Reachable
- VolumeMount and PVC errors everywhere
It was YAML spaghetti, tangled and broken.
The old cluster’s legacy configurations simply did not translate cleanly to the modern version.
And now, I had to dig in deep—resource by resource, namespace by namespace—to rebuild sanity. I had neither the time nor the luxury for that.
⚙️ Third Attempt: Enter Velero
The next strategy? Use Velero.
Install it in the old cluster, run a full backup, switch contexts, and restore everything into the shiny new v1.33 cluster.
Simple, right?
Not quite.
Velero pods immediately got stuck in Pending.
Why?
- Insufficient resources in the old cluster
- CNI-related issues that blocked network provisioning
So instead of backup and restore magic, I found myself deep in another rabbit hole.
🧠 Fourth Attempt: Organized Manifest Extraction — The Breakthrough
Out of frustration, I raised the issue during a session in the AWS DevOps Study Group.
That’s when Theo and Jaypee stepped in with game-changing advice:
“Forget giant YAML dumps. Instead, extract manifests systematically, grouped by namespace and resource type. Organize them in folders. Leverage Amazon Q in VS Code to make sense of the structure.”
It was a lightbulb moment💡.
I restructured the entire migration approach based on their idea—breaking down the cluster into modular, categorized directories.
It brought clarity, control, and confidence back to the process.
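Here is roughly what that systematic extraction looked like; a sketch, where the resource-type list and folder layout are illustrative rather than the exact script we used:
# Sketch: dump manifests per namespace and per resource type into a folder tree
for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
  for kind in deployments statefulsets services configmaps secrets \
              serviceaccounts roles rolebindings ingresses persistentvolumeclaims; do
    mkdir -p "manifests/$ns/$kind"
    for name in $(kubectl get "$kind" -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
      kubectl get "$kind" "$name" -n "$ns" -o yaml > "manifests/$ns/$kind/$name.yaml"
    done
  done
done
This per-namespace, per-kind layout is what made the later ordered deployment possible.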
📦 The CRD Explosion
Once things were neatly organized, the real scale of the system came into focus.
Major CRDs We Had to Handle:
- Istio Service Mesh: 12 CRDs managing traffic routing and security
- Prometheus/Monitoring: 8 CRDs for metrics and alerting
- Cert-Manager: 7 CRDs handling SSL certificate automation
- Velero Backup: 8 CRDs for disaster recovery
- AWS Controllers: 11 CRDs for cloud integration
🧮 Total: 46 CRDs — each one a potential migration minefield
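If you need to take a similar inventory, a couple of one-liners along these lines do the job (a sketch):
# How many CRDs are we dealing with, and which systems own them?
kubectl get crd --no-headers | wc -l
kubectl get crd -o jsonpath='{range .items[*]}{.spec.group}{"\n"}{end}' | sort | uniq -c | sort -rn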
🔍 Custom Resources Inventory
Beyond the CRDs themselves, the custom resources were no less intimidating:
- 11+ TLS Certificates across multiple namespaces
- 6+ ServiceMonitors for Prometheus scraping
- Multiple PrometheusRules for alerting
- VirtualServices and DestinationRules for Istio routing
The message was clear:
This wasn’t a “one-file kubectl apply” kind of migration.
✅ API Compatibility Victory
With the structure in place, we ran API compatibility checks using Pluto and a custom script generated via Amazon Q in VS Code:
./scripts/api-compatibility-check.sh
Result:
✅ No deprecated or incompatible API versions found.
A small win—but a huge morale boost in a complex migration journey.
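The script itself isn't reproduced here, but a minimal sketch of that kind of check with Pluto could look like this (the target version flag and paths are assumptions):
#!/bin/bash
# api-compatibility-check.sh (sketch): scan extracted manifests and Helm releases
# for APIs removed or deprecated in the target version
set -euo pipefail

TARGET="k8s=v1.33.0"   # assumed target version

pluto detect-files -d manifests/ --target-versions "$TARGET" -o wide
pluto detect-helm --target-versions "$TARGET" -o wide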
📦 Chapter 2: The Data Dilemma
💡 Choosing Our Weapon: Manual EBS Snapshots
When it came to migrating persistent data, we faced a critical decision. Several options were on the table:
- Velero backups – our usual go-to, but ruled out due to earlier issues with pod scheduling and CNI errors.
- Database dumps – possible, but slow, error-prone, and fragile under pressure.
- Manual EBS snapshots – low-level, reliable, and simple
After weighing the risks, we went old-school with manual EBS snapshots.
They offered direct access to data volumes with minimal tooling—and in a high-stakes migration, simplicity is a virtue.
Sometimes, the old ways are still the best ways.
🛠️ Automation to the Rescue
To streamline the snapshot process, I wrote a simple backup script:
./scripts/manual-ebs-backup.sh
It handled the tagging and snapshot creation for each critical volume, ensuring traceability and rollback capability.
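The script isn't shown in full, but its core was little more than this (a sketch; the volume IDs and tags below are placeholders):
#!/bin/bash
# manual-ebs-backup.sh (sketch): snapshot each critical volume with traceable tags
set -euo pipefail

VOLUMES="vol-aaaa1111bbbb2222c vol-dddd3333eeee4444f"   # placeholder volume IDs

for vol in $VOLUMES; do
  aws ec2 create-snapshot \
    --volume-id "$vol" \
    --description "pre-migration backup of $vol" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=purpose,Value=eks-migration},{Key=source-volume,Value=$vol}]"
done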
🔐 Critical Volumes Backed Up
Here are some of the most important data volumes we preserved:
- tools/pgadmin-pgadmin4 → snap-06257a13c49e125b1
- sonarqube/data-sonarqube-postgresql-0 → snap-0e590f608a631fcc3
Each snapshot became a lifeline, preserving vital stateful components of our workloads as we prepped the new cluster.
🏗️ Chapter 3: Building the New Kingdom
Once the old cluster was archived and dissected, it was time to construct the new realm—clean, modern, and battle-hardened.
⚙️ The Foundation: CRD Installation Order Matters
One of the most overlooked but mission-critical lessons we learned during this journey:
The order in which you install your CRDs can make or break your cluster.
Install them in the wrong sequence, and you’ll find yourself swimming in cryptic errors, broken controllers, and cascading failures that seem to come from nowhere.
After a lot of trial and error (Istio in particular gave me a lot of trouble), I landed on a battle-tested CRD deployment sequence:
# 1. Cert-Manager (many other components rely on it for TLS provisioning)
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true
# 2. Monitoring Stack (metrics, alerting, dashboards)
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# 3. AWS Integration (Load balancer controller, IAM roles, etc.)
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system
# 4. Service Mesh (Istio control plane)
istioctl install --set values.defaultRevision=default
🧘 **Pro Tip:** After each installation, wait until the operator and all dependent pods are fully healthy before continuing.
Kubernetes is fast… but rushing this step will cost you hours down the line.
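In practice that meant pausing between Helm installs with something like this (a sketch; verify the deployment names in your own cluster):
# Block until cert-manager is actually serving before installing anything that depends on it
kubectl wait --for=condition=Available deployment/cert-manager -n cert-manager --timeout=300s
kubectl wait --for=condition=Available deployment/cert-manager-webhook -n cert-manager --timeout=300s

# Then watch the next stack settle the same way before moving on
kubectl get pods -n monitoring -w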
🧬 Data Resurrection: Bringing Back Our State
With the new infrastructure laid out, it was time to resurrect the lifeblood of the platform—its data.
Using our EBS snapshots from earlier, we restored the volumes and re-attached them to their rightful claimants:
bash scripts/restore-ebs-volumes.sh
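Under the hood, the restore amounted to creating volumes from the snapshots and handing them back to Kubernetes as pre-provisioned PersistentVolumes. A sketch, where the size, zone, and storage class are placeholders and the EBS CSI driver is assumed:
#!/bin/bash
# restore-ebs-volumes.sh (sketch): recreate a volume from its snapshot, then hand it
# back to Kubernetes as a pre-provisioned PersistentVolume (EBS CSI driver assumed)
set -euo pipefail

SNAPSHOT_ID="snap-06257a13c49e125b1"   # from the backup phase
AZ="eu-central-1a"                     # placeholder availability zone

VOLUME_ID=$(aws ec2 create-volume \
  --snapshot-id "$SNAPSHOT_ID" \
  --availability-zone "$AZ" \
  --volume-type gp3 \
  --query 'VolumeId' --output text)

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pgadmin-restored
spec:
  capacity:
    storage: 10Gi              # placeholder size, match the original PVC
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: gp3        # placeholder storage class
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: $VOLUME_ID
EOF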
Restored Volumes:
- tools/pgadmin → vol-0166bbae7bd2eb793
- sonarqube/postgresql → vol-0262e16e1bd5df028
Held my breath… and then—
✅ PersistentVolumes bound successfully
✅ StatefulSets recovered
✅ Pods restarted with their original data
It was official: our new kingdom had data, structure, and a beating heart.
🎭 Chapter 4: The Application Deployment Dance
The Dependency Choreography
Deploying applications in Kubernetes isn’t just about applying YAML files—it’s a delicate choreography of interdependent resources, where the order of execution can make or break your deployment.
Get the sequence wrong, and you’re looking at a cascade of errors:
missing secrets, broken RBAC, unbound PVCs, and pods stuck in limbo.
We approached it like conducting an orchestra—each instrument with its cue.
🪜 Step-by-Step Deployment Strategy
1. Foundation First: ServiceAccounts, ConfigMaps, and Secrets
These are the building blocks of your cluster environment.
No app should be launched before its supporting config and identity infrastructure are in place.
kubectl apply -f manifests/*/serviceaccounts/
kubectl apply -f manifests/*/configmaps/
kubectl apply -f manifests/*/secrets/
2. RBAC: Granting the Right Access
Once identities are in place, we assign the right permissions using Roles and RoleBindings—especially for monitoring and system tools.
kubectl apply -f manifests/monitoring/roles/
kubectl apply -f manifests/monitoring/rolebindings/
⚠️ Lesson: Don’t skip this step or your logging agents and monitoring stack will sit silently—failing without errors.
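A quick sanity check before moving on, using impersonation (the service account name here is only an example):
# Can the monitoring stack's service account actually read what it scrapes?
kubectl auth can-i list pods \
  --as=system:serviceaccount:monitoring:prometheus-stack-kube-prom-prometheus -n monitoring
kubectl auth can-i get endpoints \
  --as=system:serviceaccount:monitoring:prometheus-stack-kube-prom-prometheus -n monitoring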
3. Persistent Storage: Claim Before You Launch
Storage is like the stage on which your stateful applications perform.
We provisioned all PersistentVolumeClaims (PVCs) before deploying workloads to avoid CrashLoopBackOff errors related to missing mounts.
kubectl apply -f manifests/tools/persistentvolumeclaims/
kubectl apply -f manifests/sonarqube/persistentvolumeclaims/
4. Workloads: Let the Apps Take the Stage
With the foundation solid and access configured, it was time to deploy the actual workloads—both stateless and stateful.
kubectl apply -f manifests/tools/deployments/
kubectl apply -f manifests/sonarqube/statefulsets/
# ... and the rest
Status: Applications Deployed and Running ✅
At first glance, everything seemed perfect—pods were green, services were responsive, and dashboards were lighting up.
I exhaled.
But the celebration didn’t last long.
Behind those green pods were networking glitches, DNS surprises, and service discovery issues lurking in the shadows—ready to pounce.
🔐 Chapter 5: The Great SSL Certificate Saga
Just when I thought the migration was complete and everything was running smoothly, the ghost of SSL past returned to haunt.
The Mystery of the Expired Certificates
Just when we thought we were done, we discovered a critical issue:
NAMESPACE NAME CLASS HOSTS PORTS AGE
qaclinicaly bida-fe-clinicaly <none> bida-fe-qaclinicaly.example.net 80,443 59s
At first glance, it looked fine. But a quick curl and browser visit revealed a nasty surprise:
“Your connection is not private”
“This site’s security certificate expired 95 days ago”
Another issue that could have caused panic and confusion—but I stayed calm. We could fix this!
Upon further inspection, every certificate in the cluster was showing:
READY: False
Cert-manager was deployed. The pods were healthy. But nothing was being issued.
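The quickest way to see that state across the board (a sketch):
# Every Certificate in every namespace, with its READY state
kubectl get certificates --all-namespaces
# Drill into one to see why it is stuck
kubectl describe certificate bida-fe-qaclinicaly.example.net-crt -n qaclinicaly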
🔎 The Missing Link: ClusterIssuer
Digging deeper into the logs, I found the root cause:
The ClusterIssuer for Let’s Encrypt was missing entirely.
Without it, Cert-Manager had no idea how to obtain or renew certificates.
Somehow, it had slipped through the cracks during our migration process.
🛠️ The Quick Fix
Recreated the missing ClusterIssuer using the standard ACME configuration:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: good-devops@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
Applied it to the cluster
kubectl apply -f cluster-issuer.yaml
Despite the ClusterIssuer being present and healthy, the certificates still wouldn’t renew. The plot thickened...
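Tracing where issuance was stuck meant walking the cert-manager chain, roughly like this (a sketch):
# Follow the chain: Certificate -> CertificateRequest -> Order -> Challenge
kubectl get certificaterequests,orders,challenges --all-namespaces
kubectl describe challenges -n qaclinicaly
kubectl logs -n cert-manager deploy/cert-manager --tail=50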
⚠️ Chapter 6: The AWS Load Balancer Controller Nightmare
Just when I thought the worst was behind me, the AWS Load Balancer Controller decided to stir up fresh chaos.
🧩 The IAM Permission Maze
The first clue came from the controller logs—littered with authorization errors like this:
"error":"operation error EC2: DescribeAvailabilityZones, https response error StatusCode: 403, RequestID: 3ba25abe-7bb2-4b05-bb33-26fde9696931, api error UnauthorizedOperation: You are not authorized to perform this operation"
That 403 told me everything I needed to know:
The controller lacked the necessary IAM permissions to interact with AWS APIs.
What followed was a deep dive into the AWS IAM Policy abyss—where small misconfigurations can lead to hours of head-scratching and trial-and-error debugging.
🔐 The Fix: A Proper IAM Role and Trust Policy
To get the controller working, I created a dedicated IAM role with the required permissions using Amazon Q, and then annotated the Kubernetes service account to assume it.
# Create the IAM role
aws iam create-role \
--role-name AmazonEKS_AWS_Load_Balancer_Controller \
--assume-role-policy-document file://aws-lb-controller-trust-policy.json
# Attach the managed policy
aws iam attach-role-policy \
--role-name AmazonEKS_AWS_Load_Balancer_Controller \
--policy-arn arn:aws:iam::830714671200:policy/AWSLoadBalancerControllerIAMPolicy
# Annotate the controller's service account in Kubernetes
kubectl annotate serviceaccount aws-load-balancer-controller \
-n kube-system \
eks.amazonaws.com/role-arn=arn:aws:iam::830714671200:role/AmazonEKS_AWS_Load_Balancer_Controller \
--overwrite
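For reference, the trust policy file used above follows the standard IRSA pattern; a sketch, with the account ID and OIDC provider ID as placeholders:
# aws-lb-controller-trust-policy.json (sketch): allow the controller's service account
# to assume the role via the cluster's OIDC provider (IDs below are placeholders)
cat > aws-lb-controller-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLEOIDCID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLEOIDCID:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
        }
      }
    }
  ]
}
EOF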
With the IAM role in place and attached, I expected smooth sailing—but Kubernetes had other plans.
🌐 The Internal vs Internet-Facing Revelation
Even with the right permissions, certificates still weren’t issuing.
Let’s Encrypt couldn’t validate the ACME HTTP-01 challenge—and I soon discovered why.
Running this command:
aws elbv2 describe-load-balancers \
--names k8s-ingressn-ingressn-9a8b080581 \
--region eu-central-1 \
--query 'LoadBalancers[0].Scheme'
Returned:
"internal"
The NGINX ingress LoadBalancer was internal, which made it unreachable from the internet—completely blocking Let’s Encrypt from reaching the verification endpoint.
🛠️ The Fix: Force Internet-Facing Scheme
I updated the annotation on the NGINX controller service:
kubectl annotate svc ingress-nginx-controller \
-n ingress-nginx \
service.beta.kubernetes.io/aws-load-balancer-scheme=internet-facing \
--overwrite
This change recreated the LoadBalancer, this time with internet-facing access.
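Re-running the earlier check against the recreated LoadBalancer confirms the new scheme (a sketch; the LB gets a new name when it is recreated):
aws elbv2 describe-load-balancers \
  --region eu-central-1 \
  --query 'LoadBalancers[?Scheme==`internet-facing`].[LoadBalancerName,DNSName]' \
  --output table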
🌐 Chapter 7: The DNS Migration Challenge
The Automated Solution
Once the internet-facing LoadBalancer was live and SSL certs were flowing, there was still one critical piece left: DNS.
The new LoadBalancer came with a new DNS name, and I had seven production domains that needed to point to it.
Doing this manually in the Route 53 console?
Slow. Risky. Error-prone.
⚙️ The Automated Solution
To avoid mistakes and speed things up, I wrote a script to automate the DNS updates using the AWS CLI.
#!/bin/bash
HOSTED_ZONE_ID="Z037069025V45CB576XJD"
NEW_LB="k8s-ingressn-ingressn-testing12345-9287c75b76ge25zc.elb.eu-central-1.amazonaws.com"
NEW_LB_ZONE_ID="Z3F0SRJ5LGBH90"

update_dns_record() {
  local domain=$1
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$HOSTED_ZONE_ID" \
    --change-batch "{
      \"Changes\": [{
        \"Action\": \"UPSERT\",
        \"ResourceRecordSet\": {
          \"Name\": \"$domain\",
          \"Type\": \"A\",
          \"AliasTarget\": {
            \"DNSName\": \"$NEW_LB\",
            \"EvaluateTargetHealth\": true,
            \"HostedZoneId\": \"$NEW_LB_ZONE_ID\"
          }
        }
      }]
    }"
}
By calling update_dns_record with each domain, I was able to quickly and safely redirect traffic to the new cluster.
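Calling it for each domain was then just a loop (a sketch; the domain list is abbreviated here, the full set follows below):
# Call the function once per production domain (list abbreviated, full set below)
DOMAINS="kafka-dev.example.net pgadmin-dev.example.net sonarqube.example.net"
for domain in $DOMAINS; do
  update_dns_record "$domain"
done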
✅ Domains migrated:
Here are the domains I successfully updated:
kafka-dev.example.net
pgadmin-dev.example.net
sonarqube.example.net
bida-fe-qaclinicaly.example.net
bida-gateway-qaclinicaly.example.net
bida-fe-qaprod.example.net
eduaid-admin-qaprod.example.net
Each one now points to the new LoadBalancer, resolving to the right service in the new EKS cluster.
🏁 Chapter 8: The Final Victory
⚔️ The Moment of Truth
After battling through IAM issues, LoadBalancer headaches, DNS rewiring, and countless YAML files, it all came down to one final moment: Would the certificates issue successfully?
I decided to start fresh and purge any leftover Cert-Manager resources to ensure there were no stale or broken states hanging around:
# Clean slate approach
kubectl delete challenges --all --all-namespaces
kubectl delete orders --all --all-namespaces
kubectl delete certificates --all --all-namespaces
Then I waited.....
Refreshed.....
Checked logs.....
Waited some more....
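While waiting, a watch on the Certificate resources (and a glance at the issuer) shows things flip over in real time:
# Watch certificates flip to READY, and confirm the issuer itself is healthy
kubectl get clusterissuer letsencrypt-prod -o wide
kubectl get certificates --all-namespaces -w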
✅ And Then—Success
NAMESPACE NAME READY SECRET AGE
qaclinicaly bida-fe-qaclinicaly.example.net-crt True bida-fe-qaclinicaly.example.net-crt 3m
qaclinicaly bida-gateway-qaclinicaly.example.net-crt True bida-gateway-qaclinicaly.example.net-crt 3m
qaprod bida-fe-qaprod.example.net-crt True bida-fe-qaprod.example.net-crt 2m59s
qaprod eduaid-admin-qaprod.example.net-crt True eduaid-admin-qaprod.example.net-crt 2m59s
sonarqube sonarqube.example.net-crt True sonarqube.example.net-crt 2m59s
tools kafka-dev.example.net-tls True kafka-dev.example.net-tls 2m59s
tools pgadmin-dev.example.net-tls True pgadmin-dev.example.net-tls 2m59s
ALL 7 CERTIFICATES flipped to READY = True 🎉
📘 Chapter 9: Lessons Learned
🔧 Technical Insights
- CRD Installation Order is Critical: Install core dependencies first. Cert-manager before anything else.
- IAM Permissions are Tricky: Minimal IAM policies might pass linting, but they’ll fail at runtime. Use comprehensive, purpose-built roles.
- LoadBalancer Schemes Matter: The difference between internal and internet-facing can break certificate validation entirely.
- DNS Automation Saves Time and Sanity: Manual Route 53 updates are error-prone. Automate with scripts and avoid the guesswork.
- EBS Snapshots are Underrated: Sometimes the simplest tools are the most reliable. EBS snapshots gave me peace of mind and fast recovery.
🧠 Operational Insights
- Plan for the Unexpected: SSL certificate issues took more time than the core migration itself.
- Automate Early, Automate Often: The scripts I wrote saved hours and helped enforce repeatable processes.
- Document Everything: Every command, every fix, every gotcha—write it down. It pays off when something goes wrong (and it will).
- Be Patient: DNS propagation and cert validation can be slow. Don't panic—just wait.
- Always Have a Rollback Plan: Keeping the old cluster alive gave me confidence to move fast with less fear of failure.
🛠️ Custom Tools That Saved Us
- scripts/update-dns-records.sh: Automated DNS cutover
- scripts/manual-ebs-backup.sh: Fast and reliable data backup
- letsencrypt-clusterissuer.yaml: Enabled SSL cert automation
- Comprehensive IAM policies: Smooth AWS integration with the load balancer controller
📊 Chapter 10: The Final Status
✅ Migration Scorecard
Area | Status |
---|---|
Infrastructure | 46 CRDs and all operators deployed ✅ |
Data Migration | EBS volumes restored successfully ✅ |
DNS Migration | All 7 domains updated ✅ |
SSL Certificates | All validated and active ✅ |
LoadBalancer | Internet-facing and functional ✅ |
Applications | Fully deployed and operational ✅ |
Performance Metrics
- Total Migration Time: ~18 hours (including troubleshooting)
- Downtime: 0 minutes (DNS cutover was seamless)
- Data Loss: 0 bytes
- Certificate Validation Time: 3 minutes (after fixes)
- DNS Propagation Time: 2-5 minutes
Conclusion: The Journey's End
What started as a routine Kubernetes version upgrade turned into an epic journey through the depths of AWS IAM policies, LoadBalancer configurations, and SSL certificate validation. We faced challenges we never expected and learned lessons that will serve us well in future migrations.
The key takeaway? Kubernetes migrations are never just about Kubernetes. They're about the entire ecosystem—DNS, SSL certificates, cloud provider integrations, and all the moving parts that make modern applications work.
Our hybrid approach using manual EBS snapshots proved to be the right choice for our use case. While it required more manual work upfront, it gave us confidence in our data integrity and a clear rollback path.
What's Next?
With our new v1.33 cluster running smoothly, we're already planning for the future:
- Implementing GitOps for better deployment automation
- Enhancing our monitoring and alerting
- Preparing for the next major version upgrade (with better automation!)
Final Words
To anyone embarking on a similar journey: expect the unexpected, automate everything you can, and always have a rollback plan. The path may be challenging, but the destination—a modern, secure, and scalable Kubernetes cluster—is worth every debugging session.
Migration Status: ✅ COMPLETE
The cluster is dead, long live the cluster!