We adopted Kubernetes to streamline our deployment process, enabling faster service launches, reliable feature rollouts, and efficient scaling. Using AWS's managed Kubernetes offering (EKS) initially simplified our infrastructure management, but as our application grew in scale and complexity, we faced several unexpected challenges.
In this post, we'll share insights from our journey, highlighting the scaling challenges we encountered and how we tackled them.
1. Scaling: It's Never Just One Switch
Cluster Autoscaler Wasn't Enough
EKS integrates easily with the Kubernetes Cluster Autoscaler, but that doesn't make it production-ready. We ran into issues like:
- Slow response to spiky traffic surges
- Pods stuck in pending state while waiting for node scale-up
- Over-provisioning as a brute-force workaround to prevent lag
Autoscaler works on a reactive loop, and that loop just wasn't fast enough for our dynamic workloads.
Karpenter Solved One Problem, Unlocked Many Others
Karpenter entered as a game-changer. It's faster, smarter, and directly responds to unscheduled pods.
Wins:
- Burst scaling improved dramatically
- Instance types were chosen optimally
- Cold-start delays reduced significantly
New pain:
- Required deep understanding of taints, tolerations, affinities
- Provisioning configs ballooned in complexity
- Consolidation logic introduced unintended evictions
Karpenter is great — but needs maturity and attention to workload profiles.
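One way to tame the unintended evictions from consolidation is to restrict it to empty nodes. A minimal sketch, assuming Karpenter's v1beta1 NodePool API; the name and values are illustrative, not our production config:

```yaml
# Consolidate only empty nodes, so running pods are never
# evicted purely for bin-packing reasons.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    consolidationPolicy: WhenEmpty   # conservative: no live-pod evictions
    consolidateAfter: 60s
```

The trade-off is cost: underutilized but non-empty nodes stick around longer.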
Placeholder Pods: Pre-Warming the Cluster
To further reduce cold-start delays and smooth burst handling, we introduced low-priority placeholder pods.
These are lightweight, idle pods that occupy node capacity. They:
- Ensure warm nodes are already running
- Are assigned a low PriorityClass value
- Get preempted immediately when real workloads arrive

This technique helped us:
- Keep enough nodes alive to avoid scale-up lag
- Avoid wasteful over-provisioning
- Smoothly absorb bursts by evicting placeholders instantly
Tip: Placeholder pods are a cost-aware alternative to full warm-pool setups.
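A minimal sketch of the placeholder setup described above; the priority value, replica count, and resource requests are illustrative:

```yaml
# Low-priority class: any real workload preempts these pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder
value: -1000
globalDefault: false
description: "Placeholder pods that pre-warm capacity and are preempted first"
---
# Idle pods that reserve node capacity (pause does nothing but sleep).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 3
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: placeholder
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```

Sizing the requests to roughly one real workload pod means each preemption frees exactly enough room for one incoming pod.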
2. Networking: Where "It Just Works" Breaks Down
Ingress Wars: AWS ALB vs. NGINX
We had two paths:
- AWS ALB Ingress Controller: Easy start, slow updates, minimal control
- NGINX Ingress Controller: More power, more flexibility, but full DIY
We chose NGINX for its richer routing customization. Initially, we deployed a centralized NGINX Ingress Controller that handled traffic for all applications. This setup gave us the advantage of a single load balancer with multiple ingress definitions, routing internally by hostname.
However, this architecture introduced a major pain point: the noisy neighbor effect. A problematic or overloaded application could overwhelm the shared pool of NGINX pods, impacting traffic for all other applications.
To isolate failure domains, we split the NGINX Ingress Controller deployments by application category. High-priority applications were given dedicated ingress controllers and node pools, ensuring traffic isolation and better reliability.
But again, maintenance became painful. Monitoring and alerting across multiple NGINX pods led to excessive noise. Logs were hard to correlate, and debugging ingress issues became a time sink.
Meanwhile, we were experimenting with Argo Rollouts for canary deployments, which made native AWS integration increasingly valuable. This led us to revisit our decision and eventually switch back to AWS ALB Ingress Controller. It simplified our setup, aligned better with our GitOps workflows, and reduced operational overhead.
Lesson: NGINX offers flexibility, but managing it at scale can become a platform in itself. Evaluate ingress through the lens of operational simplicity, not just feature set.
503s and Timeouts: Revisiting the Interfaces
When we started seeing intermittent 503 Service Unavailable errors, the instinct was to tune timeouts or bump pod counts. But over time, it became clear: solving for 5xxs requires rethinking the entire chain of interfaces between components.
Where the Interfaces Break:
NGINX ↔ Application Pods
- Misaligned timeouts: NGINX would timeout before the app had a chance to respond.
- Keepalive mismanagement: Connections were reused aggressively even when upstream pods weren't ready.
- Readiness probe gaps: Pods marked "ready" while still initializing critical services.
Application ↔ Node Runtime
- During termination, apps exited before closing HTTP sockets, resulting in mid-request failures.
- Lack of preStop handling or misconfigured terminationGracePeriodSeconds.

Ingress ↔ Load Balancer
- The load balancer kept sending traffic to pods the ALB controller had already deregistered
- ALB/NLB timeouts weren't aligned with ingress-level and pod-level behavior.
How We Fixed It:
Extended application shutdown logic:
- Graceful connection teardown in Node.js
- Delayed SIGTERM handling
Aligned termination settings:
- preStop hook duration ≈ app shutdown time
- terminationGracePeriodSeconds longer than the preStop delay plus app shutdown time
Synchronized ALB deregistration delay with pod lifecycle
Lesson: A 503 isn't "just a timeout" — it's a signal that something in your handoff chain is misaligned. Solve 5xxs by aligning every interface: from LB to ingress, from ingress to pod, from pod to app runtime.
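The alignment above can be sketched as a pod spec plus an ALB annotation. All durations are illustrative and must match your app's actual drain time; the annotation is from the AWS Load Balancer Controller:

```yaml
# Pod side: keep serving while endpoints deregister, then shut down.
spec:
  terminationGracePeriodSeconds: 45      # > preStop sleep + app shutdown time
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]     # let LB/endpoints stop routing first
---
# Ingress side: ALB deregistration delay aligned with the pod lifecycle.
metadata:
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
```

The invariant to preserve: grace period > preStop delay + app shutdown time, and the deregistration delay fits inside that window.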
DNS: The Silent Killer
CoreDNS came pre-installed, but was far from production-ready:
- No HPA by default
- Default ndots:5 led to excessive DNS lookups
- As scale increased, we even moved the CoreDNS deployment to a dedicated node pool
Fixes:
- Patched CoreDNS with HPA and tuned resource limits
- Reduced ndots to 2, avoiding redundant resolution attempts

Sample /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local ap-xx.compute.internal
nameserver XX.XX.XX.XX
options ndots:2

Takeaway: At scale, every default in the network path is a liability. DNS settings, timeout values, and readiness behavior all need tuning.
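The ndots reduction shown in the sample above can be applied per pod via dnsConfig; a minimal sketch, with an illustrative image:

```yaml
# Pod-level DNS tuning: caps search-domain expansion at 2 dots,
# avoiding a round of useless lookups per external hostname.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx   # illustrative
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```

Note that services must then be addressed with enough dots (e.g. svc.namespace) to still resolve through the search path.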
3. Application Behavior: Small Decisions, Big Impact
Pod Sizing Matters
Autoscalers depend on accurate CPU/memory requests. We had to iterate heavily:
- Too low? Throttling and instability
- Too high? Wasted capacity
We also experimented with running some Node.js workloads without CPU limits to observe their behavior under unrestricted conditions. Interestingly, we didn't face noisy neighbor issues: a single Node.js process is primarily single-threaded and typically maxes out at 1 vCPU, so even without explicit CPU limits, a pod couldn't hog additional CPU and starve other pods on the same node.
We improved this by introducing resource estimation tooling and continuous profiling.
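The shape this converged to for single-threaded Node.js services can be sketched as follows; the numbers are illustrative, not our production values:

```yaml
resources:
  requests:
    cpu: "500m"        # what the scheduler and autoscalers plan around
    memory: 512Mi
  limits:
    memory: 512Mi      # memory limit bounds leaks
    # no CPU limit: a single-threaded Node.js process tops out near 1 vCPU anyway
```

Keeping requests honest matters more than limits here, since both HPA utilization math and node bin-packing are driven by requests.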
Graceful Termination: The Overlooked Bottleneck
We observed 503s during rolling deploys or shuffling of pods. Cause? Pods were terminating before traffic was drained.
Fixes:
- Added preStop hooks
- Increased terminationGracePeriodSeconds
- Ensured NGINX respected termination delays
- Ensured the application (Node.js) gracefully handled SIGTERM, HTTP keep-alive, and connection closures
Note: All these timeouts (probe fail thresholds, preStop, grace period, keepalive) must be coordinated to prevent premature termination.
When behind an ALB, target deregistration delays must also be configured correctly to allow in-flight requests to complete.
HPA Tuning for Workload Types
Workloads scaled too fast or too late. We fixed this with:
- Stabilization windows
- Per-deployment HPA policies
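A sketch of such a per-deployment policy using the autoscaling/v2 behavior field: fast scale-up, damped scale-down. Thresholds and windows are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately to bursts
      policies:
        - type: Percent
          value: 100                    # at most double per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping on the way down
```

Asymmetric windows like this handle spiky traffic without the replica count oscillating after every burst.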
Takeaway: Your pods aren't just compute containers. They are entities that need care, observability, and lifecycle tuning.
TL;DR: Key Lessons
Here are some hard-won insights from scaling EKS in production:
Scaling: Karpenter outperforms Cluster Autoscaler for bursty traffic, but it demands precise configuration and workload awareness.
Networking: You'll need to tune DNS (especially ndots and CoreDNS autoscaling), carefully configure ingress timeouts, and ensure your probes match real app readiness.
Application: Pod sizing isn't set-it-and-forget-it. Proper CPU/memory requests, lifecycle hooks, and HPA tuning have a huge impact on reliability.
EKS Reality: EKS manages the control plane, but everything else (scaling, networking, observability) is still yours to get right.
Conclusion
Scaling on EKS taught us that while AWS removes some of the Kubernetes complexity, it still requires serious investment in engineering effort, observability, and tuning. From DNS configs to autoscaling logic, from ingress choices to debugging obscure 5xx errors — each layer had its own story of trial and refinement.
We're continuously learning and refining our setup. The key takeaway? A reliable, production-grade Kubernetes environment doesn't happen by default — it requires deliberate design and engineering.
Have any questions or want to know more? Ask away in the comments — we're happy to share!