We adopted Kubernetes to streamline our deployment process, enabling faster service launches, reliable feature rollouts, and efficient scaling. Using AWS's managed Kubernetes offering (EKS) initially simplified our infrastructure management, but as our application grew in scale and complexity, we faced several unexpected challenges.

In this post, we'll share insights from our journey, highlighting the scaling challenges we encountered and how we tackled them.

1. Scaling: It's Never Just One Switch

Cluster Autoscaler Wasn't Enough

EKS integrates easily with the Kubernetes Cluster Autoscaler, but that doesn't make it production-ready. We ran into issues like:

  • Slow response to spiky traffic surges
  • Pods stuck in pending state while waiting for node scale-up
  • Over-provisioning as a brute-force workaround to prevent lag

The Cluster Autoscaler works on a reactive loop (it only provisions nodes after pods are already unschedulable), and that loop just wasn't fast enough for our dynamic workloads.

Karpenter Solved One Problem, Unlocked Many Others

Karpenter entered as a game-changer. It's faster, smarter, and provisions capacity directly in response to unschedulable pods.

Wins:

  • Burst scaling improved dramatically
  • Instance types were chosen optimally
  • Cold-start delays reduced significantly

New pain:

  • Required deep understanding of taints, tolerations, affinities
  • Provisioning configs ballooned in complexity
  • Consolidation logic introduced unintended evictions

Karpenter is great — but needs maturity and attention to workload profiles.
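To give a flavor of the configuration surface involved, here is a trimmed sketch of a Karpenter NodePool (field names follow the v1 NodePool API; the name, instance constraints, and limits are illustrative, not our production settings):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  # Consolidation saves money but can evict running pods; tune it against
  # your workloads' tolerance for disruption.
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
  limits:
    cpu: "1000"                    # hard cap on total provisioned vCPU
```

The disruption block is where the "unintended evictions" pain above lives: consolidation settings that look harmless in a test cluster can churn latency-sensitive pods in production.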

Placeholder Pods: Pre-Warming the Cluster

To further reduce cold-start delays and smooth out burst handling, we introduced placeholder pods with low priority.

These are lightweight, idle pods that occupy node capacity. They:

  • Ensure warm nodes are already running
  • Are assigned a low PriorityClass value
  • Get preempted immediately when real workloads arrive
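A minimal sketch of such a setup: a negative-priority PriorityClass plus a deployment of pause containers sized to reserve headroom (names, replica count, and resource sizes are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder              # illustrative name
value: -10                       # lower than any real workload (default is 0)
preemptionPolicy: Never          # placeholders never evict real pods
globalDefault: false
description: "Low-priority pods that keep warm node capacity reserved"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder
spec:
  replicas: 3                    # how much headroom to keep warm
  selector:
    matchLabels: {app: placeholder}
  template:
    metadata:
      labels: {app: placeholder}
    spec:
      priorityClassName: placeholder
      terminationGracePeriodSeconds: 0   # vacate instantly when preempted
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: {cpu: "1", memory: 2Gi}   # capacity to reserve per pod
```

When a real pod arrives and no capacity is free, the scheduler preempts the placeholders first, since every real workload outranks them.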

This technique helped us:

  • Keep enough nodes alive to avoid scale-up lag
  • Avoid wasteful over-provisioning
  • Smoothly absorb bursts by evicting placeholders instantly

Tip: Placeholder pods are a cost-aware alternative to full warm-pool setups.

2. Networking: Where "It Just Works" Breaks Down

Ingress Wars: AWS ALB vs. NGINX

We had two paths:

  • AWS ALB Ingress Controller: Easy start, slow updates, minimal control
  • NGINX Ingress Controller: More power, more flexibility, but full DIY

We chose NGINX for its routing customization support. Initially, we deployed a centralized NGINX Ingress Controller that handled traffic for all applications. This setup gave us a single load balancer fronting multiple ingress definitions, with traffic routed internally by hostname.
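Hostname-based routing under a shared controller looks roughly like this per application (hostnames, service names, and the timeout value are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-a                        # illustrative app
  annotations:
    # Per-app tuning is the flexibility NGINX buys you
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
  ingressClassName: nginx
  rules:
    - host: app-a.example.com        # routed internally by hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-a
                port: {number: 80}
```

Each application ships its own Ingress object like this, and they all resolve to the same load balancer in front of the shared NGINX pods.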

However, this architecture introduced a major pain point: the noisy neighbor effect. A problematic or overloaded application could overwhelm the shared pool of NGINX pods, impacting traffic for every other application.

To isolate failure domains, we split the NGINX Ingress Controller deployments by application category. High-priority applications were given dedicated ingress controllers and node pools, ensuring traffic isolation and better reliability.

But again, maintenance became painful. Monitoring and alerting across multiple NGINX pods led to excessive noise. Logs were hard to correlate, and debugging ingress issues became a time sink.

Meanwhile, we were experimenting with Argo Rollouts for canary deployments, which made native AWS integration increasingly valuable. This led us to revisit our decision and eventually switch back to AWS ALB Ingress Controller. It simplified our setup, aligned better with our GitOps workflows, and reduced operational overhead.

Lesson: NGINX offers flexibility, but managing it at scale can become a platform in itself. Evaluate ingress through the lens of operational simplicity, not just feature set.

503s and Timeouts: Revisiting the Interfaces

When we started seeing intermittent 503 Service Unavailable errors, the instinct was to tune timeouts or bump pod counts. But over time, it became clear: solving for 5xxs requires rethinking the entire chain of interfaces between components.

Where the Interfaces Break:

NGINX ↔ Application Pods

  • Misaligned timeouts: NGINX would timeout before the app had a chance to respond.
  • Keepalive mismanagement: Connections were reused aggressively even when upstream pods weren't ready.
  • Readiness probe gaps: Pods marked "ready" while still initializing critical services.

Application ↔ Node Runtime

  • During termination, apps exited before closing HTTP sockets, resulting in mid-request failures.
  • Lack of preStop handling or misconfigured terminationGracePeriodSeconds.

Ingress ↔ Load Balancer

  • The load balancer kept sending traffic to pods that had already been deregistered by the ALB controller
  • ALB/NLB timeouts weren't aligned with ingress-level and pod-level behavior.

How We Fixed It:

Extended application shutdown logic:

  • Graceful connection teardown in Node.js
  • Delayed SIGTERM handling

Aligned termination settings:

  • preStop hook duration ≈ app shutdown time
  • terminationGracePeriodSeconds set comfortably above the preStop duration plus app shutdown time

Synchronized ALB deregistration delay with pod lifecycle
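Wired together, the pod spec ends up looking roughly like this (durations are illustrative; the invariant is preStop duration + app shutdown time < terminationGracePeriodSeconds):

```yaml
# Illustrative pod template fragment; durations must be tuned per app
spec:
  terminationGracePeriodSeconds: 60   # > preStop sleep + app drain time
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            # Keep serving while NGINX/ALB remove this pod from their
            # endpoints; only after this does SIGTERM reach the app.
            command: ["sleep", "15"]
```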

Lesson: A 503 isn't "just a timeout" — it's a signal that something in your handoff chain is misaligned. Solve 5xxs by aligning every interface: from LB to ingress, from ingress to pod, from pod to app runtime.

DNS: The Silent Killer

CoreDNS came pre-installed, but was far from production-ready:

  • No HPA by default
  • Default ndots:5 led to excessive DNS lookups

Fixes:

  • Patched CoreDNS with an HPA and tuned resource limits
  • Reduced ndots to 2, avoiding redundant resolution attempts
  • As scale increased, moved the CoreDNS deployment to a dedicated node pool
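The ndots change is applied per pod via dnsConfig rather than by patching CoreDNS itself. A sketch:

```yaml
# Illustrative pod template fragment
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # external names resolve on the first attempt instead
                     # of walking every search-domain suffix first
```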

Sample /etc/resolv.conf

search default.svc.cluster.local svc.cluster.local cluster.local ap-xx.compute.internal
nameserver XX.XX.XX.XX
options ndots:2

Takeaway: At scale, every default in the network path is a liability. DNS settings, timeout values, and readiness behavior all need tuning.

3. Application Behavior: Small Decisions, Big Impact

Pod Sizing Matters

Autoscalers depend on accurate CPU/memory requests. We had to iterate heavily:

  • Too low? Throttling and instability
  • Too high? Wasted capacity

We also experimented with running some Node.js workloads without CPU limits to observe their behavior under unrestricted conditions. Interestingly, we didn't face noisy neighbor issues, where one pod consumes excessive CPU and affects other pods on the same node. A single Node.js process is primarily single-threaded and typically maxes out at 1 vCPU, so even without explicit CPU limits, the pods couldn't hog CPU beyond their single-threaded capacity.

We improved this by introducing resource estimation tooling and continuous profiling.
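The shape we converged on for single-process Node.js services looked roughly like this (values are illustrative): requests pinned for scheduling accuracy, a memory limit to contain leaks, and no CPU limit per the experiment above.

```yaml
# Illustrative container resources for a single-process Node.js service
resources:
  requests:
    cpu: "500m"        # what the scheduler and autoscalers plan around
    memory: 512Mi
  limits:
    memory: 768Mi      # contain leaks: an OOM-killed pod beats node pressure
    # no cpu limit: avoids CFS throttling; a single Node.js process
    # tops out near 1 vCPU on its own anyway
```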

Graceful Termination: The Overlooked Bottleneck

We observed 503s during rolling deploys or shuffling of pods. Cause? Pods were terminating before traffic was drained.

Fixes:

  • Added preStop hooks
  • Increased terminationGracePeriodSeconds
  • Ensured NGINX respected termination delays
  • Ensured application (Node.js) gracefully handled SIGTERM, HTTP keep-alive, and connection closures

Note: All these timeouts (probe fail thresholds, preStop, grace period, keepalive) must be coordinated to prevent premature termination.

When behind an ALB, target deregistration delays must also be configured correctly to allow in-flight requests to complete.

HPA Tuning for Workload Types

Workloads scaled too aggressively or too late. We fixed this with:

  • Stabilization windows
  • Per-deployment HPA policies
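The stabilization windows live in the HPA's behavior field (autoscaling/v2; the app name, thresholds, and windows are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-a                       # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-a
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: {type: Utilization, averageUtilization: 60}
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately to bursts
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping after spikes subside
```

Per-deployment policies like this let bursty services scale up instantly while steady services hold replicas longer on the way down.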

Takeaway: Your pods aren't just compute containers. They are entities that need care, observability, and lifecycle tuning.

TL;DR: Key Lessons

Here are some hard-won insights from scaling EKS in production:

Scaling: Karpenter outperforms Cluster Autoscaler for bursty traffic, but it demands precise configuration and workload awareness.

Networking: You'll need to tune DNS (especially ndots and CoreDNS autoscaling), carefully configure ingress timeouts, and ensure your probes match real app readiness.

Application: Pod sizing isn't set-it-and-forget-it. Proper CPU/memory requests, lifecycle hooks, and HPA tuning have a huge impact on reliability.

EKS Reality: EKS manages the control plane, but everything else (scaling, networking, observability) is still yours to get right.

Conclusion

Scaling on EKS taught us that while AWS removes some of the Kubernetes complexity, it still requires serious investment in engineering effort, observability, and tuning. From DNS configs to autoscaling logic, from ingress choices to debugging obscure 5xx errors — each layer had its own story of trial and refinement.

We're continuously learning and refining our setup. The key takeaway? A reliable, production-grade Kubernetes environment doesn't happen by default — it requires deliberate design and engineering.

Have any questions or want to know more? Ask away in the comments — we're happy to share!