There's a big difference between writing code… and running code in production.
In development, everything works:
- Your APIs respond
- Your database behaves
- Your tests pass
But production?
Production is where assumptions break.
This article is not a checklist. It's a reflection of real production failures I've seen while working with Java, Spring Boot, and microservices, and more importantly, how those failures forced me to rethink how systems should be built.
If you're preparing for interviews or already working in a backend role, this is the layer of understanding that actually matters.

1. The Day Everything Slowed Down (But Nothing Was "Down")
One of the most confusing incidents I encountered wasn't a crash. Nothing obvious failed.
The system was alive, but painfully slow.
- APIs that used to respond in 100ms were now taking 2-3 seconds
- CPU usage looked normal
- No error logs
At first glance, nothing seemed wrong. That's what made it dangerous.
What Was Actually Happening?
After digging deeper (thread dumps, logs, DB metrics), we found the issue:
Database connection pool exhaustion
Each request needed a DB connection. But:
- Some queries were slow
- Some connections weren't released quickly
- Requests started waiting
And then this happened:
- Waiting requests piled up
- Threads got blocked
- Latency exploded
The Real Lesson
We initially thought: "More threads = better performance"
But production taught us: threads without resources = waiting queues
How We Fixed It
We didn't just increase pool size blindly. That would only delay the problem.
Instead:
- Identified slow queries and optimized them
- Added proper indexes
- Tuned HikariCP pool size based on traffic
- Introduced timeouts
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      connection-timeout: 30000
What Changed in My Thinking
Before: "Database is just storage"
After: the database is the heart of your system, and the easiest place to break it
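To make the waiting-queue effect concrete, here's a minimal, hypothetical sketch that models a bounded connection pool as a semaphore. The class and method names are invented for illustration; a real pool like HikariCP adds health checks, leak detection, and metrics on top of this basic idea.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of a bounded connection pool.
// When every "connection" is held, new callers wait up to a timeout
// instead of blocking forever, the same role connection-timeout plays.
public class PoolSketch {
    private final Semaphore permits;
    private final long timeoutMs;

    public PoolSketch(int poolSize, long timeoutMs) {
        this.permits = new Semaphore(poolSize);
        this.timeoutMs = timeoutMs;
    }

    // Returns true if a connection was acquired in time,
    // false if the pool stayed exhausted for the whole timeout.
    public boolean tryAcquire() {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Releasing promptly is what keeps the waiting queue from building up.
    public void release() {
        permits.release();
    }
}
```

With a pool of one and a 50 ms timeout, a second caller fails fast instead of queuing indefinitely, which is the behavior the timeout tuning is meant to guarantee.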
2. When One Service Failed… and Everything Followed
Another incident started with a small failure.
The payment service went down for a few seconds.
That's normal, right?
Except… the entire system slowed down.
What Went Wrong?
Every service depended on the payment service.
When it failed:
- Requests retried automatically
- Threads got blocked waiting
- Load increased
- Other services got affected
This is called a cascading failure.
Why This Happens
In microservices, everything is connected.
If you don't control failures: failure spreads faster than success
The Fix
We introduced a circuit breaker using Resilience4j.
@CircuitBreaker(name = "paymentService", fallbackMethod = "fallback")
public String callPayment() {
    return restTemplate.getForObject(url, String.class);
}

public String fallback(Exception e) {
    return "Payment service unavailable";
}
What Changed in My Thinking
Before: "If a service fails, just retry"
After: retry is not always resilience; it can amplify the failure
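As a rough illustration of what a circuit breaker does, here's a hypothetical, stripped-down version: after a threshold of consecutive failures it goes open and serves the fallback without calling the dependency at all. Resilience4j adds sliding windows, half-open probing, and metrics on top; the class name here is invented.

```java
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker.
// CLOSED: calls pass through. OPEN: calls fail fast with the fallback,
// so a struggling dependency gets breathing room instead of more load.
public class BreakerSketch {
    private int consecutiveFailures = 0;
    private final int threshold;

    public BreakerSketch(int threshold) {
        this.threshold = threshold;
    }

    public String call(Supplier<String> remote, String fallback) {
        if (consecutiveFailures >= threshold) {
            return fallback; // OPEN: do not even invoke the remote call
        }
        try {
            String result = remote.get();
            consecutiveFailures = 0; // success closes the breaker again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++; // failures accumulate toward the threshold
            return fallback;
        }
    }
}
```

A real breaker also needs a way to leave the open state (the half-open probe); this sketch stays open forever, which is the part a library handles for you.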
3. The Bug That Charged Users Twice
This one was painful.
Some users were charged twice for the same transaction.
No exceptions. No crashes. Just incorrect behavior.
What Happened?
- User clicked "Pay"
- Network delay occurred
- Client retried request
- Backend processed it again
Result: duplicate payment
Root Cause
We assumed: "Requests will come once"
Production reality: requests can be retried anytime
The Fix: Idempotency
We introduced an Idempotency-Key header.
@PostMapping("/pay")
public ResponseEntity<String> pay(
        @RequestHeader("Idempotency-Key") String key) {
    if (repository.exists(key)) {
        return ResponseEntity.ok(repository.get(key));
    }
    String result = processPayment(); // process payment once
    repository.save(key, result);     // remember the result for retries
    return ResponseEntity.ok(result);
}
What Changed in My Thinking
Before: "API correctness = logic correctness"
After: API correctness = logic + retry safety
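The same retry-safety idea can be sketched without Spring. This hypothetical in-memory version uses computeIfAbsent so a retried key never triggers a second charge; in production the key-to-result store would live in the database or Redis, ideally written in the same transaction as the payment.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory sketch of idempotency-key handling.
public class IdempotencySketch {
    private final ConcurrentHashMap<String, String> processed = new ConcurrentHashMap<>();
    private final AtomicInteger charges = new AtomicInteger();

    public String pay(String idempotencyKey) {
        // computeIfAbsent runs the charge at most once per key,
        // even if the client retries the same request.
        return processed.computeIfAbsent(idempotencyKey, k -> {
            charges.incrementAndGet(); // the actual charge happens here
            return "charged:" + k;
        });
    }

    public int chargeCount() {
        return charges.get();
    }
}
```

A retry with the same key gets the stored response back, so the client still sees success while the user is charged exactly once.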
4. The Race Condition Nobody Saw Coming
We had an inventory system.
Simple logic:
- Check stock
- Deduct quantity
But occasionally… stock went negative.
Why?
Multiple requests hit at the same time.
Both:
- Read stock = 10
- Deduct 5
- Save result
Expected result: stock = 0. But because both requests read the same starting value, their writes collide, and sometimes: stock = -5
The Fix
We implemented optimistic locking:
@Version
private Long version;
Or:
- Used atomic DB operations
- Added distributed locks where needed
What Changed in My Thinking
Before: "Code runs sequentially"
After: production runs in parallel, and everything can collide
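The check-then-deduct race can also be closed in plain Java with a compare-and-set loop, which is the in-memory analogue of the @Version check: if someone changed the value between your read and your write, the write fails and you retry. A hypothetical sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the read-check-deduct step done atomically.
public class StockSketch {
    private final AtomicInteger stock;

    public StockSketch(int initial) {
        this.stock = new AtomicInteger(initial);
    }

    // Returns true if the deduction succeeded, false if stock was insufficient.
    public boolean deduct(int qty) {
        while (true) {
            int current = stock.get();
            if (current < qty) {
                return false; // never allow negative stock
            }
            // compareAndSet fails if another thread changed stock in between;
            // that failed "version check" sends us around the loop to re-read.
            if (stock.compareAndSet(current, current - qty)) {
                return true;
            }
        }
    }

    public int remaining() {
        return stock.get();
    }
}
```

The key property: a concurrent deduction can delay this one, but it can never be silently overwritten, and the balance can never go below zero.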
5. The Kafka Lag That Quietly Broke Everything
We used Kafka for async processing.
Everything worked fine… until traffic increased.
Then:
- Events started piling up
- Consumers couldn't keep up
- Processing delays increased
Root Cause
- Single consumer
- Heavy processing per message
- Not enough partitions
Fix
- Increased partitions
- Scaled consumers
- Made processing lighter
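Throughput planning here is mostly arithmetic: a consumer that spends t milliseconds per message can handle 1000/t messages per second, so the consumer count (and hence the partition count, since at most one consumer in a group reads each partition) has to cover the incoming rate. A small sketch with illustrative numbers, not the incident's real ones:

```java
// Hypothetical back-of-envelope sizing for a consumer group.
public class ThroughputSketch {
    // Consumers needed so the group's combined capacity covers the incoming rate.
    public static int consumersNeeded(double messagesPerSecond, int perMessageMillis) {
        // One sequential consumer handles 1000 / perMessageMillis messages per second.
        double perConsumerCapacity = 1000.0 / perMessageMillis;
        return (int) Math.ceil(messagesPerSecond / perConsumerCapacity);
    }

    public static void main(String[] args) {
        // 500 msg/s at 20 ms each needs at least 10 consumers,
        // and therefore at least 10 partitions to run them in parallel.
        System.out.println(consumersNeeded(500, 20));
    }
}
```

The same arithmetic explains the fix list above: more partitions raise the parallelism ceiling, more consumers use it, and lighter processing raises per-consumer capacity.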
What Changed in My Thinking
Before: "Async = scalable"
After: async systems still need throughput planning
6. The Deployment That Took Down Production
We deployed a new version.
Within minutes: errors everywhere
Root Cause
- New code expected new DB schema
- Old version still running
- Incompatibility
Fix
We changed deployment strategy:
- Backward-compatible DB changes
- Blue-green deployment
- Feature flags
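A feature flag in its simplest form is just a guarded branch that defaults to the old path, so enabling the new code becomes a runtime switch instead of a redeploy. A hypothetical minimal sketch, with an invented flag name; real flag systems add remote config, gradual rollout, and per-user targeting:

```java
import java.util.Map;

// Hypothetical feature-flag guard: the new code path ships dark
// and a bad change is one flag flip away from being off again.
public class FlagSketch {
    private final Map<String, Boolean> flags;

    public FlagSketch(Map<String, Boolean> flags) {
        this.flags = flags;
    }

    public String checkout() {
        // The old path stays the default until the flag is explicitly enabled.
        if (flags.getOrDefault("new-checkout", false)) {
            return "new-flow";
        }
        return "old-flow";
    }
}
```

Combined with backward-compatible schema changes, this decouples "code is deployed" from "feature is live", which is what makes the deployment itself low-risk.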
What Changed in My Thinking
Before: "Deployment = release"
After: deployment = risk management
If there's one thing production teaches you, it's this:
Systems don't fail in obvious ways. They fail in subtle, interconnected ways.
And the real skill is not writing code.
It's:
- Anticipating failures
- Designing for them
- Recovering gracefully
That's what interviewers are really testing when they ask system questions.
Before You Go…
If this felt real and useful:
- Buy me a coffee
- Comment: Have you faced something similar in production?
- Follow me for more deep backend and system design content
Let's build systems that survive real-world chaos.