(A Backend Engineer's Playbook for Staying Calm Under Pressure)
Introduction: The First Time Production Broke
The first time I was on-call when production went down, I panicked.
Dashboards were red. Support tickets were coming in. Slack was noisy. Everyone was asking: "What happened?"
And I had no clear answer.
I jumped between logs, metrics, and code without a plan. I tried random fixes. Nothing worked fast.
That incident taught me something important:
Handling production issues is less about being fast and more about being calm and systematic.
Over time, after dealing with multiple L3/L4 incidents, I developed a simple mental model to debug issues without losing control.
This is the approach I use today.
1. My First Rule: Don't React. Stabilize.
When an incident starts, the instinct is to "do something" immediately.
That's dangerous.
My first goal is always stability, not root cause.
I ask:
- Is the system still serving users?
- Can we reduce impact quickly?
- Do we need a rollback or temporary mitigation?
Examples:
- Scale up instances
- Disable a problematic feature flag
- Route traffic away
- Roll back last deployment
This buys time.
Time = clarity.
Without stability, every investigation becomes rushed and messy.
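If your system has feature flags, the "disable a problematic feature flag" mitigation can be this mechanical. Here is a minimal Python sketch; the environment-variable flag store, the flag name, and the pricing functions are all made up for illustration:

```python
import os

def feature_enabled(flag_name: str, default: bool = False) -> bool:
    """Read a feature flag, failing safe to `default` if it is unset."""
    value = os.environ.get(flag_name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "on"}

def compute_price_v1(payload: dict) -> dict:
    # Known-good legacy path (stub for illustration).
    return {"price": payload.get("base", 0)}

def compute_price_v2(payload: dict) -> dict:
    # New, suspect path (stub for illustration).
    return {"price": payload.get("base", 0) * 1.1}

def handle_request(payload: dict) -> dict:
    # The risky path sits behind a flag, so the mitigation is
    # "flip the flag", not "write a hotfix under pressure".
    if feature_enabled("ENABLE_NEW_PRICING_PATH", default=False):
        return compute_price_v2(payload)
    return compute_price_v1(payload)

if __name__ == "__main__":
    print(handle_request({"base": 100}))
```

The default matters: if the flag source is ever unreachable, the code falls back to the known-good path on its own.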
2. My Mental Model: Three Questions
Once things are stable, I focus on three questions:
1️⃣ What Changed?
Most incidents are caused by change.
- New deployment?
- Config update?
- Data migration?
- Traffic spike?
- External dependency issue?
I always check this first.
No change → investigation becomes harder.
Change → investigation becomes focused.
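A toy illustration of that first question, assuming you can export recent changes with timestamps (from a deploy log, CI/CD, or config history; the records below are invented): list what landed shortly before the incident started and treat those as the first suspects.

```python
from datetime import datetime, timedelta

# Hypothetical change records; in practice, pulled from deploy/config history.
recent_changes = [
    {"what": "deploy payments-service v1.42", "at": datetime(2024, 5, 3, 9, 55)},
    {"what": "feature flag ENABLE_NEW_PRICING_PATH turned on", "at": datetime(2024, 5, 3, 10, 10)},
    {"what": "nightly data migration", "at": datetime(2024, 5, 3, 2, 0)},
]

incident_start = datetime(2024, 5, 3, 10, 15)
window = timedelta(hours=2)

# Changes that landed within the window before the incident started.
suspects = [c for c in recent_changes
            if incident_start - window <= c["at"] <= incident_start]

for change in sorted(suspects, key=lambda c: c["at"], reverse=True):
    print(f"{change['at']:%H:%M}  {change['what']}")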
2️⃣ Where Is It Failing?
Not "what is broken", but where.
- API layer?
- Database?
- Cache?
- Queue?
- External service?
This narrows the search space.
Instead of "everything is slow", I try to reach:
"Requests are slow between Service A and DB."
3️⃣ Why Is It Failing?
Only after the first two.
This is where root cause lives:
- Resource exhaustion
- Bad query
- Race condition
- Missing validation
- Timeout mismatch
Jumping to "why" too early usually leads to wrong conclusions.
3. My Debugging Checklist
Over time, I built a personal checklist. I follow it almost mechanically during incidents.
✅ Step 1: Check Monitoring
- Latency
- Error rate
- Throughput
- CPU / Memory
- DB connections
This tells me if it's system-wide or localized.
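If your monitoring stack exposes a Prometheus-style query API (an assumption; any metrics backend works), pulling one headline number programmatically is often faster than squinting at dashboards. The host and metric name below are placeholders:

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder host
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

# Instant query against the standard Prometheus HTTP API.
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": ERROR_RATE_QUERY},
    timeout=5,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _timestamp, value = series["value"]
    print(f"current 5xx error rate: {float(value):.2%}")
```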
✅ Step 2: Check Logs (With Purpose)
I don't "scroll logs".
I search for:
- Correlation IDs
- Error patterns
- First occurrence time
- Repeated failures
Random log reading wastes time.
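For structured (JSON) logs, that kind of purposeful search fits in a few lines. The file path, field names, and correlation ID below are placeholders for whatever your log format actually uses:

```python
import json
from collections import Counter

CORRELATION_ID = "req-7f3a9c"   # taken from one failing request
LOG_FILE = "service.log"        # assumes one JSON object per line, in time order

first_error_time = None
error_patterns = Counter()

with open(LOG_FILE) as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the search
        if entry.get("correlation_id") != CORRELATION_ID:
            continue
        if entry.get("level") == "ERROR":
            error_patterns[entry.get("message", "unknown")] += 1
            if first_error_time is None:
                first_error_time = entry.get("timestamp")

print("first error at:", first_error_time)
for message, count in error_patterns.most_common(5):
    print(f"{count:>4}  {message}")
```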
✅ Step 3: Reproduce (If Possible)
If I can reproduce:
- In staging
- With sample payload
- With specific user
Debugging becomes 10× faster.
If I can't, I rely more on metrics and traces.
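When reproduction is possible, it usually starts with replaying a captured payload against staging. A sketch, with a made-up staging endpoint and payload file:

```python
import json
import requests

STAGING_URL = "https://staging.example.internal/api/orders"  # placeholder

# Payload captured from logs or from the affected user's request.
with open("failing_payload.json") as f:
    payload = json.load(f)

resp = requests.post(STAGING_URL, json=payload, timeout=10)

# Status, latency, and the start of the body are usually enough to
# confirm you are looking at the same failure as production.
print(resp.status_code, f"{resp.elapsed.total_seconds():.2f}s")
print(resp.text[:500])
```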
✅ Step 4: Trace the Request Path
I mentally trace:
Client → Gateway → Service → DB → External API → Response
Where does it slow down or fail?
Distributed tracing helps a lot here.
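If OpenTelemetry is available (one common option, not the only one), nested spans make the slow hop visible by name and duration. A minimal sketch using the opentelemetry-sdk console exporter, with span names mirroring the hypothetical path above:

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console; in production this would be an
# exporter pointed at your tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("incident-demo")

with tracer.start_as_current_span("gateway"):
    with tracer.start_as_current_span("service-a"):
        with tracer.start_as_current_span("db-query"):
            time.sleep(0.2)   # stand-in for the slow hop you are hunting
        with tracer.start_as_current_span("external-api"):
            time.sleep(0.05)
```

Each span prints with its start and end time, so the 200 ms db-query stands out against the rest of the path.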
✅ Step 5: Validate Assumptions
During incidents, assumptions multiply.
I constantly ask:
"Do I have evidence for this?"
If not, I verify.
4. How I Prioritize During an Incident
Not all tasks are equal during production issues.
My priority order:
🔥 1. User Impact
Anything that reduces user pain comes first.
🔥 2. Data Safety
Avoid corruption, duplication, or loss.
🔥 3. Service Stability
Prevent cascading failures.
🔥 4. Root Cause
Only after the above.
Fixing root cause while users are suffering is often the wrong trade-off.
5. Communication: The Hidden Skill
Early in my career, I underestimated communication.
Now I see it as part of engineering.
During incidents, I:
✅ Share Status Regularly
Even if there is nothing new to report:
"Still investigating DB latency. Next update in 15 mins."
Silence increases anxiety.
✅ Separate Facts from Hypotheses
I clearly say:
- Fact: "DB connections are maxed out."
- Hypothesis: "Possibly due to new batch job."
This builds trust.
✅ Avoid Blame
Incidents are system failures, not people failures. Blame kills collaboration.
6. How This Approach Changed My Work
Since following this approach:
- I resolve incidents faster
- I make fewer risky fixes
- I feel less stressed
- Teams trust my updates
- Postmortems are more useful
Most importantly, I stopped feeling "out of control" during outages.
7. What I Wish I Knew Earlier
If I could tell my younger self one thing:
You don't need to know everything during an incident. You need a process.
Calm comes from structure.
Not from experience alone.
Conclusion: Panic Is Optional
Production issues will never stop.
Systems grow. Complexity increases. Failures happen.
But panic is optional.
With a clear mental model, a simple checklist, and honest communication, you can handle incidents with confidence, even when everything looks broken.
My Personal Rule Today
Before every major action, I ask:
"Is this helping users right now, or just helping me feel busy?"
That question keeps me grounded.