(A Backend Engineer's Playbook for Staying Calm Under Pressure)

Introduction: The First Time Production Broke

The first time I was on-call when production went down, I panicked.

Dashboards were red. Support tickets were coming in. Slack was noisy. Everyone was asking: "What happened?"

And I had no clear answer.

I jumped between logs, metrics, and code without a plan. I tried random fixes. Nothing worked fast.

That incident taught me something important:

Handling production issues is less about being fast and more about being calm and systematic.

Over time, after dealing with multiple L3/L4 incidents, I developed a simple mental model to debug issues without losing control.

This is the approach I use today.

1. My First Rule: Don't React. Stabilize.

When an incident starts, the instinct is to "do something" immediately.

That's dangerous.

My first goal is always stability, not root cause.

I ask:

  • Is the system still serving users?
  • Can we reduce impact quickly?
  • Do we need a rollback or temporary mitigation?

Examples:

  • Scale up instances
  • Disable a problematic feature flag
  • Route traffic away
  • Roll back last deployment

This buys time.

Time = clarity.

Without stability, every investigation becomes rushed and messy.
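
A feature-flag kill switch only works as a mitigation if the code path was built with one. Here's a minimal sketch, assuming a hypothetical ENABLE_NEW_CHECKOUT flag read from the environment (the flag name and both flow functions are made up for illustration):

```python
import os


def new_checkout_flow(order: dict) -> dict:
    # Suspect code path (stub for illustration).
    return {"flow": "new", "order_id": order["id"]}


def legacy_checkout_flow(order: dict) -> dict:
    # Known-good fallback (stub for illustration).
    return {"flow": "legacy", "order_id": order["id"]}


def checkout(order: dict) -> dict:
    # Kill switch: flipping ENABLE_NEW_CHECKOUT off at runtime routes traffic
    # to the stable path without a deploy, which buys time to investigate.
    if os.getenv("ENABLE_NEW_CHECKOUT", "false").lower() == "true":
        return new_checkout_flow(order)
    return legacy_checkout_flow(order)


print(checkout({"id": 101}))
```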

2. My Mental Model: Three Questions

Once things are stable, I focus on three questions:

1๏ธโƒฃ What Changed?

Most incidents are caused by change.

  • New deployment?
  • Config update?
  • Data migration?
  • Traffic spike?
  • External dependency issue?

I always check this first.

No change → investigation becomes harder.

Change → investigation becomes focused.
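
To make "what changed" concrete, here is a small sketch that correlates an incident start time with recent change events. The change log, services, and timestamps are all made-up examples; in a real setup this data comes from your deploy tool, CI system, or config audit trail:

```python
from datetime import datetime, timedelta

# Hypothetical change log; in practice this comes from deploy/CI/audit tooling.
changes = [
    {"type": "deploy",    "service": "checkout", "at": datetime(2024, 5, 1, 9, 55)},
    {"type": "config",    "service": "payments", "at": datetime(2024, 5, 1, 7, 10)},
    {"type": "migration", "service": "orders",   "at": datetime(2024, 4, 30, 22, 0)},
]

incident_start = datetime(2024, 5, 1, 10, 5)
window = timedelta(hours=2)

# Changes that landed shortly before the incident are the first suspects.
suspects = [c for c in changes if incident_start - window <= c["at"] <= incident_start]
for c in suspects:
    print(f"{c['at']:%H:%M}  {c['type']:<9} {c['service']}")
```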

2๏ธโƒฃ Where Is It Failing?

Not "what is broken", but where.

  • API layer?
  • Database?
  • Cache?
  • Queue?
  • External service?

This narrows the search space.

Instead of "everything is slow", I try to narrow it down to:

"Requests are slow between Service A and DB."

3๏ธโƒฃ Why Is It Failing?

Only after the first two.

This is where root cause lives:

  • Resource exhaustion
  • Bad query
  • Race condition
  • Missing validation
  • Timeout mismatch

Jumping to "why" too early usually leads to wrong conclusions.
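
One of these causes shows up often enough to deserve a concrete illustration: the timeout mismatch. A tiny sanity-check sketch, with made-up numbers:

```python
# Illustrative timeout budget check (every number here is a made-up example).
# A classic "why": a downstream layer is allowed to run longer than the layer
# above is willing to wait, so slow-but-successful calls still surface as errors.
budget = [
    ("gateway",   5.0),  # edge / load balancer timeout, seconds
    ("service",  10.0),  # service's own HTTP client timeout
    ("database", 30.0),  # database statement timeout
]

# Each downstream layer should give up *before* the layer above it does.
for (upper, upper_t), (lower, lower_t) in zip(budget, budget[1:]):
    if lower_t >= upper_t:
        print(f"Timeout mismatch: {lower} ({lower_t}s) can outlive {upper} ({upper_t}s)")
```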

3. My Debugging Checklist

Over time, I built a personal checklist. I follow it almost mechanically during incidents.

✅ Step 1: Check Monitoring

  • Latency
  • Error rate
  • Throughput
  • CPU / Memory
  • DB connections

This tells me if it's system-wide or localized.
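
If the monitoring stack exposes an API, I sometimes pull these numbers from a terminal instead of clicking through dashboards. A sketch assuming a Prometheus-style setup (the URL, metric names, and `service` label are assumptions about the environment, not a standard):

```python
import requests

# Quick "system-wide or localized?" check against a hypothetical Prometheus.
PROM_URL = "http://prometheus.internal:9090/api/v1/query"

queries = {
    "5xx_rate":    'sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)',
    "p99_latency": 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))',
}

for name, promql in queries.items():
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=5)
    for result in resp.json()["data"]["result"]:
        service = result["metric"].get("service", "unknown")
        value = float(result["value"][1])
        print(f"{name:<12} {service:<15} {value:.3f}")
```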

✅ Step 2: Check Logs (With Purpose)

I don't "scroll logs".

I search for:

  • Correlation IDs
  • Error patterns
  • First occurrence time
  • Repeated failures

Random log reading wastes time.
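
As a concrete example, the kind of targeted search I mean looks roughly like this. The log path, line format, and correlation ID are made up; in practice this is usually a query in your log platform rather than a script:

```python
import re
from datetime import datetime

# Follow one failing request by its correlation ID and find when it first broke.
CORRELATION_ID = "req-7f3a9c"              # hypothetical ID from a failing request
LOG_PATH = "/var/log/app/service.log"      # hypothetical log file

first_error_at = None
with open(LOG_PATH) as f:
    for line in f:
        if CORRELATION_ID not in line:
            continue
        print(line.rstrip())
        if "ERROR" in line and first_error_at is None:
            # Assumes each line starts with an ISO-8601 timestamp.
            match = re.match(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", line)
            if match:
                first_error_at = datetime.fromisoformat(match.group(0))

print("First error for this request:", first_error_at)
```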

✅ Step 3: Reproduce (If Possible)

If I can reproduce:

  • In staging
  • With sample payload
  • With specific user

Debugging becomes 10× faster.

If I can't, I rely more on metrics and traces.
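
When reproduction is possible, it rarely needs to be fancy. A minimal sketch of replaying a captured payload against a staging endpoint (the URL, path, and payload are made-up examples; the real payload usually comes from logs or the bug report):

```python
import requests

# Replay a captured request against staging and look at the error shape.
STAGING_URL = "https://staging.example.internal/api/orders"

sample_payload = {"user_id": 42, "items": [{"sku": "ABC-123", "qty": 3}]}

resp = requests.post(STAGING_URL, json=sample_payload, timeout=10)
print(resp.status_code)
print(resp.text[:500])  # enough to see the error without flooding the terminal
```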

✅ Step 4: Trace the Request Path

I mentally trace:

Client → Gateway → Service → DB → External API → Response

Where does it slow down or fail?

Distributed tracing helps a lot here.
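
Even without a tracing backend, timing each hop by hand answers the same question. A tiny sketch (the hop names and sleeps are placeholders standing in for real work; production systems would use OpenTelemetry spans instead):

```python
import time
from contextlib import contextmanager

# Poor man's tracing: time each hop of the request path to see where latency lives.
timings: dict[str, float] = {}

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

def handle_request() -> None:
    with span("gateway"):
        time.sleep(0.01)
    with span("service"):
        time.sleep(0.02)
    with span("db_query"):
        time.sleep(0.15)   # the slow hop stands out immediately
    with span("external_api"):
        time.sleep(0.03)

handle_request()
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13} {seconds * 1000:6.1f} ms")
```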

✅ Step 5: Validate Assumptions

During incidents, assumptions multiply.

I constantly ask:

"Do I have evidence for this?"

If not, I verify.

4. How I Prioritize During an Incident

Not all tasks are equal during production issues.

My priority order:

🥇 1. User Impact

Anything that reduces user pain comes first.

🥈 2. Data Safety

Avoid corruption, duplication, or loss.

🥉 3. Service Stability

Prevent cascading failures.

🥉 4. Root Cause

Only after the above.

Fixing root cause while users are suffering is often the wrong trade-off.

5. Communication: The Hidden Skill

Early in my career, I underestimated communication.

Now I see it as part of engineering.

During incidents, I:

✅ Share Status Regularly

Even if there is no update:

"Still investigating DB latency. Next update in 15 mins."

Silence increases anxiety.

✅ Separate Facts from Hypotheses

I clearly say:

  • Fact: "DB connections are maxed out."
  • Hypothesis: "Possibly due to new batch job."

This builds trust.

✅ Avoid Blame

Incidents are system failures, not people failures. Blame kills collaboration.

6. How This Approach Changed My Work

Since following this approach:

  • I resolve incidents faster
  • I make fewer risky fixes
  • I feel less stressed
  • Teams trust my updates
  • Postmortems are more useful

Most importantly, I stopped feeling "out of control" during outages.

7. What I Wish I Knew Earlier

If I could tell my younger self one thing:

You don't need to know everything during an incident. You need a process.

Calm comes from structure.

Not from experience alone.

Conclusion: Panic Is Optional

Production issues will never stop.

Systems grow. Complexity increases. Failures happen.

But panic is optional.

With a clear mental model, a simple checklist, and honest communication, you can handle incidents with confidence, even when everything looks broken.

My Personal Rule Today

Before every major action, I ask:

"Is this helping users right now, or just helping me feel busy?"

That question keeps me grounded.