(A Backend Engineer's Playbook for Staying Calm Under Pressure)
Introduction: The First Time Production Broke
The first time I was on-call when production went down, I panicked.
Dashboards were red. Support tickets were coming in. Slack was noisy. Everyone was asking: "What happened?"
And I had no clear answer.
I jumped between logs, metrics, and code without a plan. I tried random fixes. Nothing worked fast.
That incident taught me something important:
Handling production issues is less about being fast and more about being calm and systematic.
Over time, after dealing with multiple L3/L4 incidents, I developed a simple mental model to debug issues without losing control.
This is the approach I use today.
1. My First Rule: Don't React. Stabilize.
When an incident starts, the instinct is to "do something" immediately.
That's dangerous.
My first goal is always stability, not root cause.
I ask:
- Is the system still serving users?
- Can we reduce impact quickly?
- Do we need a rollback or temporary mitigation?
Examples:
- Scale up instances
- Disable a problematic feature flag
- Route traffic away
- Roll back last deployment
This buys time.
Time = clarity.
Without stability, every investigation becomes rushed and messy.
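If your system has feature flags, the "disable a problematic feature flag" mitigation can be this mechanical. Here is a minimal Python sketch; the environment-variable flag store, the flag name, and the pricing functions are all made up for illustration:

```python
import os

def feature_enabled(flag_name: str, default: bool = False) -> bool:
    """Read a feature flag, failing safe to `default` if it is unset."""
    value = os.environ.get(flag_name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "on"}

def compute_price_v1(payload: dict) -> dict:
    # Known-good legacy path (stub for illustration).
    return {"price": payload.get("base", 0)}

def compute_price_v2(payload: dict) -> dict:
    # New, suspect path (stub for illustration).
    return {"price": payload.get("base", 0) * 1.1}

def handle_request(payload: dict) -> dict:
    # The risky path sits behind a flag, so the mitigation is
    # "flip the flag", not "write a hotfix under pressure".
    if feature_enabled("ENABLE_NEW_PRICING_PATH", default=False):
        return compute_price_v2(payload)
    return compute_price_v1(payload)

if __name__ == "__main__":
    print(handle_request({"base": 100}))
```

The default matters: if the flag source is ever unreachable, the code falls back to the known-good path on its own.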
2. My Mental Model: Three Questions
Once things are stable, I focus on three questions:
1️⃣ What Changed?
Most incidents are caused by change.
- New deployment?
- Config update?
- Data migration?
- Traffic spike?
- External dependency issue?
I always check this first.
No change → investigation becomes harder.
Change → investigation becomes focused.
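A toy illustration of that first question, assuming you can export recent changes with timestamps (from a deploy log, CI/CD, or config history; the records below are invented): list what landed shortly before the incident started and treat those as the first suspects.

```python
from datetime import datetime, timedelta

# Hypothetical change records; in practice, pulled from deploy/config history.
recent_changes = [
    {"what": "deploy payments-service v1.42", "at": datetime(2024, 5, 3, 9, 55)},
    {"what": "feature flag ENABLE_NEW_PRICING_PATH turned on", "at": datetime(2024, 5, 3, 10, 10)},
    {"what": "nightly data migration", "at": datetime(2024, 5, 3, 2, 0)},
]

incident_start = datetime(2024, 5, 3, 10, 15)
window = timedelta(hours=2)

# Changes that landed within the window before the incident started.
suspects = [c for c in recent_changes
            if incident_start - window <= c["at"] <= incident_start]

for change in sorted(suspects, key=lambda c: c["at"], reverse=True):
    print(f"{change['at']:%H:%M}  {change['what']}")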
2️⃣ Where Is It Failing?
Not "what is broken", but where.
- API layer?
- Database?
- Cache?
- Queue?
- External service?
This narrows the search space.
Instead of "everything is slow", I try to reach:
"Requests are slow between Service A and DB."
3️⃣ Why Is It Failing?
Only after the first two.
This is where root cause lives:
- Resource exhaustion
- Bad query
- Race condition
- Missing validation
- Timeout mismatch
Jumping to "why" too early usually leads to wrong conclusions.
3. My Debugging Checklist
Over time, I built a personal checklist. I follow it almost mechanically during incidents.
✅ Step 1: Check Monitoring
- Latency
- Error rate
- Throughput
- CPU / Memory
- DB connections
This tells me if it's system-wide or localized.
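If your monitoring stack exposes a Prometheus-style query API (an assumption; any metrics backend works), pulling one headline number programmatically is often faster than squinting at dashboards. The host and metric name below are placeholders:

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder host
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

# Instant query against the standard Prometheus HTTP API.
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": ERROR_RATE_QUERY},
    timeout=5,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _timestamp, value = series["value"]
    print(f"current 5xx error rate: {float(value):.2%}")
```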
✅ Step 2: Check Logs (With Purpose)
I don't "scroll logs".
I search for:
- Correlation IDs
- Error patterns
- First occurrence time
- Repeated failures
Random log reading wastes time.
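For structured (JSON) logs, that kind of purposeful search fits in a few lines. The file path, field names, and correlation ID below are placeholders for whatever your log format actually uses:

```python
import json
from collections import Counter

CORRELATION_ID = "req-7f3a9c"   # taken from one failing request
LOG_FILE = "service.log"        # assumes one JSON object per line, in time order

first_error_time = None
error_patterns = Counter()

with open(LOG_FILE) as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the search
        if entry.get("correlation_id") != CORRELATION_ID:
            continue
        if entry.get("level") == "ERROR":
            error_patterns[entry.get("message", "unknown")] += 1
            if first_error_time is None:
                first_error_time = entry.get("timestamp")

print("first error at:", first_error_time)
for message, count in error_patterns.most_common(5):
    print(f"{count:>4}  {message}")
```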
✅ Step 3: Reproduce (If Possible)
If I can reproduce:
- In staging
- With sample payload
- With specific user
Debugging becomes 10× faster.
If I can't, I rely more on metrics and traces.
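When reproduction is possible, it usually starts with replaying a captured payload against staging. A sketch, with a made-up staging endpoint and payload file:

```python
import json
import requests

STAGING_URL = "https://staging.example.internal/api/orders"  # placeholder

# Payload captured from logs or from the affected user's request.
with open("failing_payload.json") as f:
    payload = json.load(f)

resp = requests.post(STAGING_URL, json=payload, timeout=10)

# Status, latency, and the start of the body are usually enough to
# confirm you are looking at the same failure as production.
print(resp.status_code, f"{resp.elapsed.total_seconds():.2f}s")
print(resp.text[:500])
```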
✅ Step 4: Trace the Request Path
I mentally trace:
Client → Gateway → Service → DB → External API → Response
Where does it slow down or fail?
Distributed tracing helps a lot here.
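If OpenTelemetry is available (one common option, not the only one), nested spans make the slow hop visible by name and duration. A minimal sketch using the opentelemetry-sdk console exporter, with span names mirroring the hypothetical path above:

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console; in production this would be an
# exporter pointed at your tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("incident-demo")

with tracer.start_as_current_span("gateway"):
    with tracer.start_as_current_span("service-a"):
        with tracer.start_as_current_span("db-query"):
            time.sleep(0.2)   # stand-in for the slow hop you are hunting
        with tracer.start_as_current_span("external-api"):
            time.sleep(0.05)
```

Each span prints with its start and end time, so the 200 ms db-query stands out against the rest of the path.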
✅ Step 5: Validate Assumptions
During incidents, assumptions multiply.
I constantly ask:
"Do I have evidence for this?"
If not, I verify.
4. How I Prioritize During an Incident
Not all tasks are equal during production issues.
My priority order:
🔥 1. User Impact
Anything that reduces user pain comes first.
🔥 2. Data Safety
Avoid corruption, duplication, or loss.
🔥 3. Service Stability
Prevent cascading failures.
🔥 4. Root Cause
Only after the above.
Fixing root cause while users are suffering is often the wrong trade-off.
5. Communication: The Hidden Skill
Early in my career, I underestimated communication.
Now I see it as part of engineering.
During incidents, I:
✅ Share Status Regularly
Even if there is nothing new to report:
"Still investigating DB latency. Next update in 15 mins."
Silence increases anxiety.
✅ Separate Facts from Hypotheses
I clearly say:
- Fact: "DB connections are maxed out."
- Hypothesis: "Possibly due to new batch job."
This builds trust.
✅ Avoid Blame
Incidents are system failures, not people failures. Blame kills collaboration.
6. How This Approach Changed My Work
Since following this approach:
- I resolve incidents faster
- I make fewer risky fixes
- I feel less stressed
- Teams trust my updates
- Postmortems are more useful
Most importantly, I stopped feeling "out of control" during outages.
7. What I Wish I Knew Earlier
If I could tell my younger self one thing:
You don't need to know everything during an incident. You need a process.
Calm comes from structure.
Not from experience alone.
Conclusion: Panic Is Optional
Production issues will never stop.
Systems grow. Complexity increases. Failures happen.
But panic is optional.
With a clear mental model, a simple checklist, and honest communication, you can handle incidents with confidence, even when everything looks broken.
My Personal Rule Today
Before every major action, I ask:
"Is this helping users right now, or just helping me feel busy?"
That question keeps me grounded.