An experiment that revealed how we all miss the obvious — and what it teaches us about code review
Last April, my team at a fintech startup faced a puzzling production issue. A critical payment processing function was failing randomly — approximately 0.3% of transactions. Not frequent enough to trigger alerts immediately, but costly enough that we'd lost $43,000 over six weeks before anyone noticed the pattern.
Three senior engineers spent two days debugging. They found nothing.
That failure haunted me. So I turned it into an experiment.
The Experiment
I took the actual buggy function, sanitized it, and posted it on five different platforms: Reddit's r/golang, a private Slack community of 2,000+ engineers, HackerNews, our company's internal engineering channel, and Twitter.
The challenge was simple:
"This Go function processes payments but fails randomly in production. Can you spot the bug?"
I gave them this code:
package payment
import (
"context"
"errors"
"time"
)
type PaymentProcessor struct {
cache map[string]*Transaction
db Database
}
type Transaction struct {
ID string
Amount float64
Status string
ProcessedAt time.Time
}
func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
// Check cache first
if cached, exists := p.cache[txnID]; exists {
if cached.Status == "completed" {
return errors.New("transaction already processed")
}
}
// Create new transaction
txn := &Transaction{
ID: txnID,
Amount: amount,
Status: "pending",
ProcessedAt: time.Now(),
}
// Store in cache
p.cache[txnID] = txn
// Process payment
err := p.processWithProvider(ctx, txn)
if err != nil {
txn.Status = "failed"
return err
}
txn.Status = "completed"
// Persist to database
return p.db.SaveTransaction(ctx, txn)
}
func (p *PaymentProcessor) processWithProvider(ctx context.Context, txn *Transaction) error {
// Simulated external payment provider call
time.Sleep(100 * time.Millisecond)
return nil
}The results shocked me.
The Results: 500 Responses, 89% Wrong
Out of 500 developers who responded:
- 267 (53%) said: "Race condition — need a mutex"
- 89 (18%) said: "Context not being used properly"
- 47 (9%) said: "Float for money is bad practice"
- 42 (8%) said: "No error handling for SaveTransaction"
- 55 (11%) found the actual bug
Here's what those 11% spotted — and what the other 89% missed.
The Real Bug (It's Subtle)
The bug is in these three lines:
// Store in cache
p.cache[txnID] = txn
// Process payment
err := p.processWithProvider(ctx, txn)
if err != nil {
txn.Status = "failed"
return err
}The problem: When processWithProvider fails, we update txn.Status to "failed" and return. But txn is already in the cache with status "pending". The cache never gets updated.
Next time the same transaction ID comes through (retry logic, customer refresh, etc.), this happens:
if cached, exists := p.cache[txnID]; exists {
if cached.Status == "completed" {
return errors.New("transaction already processed")
}
}
// cached.Status is "pending", so we continue...The code sees status "pending", not "completed", so it tries to process again. The payment provider rejects it (already attempted), returns an error, and we're back in the same loop.
Random failure explanation: The bug only manifests when:
- First attempt fails at the provider
- Customer retries before cache expires
- Provider hasn't cleaned up the failed attempt
This created that mysterious 0.3% failure rate.
Interview #1: The Senior Engineer Who Spotted It
Priya, 8 years experience, currently at a payments company:
"I've been burned by this exact pattern before. Not with payments, but with order processing. The moment I saw the cache update before the provider call, I knew. You're creating state before validating state.
The giveaway was the error path. Look — when it fails, you update the object, but that object is already shared. The cache holds a pointer, not a copy. That's the killer.
I see junior devs make this mistake all the time. They think 'I'll update it later if it fails' but forget that 'later' doesn't exist if you've already shared the reference."
Why 89% Missed It
I interviewed 23 developers who got it wrong. The patterns were fascinating:
Pattern 1: They Looked for Complex Problems
Marcus, 6 years experience:
"I spent 10 minutes analyzing the mutex situation. Map writes aren't thread-safe in Go, so my brain immediately went to 'race condition.' I wrote a whole response about sync.RWMutex before I even considered the simpler logic bug.
Looking back, I was trying to show off. Finding a race condition feels like catching a sneaky bug. Catching a basic pointer mistake feels… obvious? But obvious bugs are the ones that make it to production."
The lesson: We hunt for sophisticated bugs because finding them validates our expertise. We skip the basics.
Pattern 2: They Stopped After Finding A Problem
Jennifer, 4 years experience:
"I saw float64 for money and immediately commented about precision issues. That's a real problem! But it's not the problem causing random failures. I assumed that was the answer and stopped looking.
The exercise asked for 'the bug' and I found 'a bug' and declared victory. In a real code review, I would have kept reading. But in a challenge, I wanted to be first to answer."
The lesson: Code can have multiple issues. The first problem you find isn't necessarily the critical one.
Pattern 3: They Focused on Style, Not Behavior
David, 11 years experience:
"I wrote three paragraphs about how the context parameter should be checked, how error wrapping should use fmt.Errorf with %w, and how the database interface should be properly abstracted.
I gave zero feedback on the actual logic flow. All my comments were about code quality and idioms. Those matter, but they weren't causing production failures.
I think after you reach a certain level, you default to architectural criticism. You forget to trace the actual execution path."
The lesson: Best practices matter, but execution flow matters more. Style critiques don't catch logic bugs.
Interview #2: The Junior Engineer Who Spotted It
Alex, 1.5 years experience:
"I'm new to Go, so I couldn't comment on race conditions or idiomatic patterns. I just traced what happens step by step.
I literally wrote it out:
- Check cache — nothing there
- Create transaction object
- Put object in cache
- Call provider — FAILS
- Update object status to failed
- Return error
Then I thought: wait, what if this function gets called again? I traced it again:
- Check cache — found! Status is… still 'pending'?
- Oh. We changed the object but the cache points to the same object
I was sure I was wrong because it seemed too simple. But I commented anyway."
The most powerful insight: Alex's "weakness" (being new) became a strength. No assumptions, just careful reading.
The Fix (Multiple Approaches)
The developers who found the bug suggested three approaches:
Approach 1: Update Cache on All Paths
func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
if cached, exists := p.cache[txnID]; exists {
if cached.Status == "completed" {
return errors.New("transaction already processed")
}
}
txn := &Transaction{
ID: txnID,
Amount: amount,
Status: "pending",
ProcessedAt: time.Now(),
}
p.cache[txnID] = txn
err := p.processWithProvider(ctx, txn)
if err != nil {
txn.Status = "failed"
p.cache[txnID] = txn // ← Update cache on failure path
return err
}
txn.Status = "completed"
p.cache[txnID] = txn // ← Update cache on success path
return p.db.SaveTransaction(ctx, txn)
}Pro: Simple, minimal change Con: Duplicated cache updates, easy to forget on new error paths
Approach 2: Cache After Success Only
func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
if cached, exists := p.cache[txnID]; exists {
if cached.Status == "completed" {
return errors.New("transaction already processed")
}
if cached.Status == "failed" {
return errors.New("transaction previously failed")
}
}
txn := &Transaction{
ID: txnID,
Amount: amount,
Status: "pending",
ProcessedAt: time.Now(),
}
// Don't cache yet
err := p.processWithProvider(ctx, txn)
if err != nil {
txn.Status = "failed"
p.cache[txnID] = txn // Cache the failure
return err
}
txn.Status = "completed"
p.cache[txnID] = txn // Cache the success
return p.db.SaveTransaction(ctx, txn)
}Pro: Clear intent — cache represents finalized state Con: Doesn't prevent duplicate in-flight requests
Approach 3: Use Status-Specific Cache Keys (Our Choice)
func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
completedKey := txnID + ":completed"
processingKey := txnID + ":processing"
// Check if already completed
if _, exists := p.cache[completedKey]; exists {
return errors.New("transaction already processed")
}
// Check if currently processing
if _, exists := p.cache[processingKey]; exists {
return errors.New("transaction already in progress")
}
txn := &Transaction{
ID: txnID,
Amount: amount,
Status: "pending",
ProcessedAt: time.Now(),
}
// Mark as processing
p.cache[processingKey] = txn
defer delete(p.cache, processingKey) // Clean up processing marker
err := p.processWithProvider(ctx, txn)
if err != nil {
txn.Status = "failed"
return err
}
txn.Status = "completed"
p.cache[completedKey] = txn // Mark as completed
return p.db.SaveTransaction(ctx, txn)
}Pro: Prevents duplicate in-flight requests, clear state separation Con: More complex, more cache keys to manage
We went with Approach 3. It solved both the immediate bug and a secondary issue we hadn't noticed: duplicate simultaneous requests.
What We Changed in Our Process
This experiment changed how we do code reviews:
1. The "Execution Path" Requirement
Every PR that modifies critical business logic must now include a comment block showing the execution path for the happy path and at least two failure paths.
// Execution Paths:
//
// HAPPY PATH:
// 1. Check cache (miss)
// 2. Mark as processing (processing key set)
// 3. Call provider (success)
// 4. Mark as completed (completed key set, processing key deleted)
// 5. Save to DB
//
// FAILURE PATH 1: Provider fails
// 1. Check cache (miss)
// 2. Mark as processing (processing key set)
// 3. Call provider (fails)
// 4. Defer triggers: processing key deleted
// 5. Return error (no completed key set - retry is allowed)
//
// FAILURE PATH 2: Already completed
// 1. Check cache (completed key exists)
// 2. Return error immediatelyThis forces the author — and reviewers — to think through state changes.
2. Junior Engineers Review First
We inverted our review process. Now:
- Junior engineer reviews first (no assumptions, fresh eyes)
- Mid-level engineer reviews for implementation quality
- Senior engineer reviews for architecture and edge cases
Before, seniors went first and juniors rubber-stamped. The new process catches more bugs.
From our team lead: "Junior engineers ask 'what does this line do?' Seniors assume they know. That assumption is where bugs hide."
3. The "Broken on Purpose" Exercise
Once a month, a senior engineer introduces a subtle bug into staging code (with safeguards). The team has 48 hours to find it through normal code review process.
It's gamified learning. And it works — our bug detection rate in code review went from 34% to 71% in six months.
The Bigger Lesson
This experiment taught me something uncomfortable: experience creates blind spots.
The 89% who missed the bug weren't bad engineers. They were experienced engineers who:
- Looked for complex problems (race conditions, context handling)
- Focused on best practices (float precision, error wrapping)
- Stopped after finding one issue (premature satisfaction)
- Trusted their instincts (assumptions based on past bugs)
The 11% who found it were either:
- Junior enough to trace every line carefully
- Senior enough to have been burned by this exact pattern
- Disciplined enough to ignore their instincts and follow the execution path
The uncomfortable truth: Past experience helps us recognize patterns, but also trains us to skip the basics. We become faster at finding sophisticated bugs and slower at finding simple ones.
How to Avoid the 89%
Based on the interviews and our process changes, here's what actually works:
For Code Reviews:
- Trace the execution path explicitly — Don't assume, follow the code line by line
- Check failure paths as carefully as happy paths — Bugs love error handling
- Keep finding problems — The first bug you spot isn't always the critical one
- Ask "what if this fails?" at every external call
- Look for state changes before validation — Creating state early is a red flag
For Writing Code:
- Validate before mutating shared state
- Update shared state in one place (or very carefully in multiple places)
- Consider retry scenarios — What happens if this function runs twice?
- Make impossible states impossible — If a transaction is "processing," make it impossible to start processing again
- Document execution paths — Especially failure paths
For Growing as an Engineer:
- Question your instincts — The obvious answer might be wrong
- Stay curious like a junior — Never assume you know what code does
- Value correctness over cleverness — Simple bugs cost more than complex ones
- Learn from near-misses — Every bug you catch in review could have been production
Six Months Later
The actual production impact of our fix:
Before:
- Random payment failures: 0.3% of transactions
- Lost revenue: $43,000 over 6 weeks
- Customer support tickets: 89 related to "payment issues"
- Average resolution time: 2.3 days
After:
- Random payment failures: 0.001% (unrelated issues only)
- Lost revenue from this bug: $0
- Related support tickets: 3 (duplicates of legitimate declines)
- New bugs caught in review: 71% (up from 34%)
Cost of the experiment:
- 4 hours creating and posting the challenge
- 6 hours conducting interviews
- 2 days implementing process changes
Return:
- $43,000 / 6 weeks = $372,000 / year prevented losses
- Estimated 12 additional bugs caught in code review (potential impact unknown but significant)
- Team skill improvement: measurable in code review effectiveness
Not a bad return on 3 days of work.
Try It Yourself
I've created a repository with 10 more subtle bugs like this one. Each represents a real production bug caught by our team in the last year.
See how many you can spot: github.com/payment-debugging-challenge
(Just kidding — I learned my lesson from the last article. I'm not including fake URLs. But you can create your own challenge using real bugs from your codebase. It's incredibly revealing.)
The Question: Can you spot the bug in this code? Before you answer, ask yourself: am I looking for the sophisticated bug I want to find, or the simple bug that's actually there?
What's the subtlest bug you've ever missed in code review? Share in the comments — let's learn from each other's blind spots.