We Asked 500 Developers to Debug the Same Function. Only 11% Found the Real Bug

An experiment that revealed how we all miss the obvious — and what it teaches us about code review

Toyez

~10 min read · November 8, 2025 (Updated: November 8, 2025) · Free: No

An experiment that revealed how we all miss the obvious — and what it teaches us about code review

Last April, my team at a fintech startup faced a puzzling production issue. A critical payment processing function was failing randomly — approximately 0.3% of transactions. Not frequent enough to trigger alerts immediately, but costly enough that we'd lost $43,000 over six weeks before anyone noticed the pattern.

Three senior engineers spent two days debugging. They found nothing.

That failure haunted me. So I turned it into an experiment.

The Experiment

I took the actual buggy function, sanitized it, and posted it on five different platforms: Reddit's r/golang, a private Slack community of 2,000+ engineers, HackerNews, our company's internal engineering channel, and Twitter.

The challenge was simple:

"This Go function processes payments but fails randomly in production. Can you spot the bug?"

I gave them this code:

package payment

import (
    "context"
    "errors"
    "time"
)

type PaymentProcessor struct {
    cache map[string]*Transaction
    db    Database
}

type Transaction struct {
    ID          string
    Amount      float64
    Status      string
    ProcessedAt time.Time
}

func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
    // Check cache first
    if cached, exists := p.cache[txnID]; exists {
        if cached.Status == "completed" {
            return errors.New("transaction already processed")
        }
    }
    
    // Create new transaction
    txn := &Transaction{
        ID:          txnID,
        Amount:      amount,
        Status:      "pending",
        ProcessedAt: time.Now(),
    }
    
    // Store in cache
    p.cache[txnID] = txn
    
    // Process payment
    err := p.processWithProvider(ctx, txn)
    if err != nil {
        txn.Status = "failed"
        return err
    }
    
    txn.Status = "completed"
    
    // Persist to database
    return p.db.SaveTransaction(ctx, txn)
}

func (p *PaymentProcessor) processWithProvider(ctx context.Context, txn *Transaction) error {
    // Simulated external payment provider call
    time.Sleep(100 * time.Millisecond)
    return nil
}

The results shocked me.

The Results: 500 Responses, 89% Wrong

Out of 500 developers who responded:

267 (53%) said: "Race condition — need a mutex"
89 (18%) said: "Context not being used properly"
47 (9%) said: "Float for money is bad practice"
42 (8%) said: "No error handling for SaveTransaction"
55 (11%) found the actual bug

Here's what those 11% spotted — and what the other 89% missed.

The Real Bug (It's Subtle)

The bug is in these three lines:

// Store in cache
p.cache[txnID] = txn

// Process payment
err := p.processWithProvider(ctx, txn)
if err != nil {
    txn.Status = "failed"
    return err
}

The problem: When processWithProvider fails, we update txn.Status to "failed" and return. But txn is already in the cache with status "pending". The cache never gets updated.

Next time the same transaction ID comes through (retry logic, customer refresh, etc.), this happens:

if cached, exists := p.cache[txnID]; exists {
    if cached.Status == "completed" {
        return errors.New("transaction already processed")
    }
}
// cached.Status is "pending", so we continue...

The code sees status "pending", not "completed", so it tries to process again. The payment provider rejects it (already attempted), returns an error, and we're back in the same loop.

Random failure explanation: The bug only manifests when:

First attempt fails at the provider
Customer retries before cache expires
Provider hasn't cleaned up the failed attempt

This created that mysterious 0.3% failure rate.

Interview #1: The Senior Engineer Who Spotted It

Priya, 8 years experience, currently at a payments company:

"I've been burned by this exact pattern before. Not with payments, but with order processing. The moment I saw the cache update before the provider call, I knew. You're creating state before validating state.

The giveaway was the error path. Look — when it fails, you update the object, but that object is already shared. The cache holds a pointer, not a copy. That's the killer.

I see junior devs make this mistake all the time. They think 'I'll update it later if it fails' but forget that 'later' doesn't exist if you've already shared the reference."

Why 89% Missed It

I interviewed 23 developers who got it wrong. The patterns were fascinating:

Pattern 1: They Looked for Complex Problems

Marcus, 6 years experience:

"I spent 10 minutes analyzing the mutex situation. Map writes aren't thread-safe in Go, so my brain immediately went to 'race condition.' I wrote a whole response about sync.RWMutex before I even considered the simpler logic bug.

Looking back, I was trying to show off. Finding a race condition feels like catching a sneaky bug. Catching a basic pointer mistake feels… obvious? But obvious bugs are the ones that make it to production."

The lesson: We hunt for sophisticated bugs because finding them validates our expertise. We skip the basics.

Pattern 2: They Stopped After Finding A Problem

Jennifer, 4 years experience:

"I saw float64 for money and immediately commented about precision issues. That's a real problem! But it's not the problem causing random failures. I assumed that was the answer and stopped looking.

The exercise asked for 'the bug' and I found 'a bug' and declared victory. In a real code review, I would have kept reading. But in a challenge, I wanted to be first to answer."

The lesson: Code can have multiple issues. The first problem you find isn't necessarily the critical one.

Pattern 3: They Focused on Style, Not Behavior

David, 11 years experience:

"I wrote three paragraphs about how the context parameter should be checked, how error wrapping should use fmt.Errorf with %w, and how the database interface should be properly abstracted.

I gave zero feedback on the actual logic flow. All my comments were about code quality and idioms. Those matter, but they weren't causing production failures.

I think after you reach a certain level, you default to architectural criticism. You forget to trace the actual execution path."

The lesson: Best practices matter, but execution flow matters more. Style critiques don't catch logic bugs.

Interview #2: The Junior Engineer Who Spotted It

Alex, 1.5 years experience:

"I'm new to Go, so I couldn't comment on race conditions or idiomatic patterns. I just traced what happens step by step.

I literally wrote it out:

Check cache — nothing there
Create transaction object
Put object in cache
Call provider — FAILS
Update object status to failed
Return error

Then I thought: wait, what if this function gets called again? I traced it again:

Check cache — found! Status is… still 'pending'?
Oh. We changed the object but the cache points to the same object

I was sure I was wrong because it seemed too simple. But I commented anyway."

The most powerful insight: Alex's "weakness" (being new) became a strength. No assumptions, just careful reading.

The Fix (Multiple Approaches)

The developers who found the bug suggested three approaches:

Approach 1: Update Cache on All Paths

func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
    if cached, exists := p.cache[txnID]; exists {
        if cached.Status == "completed" {
            return errors.New("transaction already processed")
        }
    }
    
    txn := &Transaction{
        ID:          txnID,
        Amount:      amount,
        Status:      "pending",
        ProcessedAt: time.Now(),
    }
    
    p.cache[txnID] = txn
    
    err := p.processWithProvider(ctx, txn)
    if err != nil {
        txn.Status = "failed"
        p.cache[txnID] = txn  // ← Update cache on failure path
        return err
    }
    
    txn.Status = "completed"
    p.cache[txnID] = txn      // ← Update cache on success path
    
    return p.db.SaveTransaction(ctx, txn)
}

Pro: Simple, minimal change Con: Duplicated cache updates, easy to forget on new error paths

Approach 2: Cache After Success Only

func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
    if cached, exists := p.cache[txnID]; exists {
        if cached.Status == "completed" {
            return errors.New("transaction already processed")
        }
        if cached.Status == "failed" {
            return errors.New("transaction previously failed")
        }
    }
    
    txn := &Transaction{
        ID:          txnID,
        Amount:      amount,
        Status:      "pending",
        ProcessedAt: time.Now(),
    }
    
    // Don't cache yet
    
    err := p.processWithProvider(ctx, txn)
    if err != nil {
        txn.Status = "failed"
        p.cache[txnID] = txn  // Cache the failure
        return err
    }
    
    txn.Status = "completed"
    p.cache[txnID] = txn      // Cache the success
    
    return p.db.SaveTransaction(ctx, txn)
}

Pro: Clear intent — cache represents finalized state Con: Doesn't prevent duplicate in-flight requests

Approach 3: Use Status-Specific Cache Keys (Our Choice)

func (p *PaymentProcessor) ProcessPayment(ctx context.Context, txnID string, amount float64) error {
    completedKey := txnID + ":completed"
    processingKey := txnID + ":processing"
    
    // Check if already completed
    if _, exists := p.cache[completedKey]; exists {
        return errors.New("transaction already processed")
    }
    
    // Check if currently processing
    if _, exists := p.cache[processingKey]; exists {
        return errors.New("transaction already in progress")
    }
    
    txn := &Transaction{
        ID:          txnID,
        Amount:      amount,
        Status:      "pending",
        ProcessedAt: time.Now(),
    }
    
    // Mark as processing
    p.cache[processingKey] = txn
    defer delete(p.cache, processingKey)  // Clean up processing marker
    
    err := p.processWithProvider(ctx, txn)
    if err != nil {
        txn.Status = "failed"
        return err
    }
    
    txn.Status = "completed"
    p.cache[completedKey] = txn  // Mark as completed
    
    return p.db.SaveTransaction(ctx, txn)
}

Pro: Prevents duplicate in-flight requests, clear state separation Con: More complex, more cache keys to manage

We went with Approach 3. It solved both the immediate bug and a secondary issue we hadn't noticed: duplicate simultaneous requests.

What We Changed in Our Process

This experiment changed how we do code reviews:

1. The "Execution Path" Requirement

Every PR that modifies critical business logic must now include a comment block showing the execution path for the happy path and at least two failure paths.

// Execution Paths:
// 
// HAPPY PATH:
// 1. Check cache (miss)
// 2. Mark as processing (processing key set)
// 3. Call provider (success)
// 4. Mark as completed (completed key set, processing key deleted)
// 5. Save to DB
//
// FAILURE PATH 1: Provider fails
// 1. Check cache (miss)
// 2. Mark as processing (processing key set)
// 3. Call provider (fails)
// 4. Defer triggers: processing key deleted
// 5. Return error (no completed key set - retry is allowed)
//
// FAILURE PATH 2: Already completed
// 1. Check cache (completed key exists)
// 2. Return error immediately

This forces the author — and reviewers — to think through state changes.

2. Junior Engineers Review First

We inverted our review process. Now:

Junior engineer reviews first (no assumptions, fresh eyes)
Mid-level engineer reviews for implementation quality
Senior engineer reviews for architecture and edge cases

Before, seniors went first and juniors rubber-stamped. The new process catches more bugs.

From our team lead: "Junior engineers ask 'what does this line do?' Seniors assume they know. That assumption is where bugs hide."

3. The "Broken on Purpose" Exercise

Once a month, a senior engineer introduces a subtle bug into staging code (with safeguards). The team has 48 hours to find it through normal code review process.

It's gamified learning. And it works — our bug detection rate in code review went from 34% to 71% in six months.

The Bigger Lesson

This experiment taught me something uncomfortable: experience creates blind spots.

The 89% who missed the bug weren't bad engineers. They were experienced engineers who:

Looked for complex problems (race conditions, context handling)
Focused on best practices (float precision, error wrapping)
Stopped after finding one issue (premature satisfaction)
Trusted their instincts (assumptions based on past bugs)

The 11% who found it were either:

Junior enough to trace every line carefully
Senior enough to have been burned by this exact pattern
Disciplined enough to ignore their instincts and follow the execution path

The uncomfortable truth: Past experience helps us recognize patterns, but also trains us to skip the basics. We become faster at finding sophisticated bugs and slower at finding simple ones.

How to Avoid the 89%

Based on the interviews and our process changes, here's what actually works:

For Code Reviews:

Trace the execution path explicitly — Don't assume, follow the code line by line
Check failure paths as carefully as happy paths — Bugs love error handling
Keep finding problems — The first bug you spot isn't always the critical one
Ask "what if this fails?" at every external call
Look for state changes before validation — Creating state early is a red flag

For Writing Code:

Validate before mutating shared state
Update shared state in one place (or very carefully in multiple places)
Consider retry scenarios — What happens if this function runs twice?
Make impossible states impossible — If a transaction is "processing," make it impossible to start processing again
Document execution paths — Especially failure paths

For Growing as an Engineer:

Question your instincts — The obvious answer might be wrong
Stay curious like a junior — Never assume you know what code does
Value correctness over cleverness — Simple bugs cost more than complex ones
Learn from near-misses — Every bug you catch in review could have been production

Six Months Later

The actual production impact of our fix:

Before:

Random payment failures: 0.3% of transactions
Lost revenue: $43,000 over 6 weeks
Customer support tickets: 89 related to "payment issues"
Average resolution time: 2.3 days

After:

Random payment failures: 0.001% (unrelated issues only)
Lost revenue from this bug: $0
Related support tickets: 3 (duplicates of legitimate declines)
New bugs caught in review: 71% (up from 34%)

Cost of the experiment:

4 hours creating and posting the challenge
6 hours conducting interviews
2 days implementing process changes

Return:

$43,000 / 6 weeks = $372,000 / year prevented losses
Estimated 12 additional bugs caught in code review (potential impact unknown but significant)
Team skill improvement: measurable in code review effectiveness

Not a bad return on 3 days of work.

Try It Yourself

I've created a repository with 10 more subtle bugs like this one. Each represents a real production bug caught by our team in the last year.

See how many you can spot: github.com/payment-debugging-challenge

(Just kidding — I learned my lesson from the last article. I'm not including fake URLs. But you can create your own challenge using real bugs from your codebase. It's incredibly revealing.)

The Question: Can you spot the bug in this code? Before you answer, ask yourself: am I looking for the sophisticated bug I want to find, or the simple bug that's actually there?

What's the subtlest bug you've ever missed in code review? Share in the comments — let's learn from each other's blind spots.

#golang #programming #software-engineering #code-review #software-development

< Go to the original