When you're running a performance-critical service, P99 latency isn't just a number — it's the heartbeat of user experience. We had a Go service that was fast, reliable, and battle-tested. But when we looked closer, the long tail told another story. Our P99 latency was consistently higher than we wanted — hovering in the 700–800ms range. Not terrible, but not great either.

We wanted better. And we got it. Here's how we cut that number in half.

Step 1: Understand the Tail

P99 latency is where all your architectural shortcuts show up to party. Unlike average or median latency, it reflects the slow requests that still happen regularly: one request in every hundred is at least this slow. To understand what was going on, we turned to distributed tracing and request sampling.
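
To make the gap between averages and the tail concrete, here is a minimal sketch that computes the mean, median, and P99 with the nearest-rank method. The latencies are made up for illustration (98 requests at 40ms plus two slow ones), and the helper names percentile, mean, and syntheticSamples are illustrative, not code from the service.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// It sorts a copy so the caller's slice keeps its order.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// 1-based nearest rank, clamped to the valid range.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// mean returns the arithmetic mean of the samples.
func mean(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	var total time.Duration
	for _, s := range samples {
		total += s
	}
	return total / time.Duration(len(samples))
}

// syntheticSamples builds 100 made-up latencies: 98 fast requests, 2 slow.
func syntheticSamples() []time.Duration {
	samples := make([]time.Duration, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 40*time.Millisecond)
	}
	return append(samples, 750*time.Millisecond, 800*time.Millisecond)
}

func main() {
	s := syntheticSamples()
	fmt.Println("mean:", mean(s))
	fmt.Println("p50: ", percentile(s, 50))
	fmt.Println("p99: ", percentile(s, 99))
}
```

With these samples the mean comes out at 54.7ms and the median at 40ms, while P99 is 750ms. The average looks healthy; the tail does not.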

We integrated OpenTelemetry and started correlating spans with log entries. Patterns quickly emerged:

  • Latency spikes in third-party API calls
  • Cache misses triggering fallback database queries
  • Serialization delays for large payloads
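
The idea behind correlating spans with logs can be sketched without the OpenTelemetry SDK: carry one trace ID through the request context, record named span durations under it, and stamp every log line with the same ID. Everything below (the trace type, record, the "req-42" ID, the span names and durations) is a hypothetical stand-in built on the standard library, not the OpenTelemetry API and not the service's actual instrumentation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// traceKey is the private context key under which the trace is stored.
type traceKey struct{}

// trace is a toy stand-in for a tracing SDK's trace: one ID plus the
// durations of the named spans recorded under it.
type trace struct {
	id    string
	mu    sync.Mutex
	spans map[string]time.Duration
}

// withTrace attaches a new trace with the given ID to ctx.
func withTrace(ctx context.Context, id string) (context.Context, *trace) {
	t := &trace{id: id, spans: make(map[string]time.Duration)}
	return context.WithValue(ctx, traceKey{}, t), t
}

// record stores a span duration on the trace carried by ctx and returns a
// log line tagged with the same trace ID, so logs and spans can be joined.
func record(ctx context.Context, name string, d time.Duration) string {
	t, ok := ctx.Value(traceKey{}).(*trace)
	if !ok {
		return fmt.Sprintf("trace_id=none span=%s duration=%s", name, d)
	}
	t.mu.Lock()
	t.spans[name] = d
	t.mu.Unlock()
	return fmt.Sprintf("trace_id=%s span=%s duration=%s", t.id, name, d)
}

// slowest reports which recorded span took the longest.
func (t *trace) slowest() (string, time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	var name string
	var longest time.Duration
	for n, d := range t.spans {
		if d > longest {
			name, longest = n, d
		}
	}
	return name, longest
}

// demoTrace records three made-up spans for one request and returns the
// last log line plus the name of the slowest span.
func demoTrace() (string, string) {
	ctx, tr := withTrace(context.Background(), "req-42")
	record(ctx, "cache_lookup", 2*time.Millisecond)
	record(ctx, "db_fallback", 180*time.Millisecond)
	line := record(ctx, "serialize", 35*time.Millisecond)
	name, _ := tr.slowest()
	return line, name
}

func main() {
	line, name := demoTrace()
	fmt.Println(line)
	fmt.Println("slowest span:", name)
}
```

Once every log line carries the trace ID, a slow request can be looked up by ID and its spans read side by side, which is how a pattern like "cache miss, then slow database fallback" becomes visible.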

Step 2: Cache Smarter, Not Harder

Caching was already in place — but it was naive. We had a one-size-fits-all TTL, no stampede protection, and inconsistent key invalidation.

So we:

  • Introduced cache warming on deployment
  • Used single-flight to collapse concurrent misses on the same key into a single upstream call
  • Added background refresh for expensive keys

These changes alone dropped P99 latency by nearly 200ms, especially under load.
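
For readers who have not met single-flight before, here is a minimal hand-rolled sketch of the idea using only the standard library. It is an illustration, not the implementation the service uses; in practice the golang.org/x/sync/singleflight package provides this. The group type, the "user:42" key, and the 50ms simulated upstream delay are all assumptions made for the demo.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-flight load for a key.
type call struct {
	done chan struct{} // closed when the load finishes
	val  string
	err  error
}

// group collapses concurrent loads of the same key into one upstream call.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// do runs fn at most once per key at a time. Callers that arrive while a
// load is in flight wait for its result instead of calling upstream.
// Simplification: if fn panics, waiters are never released.
func (g *group) do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.mu.Unlock()
		<-c.done // wait for the leader's result
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	close(c.done)

	g.mu.Lock()
	delete(g.calls, key) // later callers trigger a fresh load
	g.mu.Unlock()
	return c.val, c.err
}

// stampede fires n concurrent lookups for one key and returns how many
// upstream calls were actually made.
func stampede(n int) int64 {
	var g group
	var upstream int64
	var wg sync.WaitGroup
	start := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // release all goroutines together
			g.do("user:42", func() (string, error) {
				atomic.AddInt64(&upstream, 1)
				time.Sleep(50 * time.Millisecond) // simulated slow upstream
				return "profile", nil
			})
		}()
	}
	close(start)
	wg.Wait()
	return atomic.LoadInt64(&upstream)
}

func main() {
	fmt.Println("upstream calls for 50 concurrent lookups:", stampede(50))
}
```

On a typical run the 50 lookups result in a single upstream call, though the exact count depends on scheduling. Without the group, every one of them would hit the upstream at once, which is the stampede a cold or expired key causes.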

Step 3: Rethink Goroutine Spawning

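// Before: every request spawns its own goroutines, with no upper bound
// on how many run at once and nothing waiting for them to finish.
http.HandleFunc("/handler", func(w http.ResponseWriter, r *http.Request) {
    go queryDB()
    go callExternalAPI()
})

// After: a fixed pool of workers drains one shared job queue.
// Job, process, and maxWorkers are defined elsewhere in the service.
var jobQueue = make(chan Job)

// handleRequest hands the job to the pool. The send blocks until a worker
// is free, so load backs up in the caller instead of spawning goroutines.
func handleRequest(job Job) {
    jobQueue <- job
}

// worker processes jobs until jobQueue is closed.
func worker() {
    for job := range jobQueue {
        process(job)
    }
}

// startWorkers launches the pool; call it once at startup.
func startWorkers() {
    for i := 0; i < maxWorkers; i++ {
        go worker()
    }
}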

Every request handler spawned goroutines to parallelize DB queries and external calls. Great in theory, but under load the goroutine count grew without bound, which led to unpredictable scheduling delays and extra context switching.

We shifted to structured concurrency using worker pools and limited fan-out.

Less chaos = more predictable tail latency.
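
Limited fan-out inside a single request can be sketched with a buffered channel used as a counting semaphore. This is one common way to do it, not necessarily the one used in the service; fanOut, slowTasks, the limit of 4, and the 10ms task duration are illustrative assumptions. The peak counter exists only to show that the bound holds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fanOut runs every task but keeps at most limit of them in flight at once
// (limit must be at least 1). It returns the highest concurrency observed,
// which is tracked purely for demonstration.
func fanOut(tasks []func(), limit int) int64 {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64

	for _, task := range tasks {
		sem <- struct{}{} // blocks while limit tasks are already running
		wg.Add(1)
		go func(run func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done

			n := atomic.AddInt64(&inFlight, 1)
			for { // raise peak to n with a compare-and-swap loop
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			run()
			atomic.AddInt64(&inFlight, -1)
		}(task)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

// slowTasks builds n placeholder tasks that each sleep briefly.
func slowTasks(n int) []func() {
	tasks := make([]func(), n)
	for i := range tasks {
		tasks[i] = func() { time.Sleep(10 * time.Millisecond) }
	}
	return tasks
}

func main() {
	peak := fanOut(slowTasks(20), 4)
	fmt.Println("peak concurrency:", peak)
}
```

The reported peak never exceeds the limit, no matter how many tasks are queued. That cap is what keeps the scheduler's workload, and with it the tail, predictable under load.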

Step 4: Tune the Database Calls

Next, we tackled query time. Our ORM-generated queries were doing more work than needed. We:

  • Tightened column selection so queries fetched only the columns they needed
  • Used query hints where applicable
  • Indexed based on actual usage patterns from query logs

Suddenly, the slowest 1% of DB responses went from 400ms+ to under 150ms.

Step 5: Serialize Leaner

Large JSON payloads were another culprit. We optimized structs to exclude unused fields, switched to jsoniter, and compressed outbound responses where applicable.

This gave us another 50–100ms win on some endpoints.
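
A rough sketch of the first and third ideas, using only the standard library: map the internal record to a lean wire struct, then gzip the encoded bytes. The fullOrder and leanOrder types and their fields are hypothetical, invented for this example, and encoding/json stands in for jsoniter so the snippet runs without third-party modules.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// fullOrder mirrors a hypothetical internal record, including fields that
// clients never read.
type fullOrder struct {
	ID       int64    `json:"id"`
	Status   string   `json:"status"`
	AuditLog []string `json:"audit_log"`
	Internal string   `json:"internal_notes"`
}

// leanOrder is the wire shape: unused fields dropped, empty ones omitted.
type leanOrder struct {
	ID     int64  `json:"id"`
	Status string `json:"status,omitempty"`
}

// toLean copies only the fields the client needs.
func toLean(o fullOrder) leanOrder { return leanOrder{ID: o.ID, Status: o.Status} }

// gzipBytes compresses b at the default gzip level.
func gzipBytes(b []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(b); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes data and the gzip footer
		return nil, err
	}
	return buf.Bytes(), nil
}

// encodedSizes returns the JSON size of n orders in full and lean form,
// plus the gzipped size of the lean form.
func encodedSizes(n int) (full, lean, leanGz int, err error) {
	fullOrders := make([]fullOrder, n)
	leanOrders := make([]leanOrder, n)
	for i := range fullOrders {
		fullOrders[i] = fullOrder{
			ID: int64(i), Status: "shipped",
			AuditLog: []string{"created", "paid", "packed"}, Internal: "n/a",
		}
		leanOrders[i] = toLean(fullOrders[i])
	}
	fullJSON, err := json.Marshal(fullOrders)
	if err != nil {
		return 0, 0, 0, err
	}
	leanJSON, err := json.Marshal(leanOrders)
	if err != nil {
		return 0, 0, 0, err
	}
	gz, err := gzipBytes(leanJSON)
	if err != nil {
		return 0, 0, 0, err
	}
	return len(fullJSON), len(leanJSON), len(gz), nil
}

func main() {
	full, lean, leanGz, err := encodedSizes(500)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Printf("full JSON: %d B, lean JSON: %d B, lean+gzip: %d B\n", full, lean, leanGz)
}
```

In an HTTP handler, the gzipped body should only be sent when the client's Accept-Encoding header lists gzip, with Content-Encoding: gzip set on the response. For the encoder swap, jsoniter offers jsoniter.ConfigCompatibleWithStandardLibrary as a drop-in for encoding/json, so a struct like leanOrder should carry over unchanged.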

Results

  • P99 Latency: From ~750ms to ~350ms
  • System throughput: Unchanged
  • Error rates: Stable
  • Customer feedback: Tangibly better

Lessons Learned

  • Always look at your tail, not just your averages
  • Distributed tracing reveals what metrics hide
  • Concurrency should be predictable, not wild-west
  • Small inefficiencies compound — profile everything

Final Word

Go gave us the foundation for a fast service. But the tools, techniques, and attention to detail turned "fast enough" into "damn fast."

If you're living in the long tail, you're not alone. Start measuring, start asking questions, and don't be afraid to rebuild the slow parts. The rewards are real.

Got your own latency wins or war stories? Let's hear them.