It wasn't a bug. It wasn't even bad code. It was a silent bottleneck, hiding behind abstractions — and it cost us milliseconds, thousands of times per second.
Context: A Rust Service That Was Already "Fast Enough"
We built an internal Rust service to power a high-throughput ingestion pipeline for our metrics system — think: logs, metrics, and traces flying in 24/7 from production apps.
The stack was solid:
- Rust + Actix-web for the server
- `tokio` under the hood for async IO
- `serde_json` for parsing events
- Internal in-memory queue using `crossbeam`
- Batches flushed to Kafka every 500ms
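The 500ms flush loop can be sketched with nothing but the standard library. This is a hypothetical stand-in, not our production code: the real service uses `crossbeam` and a Kafka producer, and `flush_to_sink` here is a placeholder that just prints.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real Kafka producer.
fn flush_to_sink(batch: &[String]) {
    println!("flushed {} events", batch.len());
}

// Collects events and flushes every `flush_every`, plus once on shutdown.
// Returns the total number of events flushed.
fn run_batcher(rx: mpsc::Receiver<String>, flush_every: Duration) -> usize {
    let mut batch = Vec::new();
    let mut total = 0;
    let mut deadline = Instant::now() + flush_every;
    loop {
        // Remaining time until the next scheduled flush (zero if overdue).
        let timeout = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(timeout) {
            Ok(event) => batch.push(event),
            Err(mpsc::RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    flush_to_sink(&batch);
                    total += batch.len();
                    batch.clear();
                }
                deadline = Instant::now() + flush_every;
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                // Senders are gone: flush whatever is buffered and stop.
                if !batch.is_empty() {
                    flush_to_sink(&batch);
                    total += batch.len();
                }
                return total;
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let batcher = thread::spawn(move || run_batcher(rx, Duration::from_millis(500)));
    for i in 0..3 {
        tx.send(format!("event-{i}")).unwrap();
    }
    drop(tx); // closing the channel triggers the final flush
    println!("total flushed: {}", batcher.join().unwrap());
}
```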
Latency wasn't our biggest concern — until it was. On paper, we were doing fine: P99 latencies were around 9ms, P50 was below 2ms.
But we noticed something odd: CPU usage would spike under load, but throughput plateaued. When we scaled horizontally, it didn't scale linearly.
So, naturally, we profiled it.
The Problem: Fast Code, Hidden Hotspot
We'd already done the usual:
- `cargo bench` to test bottlenecks ✅
- `tokio-console` to trace slow spans ✅
- `tracing` logs for task hangs ✅
Nothing screamed "fix me."
But our lead suggested something simple, and a bit old-school:

> "Just run `perf` on it. Let's look at the flamegraph."
The Magic Perf Command
Here's exactly what we ran in production (don't worry, safe read-only mode):

```bash
sudo perf record -F 99 --call-graph dwarf -p $(pidof our-binary-name)
```

Then:

```bash
sudo perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
```

💡 We used `inferno`, not Brendan Gregg's original flamegraph scripts. It's faster, written in Rust (of course), and integrates beautifully.
The Surprise: serde_json::from_str Was the Culprit
Here's what the flamegraph showed us: a wide, fat chunk near the top.

```
serde_json::de::Deserializer::parse_value
serde_json::de::Deserializer::parse_object
core::str::from_utf8
```

It turns out over 45% of our CPU time was spent parsing JSON from incoming payloads.

Every incoming request was already parsed with `serde_json::from_str()`, and we assumed "hey, it's fast enough." We were wrong.
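If you want to sanity-check a suspected hotspot before reaching for `perf`, a stdlib-only micro-timer goes a long way. A minimal sketch; the byte-counting closure below is a stand-in for the real `serde_json::from_str` call, and the payload is invented:

```rust
use std::time::{Duration, Instant};

// Runs `f` `iters` times and reports how long the whole loop took.
fn time_it<F: FnMut()>(iters: u32, mut f: F) -> (Duration, u32) {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    (start.elapsed(), iters)
}

fn main() {
    // Stand-in workload: in the real service this closure would wrap the
    // actual deserializer on a representative production payload.
    let payload = r#"{"service":"ingest","level":"info","msg":"ok"}"#;
    let (elapsed, calls) = time_it(10_000, || {
        let quotes = payload.bytes().filter(|b| *b == b'"').count();
        std::hint::black_box(quotes); // keep the work from being optimized away
    });
    println!("{calls} calls took {elapsed:?}");
}
```

It won't give you a flamegraph, but it will tell you quickly whether a single code path is worth profiling properly.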
The Fix: Replace serde_json with simd-json
We swapped in `simd-json`, a Rust crate that uses SIMD instructions for ultra-fast JSON parsing.
Here's the diff:
```rust
// Before:
let event: Event = serde_json::from_str(payload)?;

// After (simd-json parses in place, so it takes a mutable byte buffer):
let event: Event = simd_json::from_slice(&mut payload_bytes)?;
```

Easy win. But one catch: `simd-json` mutates its input while parsing, so we had to tweak how we handled body buffers:
```rust
// simd-json needs an owned, mutable byte buffer it can scribble on.
// Copying the raw request bytes avoids any unsafe; simd-json validates
// UTF-8 itself, so no lossy String conversion is needed either.
let mut payload_bytes: Vec<u8> = body.to_vec();
```

Benchmarks Before & After
We re-ran the same workload after replacing serde_json.
| Metric | Before (`serde_json`) | After (`simd-json`) |
| ----------- | --------------------- | ------------------- |
| P50 Latency | 2.1ms | 0.8ms |
| P99 Latency | 9.4ms | 3.3ms |
| CPU Usage | ~88% | ~41% |
| Throughput | 38K req/s | 62K req/s |

⚡ A ~62% drop in P50 latency (~65% at P99) and a ~63% jump in throughput.
All from changing one function.
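For the curious, those percentages fall straight out of the table. A throwaway calculation, not part of the service:

```rust
// Percent change between the before/after columns in the table above.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    println!("P50 latency: {:+.0}%", pct_change(2.1, 0.8));
    println!("P99 latency: {:+.0}%", pct_change(9.4, 3.3));
    println!("Throughput:  {:+.0}%", pct_change(38_000.0, 62_000.0));
}
```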
Architecture Overview
```
      ┌────────────┐
      │   Client   │
      └────┬───────┘
           │
 ┌─────────▼─────────┐
 │  Rust HTTP Server │
 │  (Actix + Tokio)  │
 └─────────┬─────────┘
           │
 ┌─────────▼────────┐
 │  Parse JSON      │  <-- was the bottleneck!
 │  Queue to buffer │
 └─────────┬────────┘
           │
  ┌────────▼────────┐
  │   Kafka Sink    │
  └─────────────────┘
```

Why This Happens in Rust More Than You Think
Rust's abstractions are famously zero-cost — until they're not.
Here's what caught us off guard:
- `serde_json` is great, but it's not SIMD-optimized.
- Parsing JSON inside the hot request path without caching or batching = bad idea.
- We assumed Rust's safety meant it was "fast enough." But performance doesn't come for free — you have to measure.
Real Lessons Learned
- Don't trust intuition — profile with flamegraphs.
- `perf` is still the most underrated profiling tool for Rust in production.
- Safe code can still be slow. Rust lets you control the cost, but you must observe it.
- We now default to `simd-json` for any high-throughput service.
- Make profiling part of your CI or release pipeline.
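One way to make that last point concrete is a coarse perf gate: a check that fails the build if the hot path blows a generous time budget. A stdlib-only sketch; `hot_path`, the payload, and the budget are all placeholders, and in a real pipeline the workload would be the actual deserializer on a recorded production payload:

```rust
use std::time::{Duration, Instant};

// Placeholder hot path; swap in the real parse call for an actual gate.
fn hot_path(payload: &str) -> usize {
    payload.bytes().filter(|b| *b == b':').count()
}

// Returns true if `iters` runs of the hot path exceed the time budget.
fn exceeds_budget(iters: u32, budget: Duration) -> bool {
    let payload = r#"{"service":"ingest","status":"ok"}"#;
    let start = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(hot_path(payload));
    }
    start.elapsed() > budget
}

fn main() {
    // A deliberately generous budget, so the gate only trips on big regressions,
    // not on CI machine noise.
    if exceeds_budget(10_000, Duration::from_secs(1)) {
        eprintln!("perf regression: hot path over budget");
        std::process::exit(1);
    }
    println!("perf gate passed");
}
```

It's no substitute for flamegraphs, but it catches the "someone added JSON parsing to the hot path again" class of regression before it ships.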
Final Thoughts
Sometimes the biggest performance win doesn't require rewriting everything.
It just takes one command:
```bash
perf record ...
```

And the courage to look at what your code is really doing under load.
Want the Code?
Here's a minimal version of what we did:
```rust
use simd_json::OwnedValue;
use simd_json::prelude::*;

// Parses a mutable byte buffer into simd-json's owned DOM value.
fn parse_event(payload: &mut [u8]) -> Result<OwnedValue, simd_json::Error> {
    simd_json::to_owned_value(payload)
}
```

Want something more complete? Hit me up and I'll open-source the perf harness we used.