It wasn't a bug. It wasn't even bad code. It was a silent bottleneck, hiding behind abstractions — and it cost us milliseconds, thousands of times per second.

Context: A Rust Service That Was Already "Fast Enough"

We built an internal Rust service to power a high-throughput ingestion pipeline for our observability system — logs, metrics, and traces flying in 24/7 from production apps.

The stack was solid:

  • Rust + Actix-web for the server
  • tokio under the hood for async IO
  • serde_json for parsing events
  • Internal in-memory queue using crossbeam
  • Batches flushed to Kafka every 500ms

Latency wasn't our biggest concern — until it was. On paper, we were doing fine: P99 latencies were around 9ms, P50 was below 2ms.

But we noticed something odd: CPU usage would spike under load, but throughput plateaued. When we scaled horizontally, it didn't scale linearly.

So, naturally, we profiled it.

The Problem: Fast Code, Hidden Hotspot

We'd already done the usual:

  • cargo bench to test bottlenecks ✅
  • tokio-console to trace slow spans ✅
  • tracing logs for task hangs ✅

Nothing screamed "fix me."

But our lead suggested something simple — and a bit old-school:

"Just run perf on it. Let's look at the flamegraph."

The Magic Perf Command

Here's exactly what we ran in production (sampling at 99 Hz, so the overhead is negligible):

sudo perf record -F 99 --call-graph dwarf -p $(pidof our-binary-name)

Then:

sudo perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg

💡 We used inferno, not Brendan Gregg's flamegraph tools. It's faster, written in Rust (of course), and integrates beautifully.
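One prerequisite worth calling out: `--call-graph dwarf` needs debug info in the binary, and Cargo strips it from release builds by default. Keeping it is a one-line change in Cargo.toml:

```toml
# Cargo.toml — keep DWARF debug info in release builds
# so perf can unwind call stacks
[profile.release]
debug = true
```

The binary gets larger, but runtime performance is unaffected — the debug sections just sit on disk until a profiler reads them.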

The Surprise: serde_json::from_str Was the Culprit

Here's what the flamegraph showed us — a wide, fat chunk near the top:

serde_json::de::Deserializer::parse_value
serde_json::de::Deserializer::parse_object
core::str::from_utf8

Turns out, over 45% of our CPU time was spent parsing JSON from incoming payloads.

Every incoming request was already parsed with serde_json::from_str(), and we assumed "hey, it's fast enough." We were wrong.

The Fix: Replace serde_json with simd-json

We swapped in simd-json — a Rust library that uses SIMD instructions for ultra-fast JSON parsing.

Here's the diff:

// Before:
let event: Event = serde_json::from_str(payload)?;

// After (simd-json mutates its input, so it takes a mutable byte slice):
let event: Event = simd_json::from_slice(&mut payload_bytes)?;

Easy win. But one catch: simd-json parses in place and mutates its input, so we couldn't hand it a shared `&str` — it needs an owned, mutable byte buffer:

// copy the request body into an owned, mutable buffer
let mut payload_bytes = body.to_vec();
// simd-json is now free to scribble over this copy while parsing

Benchmarks Before & After

We re-ran the same workload after replacing serde_json.

| Metric      | Before (`serde_json`) | After (`simd-json`) |
| ----------- | --------------------- | ------------------- |
| P50 Latency | 2.1ms                 | 0.8ms               |
| P99 Latency | 9.4ms                 | 3.3ms               |
| CPU Usage   | ~88%                  | ~41%                |
| Throughput  | 38K req/s             | 62K req/s           |

A 60% drop in median latency. A 63% increase in throughput.

All from changing one function.

Architecture Overview

             ┌────────────┐
             │  Client    │
             └────┬───────┘
                  │
        ┌─────────▼─────────┐
        │  Rust HTTP Server │
        │  (Actix + Tokio)  │
        └─────────┬─────────┘
                  │
        ┌─────────▼────────┐
        │ Parse JSON       │  <--  was the bottleneck!
        │ Queue to buffer  │
        └─────────┬────────┘
                  │
         ┌────────▼────────┐
         │ Kafka Sink      │
         └─────────────────┘

Why This Happens in Rust More Than You Think

Rust's abstractions are famously zero-cost — until they're not.

Here's what caught us off guard:

  1. serde_json is great, but it's not SIMD-optimized.
  2. Parsing JSON inside the hot request path without caching or batching = bad idea.
  3. We assumed Rust's safety meant it was "fast enough." But performance doesn't come for free — you have to measure.

Real Lessons Learned

  • Don't trust intuition — profile with flamegraphs.
  • perf is still the most underrated profiling tool for Rust in production.
  • Safe code can still be slow. Rust lets you control the cost, but you must observe it.
  • We now default to simd-json for any high-throughput service.
  • Make profiling part of your CI or release pipeline.

Final Thoughts

Sometimes the biggest performance win doesn't require rewriting everything.

It just takes one command:

perf record ...

And the courage to look at what your code is really doing under load.

Want the Code?

Here's a minimal version of what we did:

use simd_json::OwnedValue;
use simd_json::prelude::*;

// simd-json parses in place, so the payload must be a mutable byte slice
fn parse_event(payload: &mut [u8]) -> Result<OwnedValue, simd_json::Error> {
    simd_json::to_owned_value(payload)
}

Want something more complete? Hit me up — I'll open source the perf harness we used.