It wasn't a bug. It wasn't even bad code. It was a silent bottleneck, hiding behind abstractions — and it cost us milliseconds, thousands of times per second.
Context: A Rust Service That Was Already "Fast Enough"
We built an internal Rust service to power a high-throughput ingestion pipeline for our metrics system — think: logs, metrics, and traces flying in 24/7 from production apps.
The stack was solid:
- Rust + Actix-web for the server
- `tokio` under the hood for async IO
- `serde_json` for parsing events
- Internal in-memory queue using `crossbeam`
- Batches flushed to Kafka every 500ms
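The 500ms flush loop can be sketched with nothing but the standard library. This is a hypothetical stand-in, not our production code: the real service uses `crossbeam` and a Kafka producer, and `flush_to_sink` here is a placeholder that just prints.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real Kafka producer.
fn flush_to_sink(batch: &[String]) {
    println!("flushed {} events", batch.len());
}

// Collects events and flushes every `flush_every`, plus once on shutdown.
// Returns the total number of events flushed.
fn run_batcher(rx: mpsc::Receiver<String>, flush_every: Duration) -> usize {
    let mut batch = Vec::new();
    let mut total = 0;
    let mut deadline = Instant::now() + flush_every;
    loop {
        // Remaining time until the next scheduled flush (zero if overdue).
        let timeout = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(timeout) {
            Ok(event) => batch.push(event),
            Err(mpsc::RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    flush_to_sink(&batch);
                    total += batch.len();
                    batch.clear();
                }
                deadline = Instant::now() + flush_every;
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                // Senders are gone: flush whatever is buffered and stop.
                if !batch.is_empty() {
                    flush_to_sink(&batch);
                    total += batch.len();
                }
                return total;
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let batcher = thread::spawn(move || run_batcher(rx, Duration::from_millis(500)));
    for i in 0..3 {
        tx.send(format!("event-{i}")).unwrap();
    }
    drop(tx); // closing the channel triggers the final flush
    println!("total flushed: {}", batcher.join().unwrap());
}
```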
Latency wasn't our biggest concern — until it was. On paper, we were doing fine: P99 latencies were around 9ms, P50 was below 2ms.
But we noticed something odd: CPU usage would spike under load, but throughput plateaued. When we scaled horizontally, it didn't scale linearly.
So, naturally, we profiled it.
The Problem: Fast Code, Hidden Hotspot
We'd already done the usual:
- `cargo bench` to test bottlenecks ✅
- `tokio-console` to trace slow spans ✅
- `tracing` logs for task hangs ✅
Nothing screamed "fix me."
But our lead suggested something simple, and a bit old-school:

> "Just run `perf` on it. Let's look at the flamegraph."
The Magic Perf Command
Here's exactly what we ran in production (don't worry, safe read-only mode):

```bash
sudo perf record -F 99 --call-graph dwarf -p $(pidof our-binary-name)
```

Then:

```bash
sudo perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
```

💡 We used `inferno`, not Brendan Gregg's original flamegraph scripts. It's faster, written in Rust (of course), and integrates beautifully.
The Surprise: serde_json::from_str Was the Culprit
Here's what the flamegraph showed us: a wide, fat chunk near the top.

```
serde_json::de::Deserializer::parse_value
serde_json::de::Deserializer::parse_object
core::str::from_utf8
```

It turns out over 45% of our CPU time was spent parsing JSON from incoming payloads.

Every incoming request was already parsed with `serde_json::from_str()`, and we assumed "hey, it's fast enough." We were wrong.
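If you want to sanity-check a suspected hotspot before reaching for `perf`, a stdlib-only micro-timer goes a long way. A minimal sketch; the byte-counting closure below is a stand-in for the real `serde_json::from_str` call, and the payload is invented:

```rust
use std::time::{Duration, Instant};

// Runs `f` `iters` times and reports how long the whole loop took.
fn time_it<F: FnMut()>(iters: u32, mut f: F) -> (Duration, u32) {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    (start.elapsed(), iters)
}

fn main() {
    // Stand-in workload: in the real service this closure would wrap the
    // actual deserializer on a representative production payload.
    let payload = r#"{"service":"ingest","level":"info","msg":"ok"}"#;
    let (elapsed, calls) = time_it(10_000, || {
        let quotes = payload.bytes().filter(|b| *b == b'"').count();
        std::hint::black_box(quotes); // keep the work from being optimized away
    });
    println!("{calls} calls took {elapsed:?}");
}
```

It won't give you a flamegraph, but it will tell you quickly whether a single code path is worth profiling properly.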
The Fix: Replace serde_json with simd-json
We swapped in `simd-json`, a Rust crate that uses SIMD instructions for ultra-fast JSON parsing.
Here's the diff:
```rust
// Before:
let event: Event = serde_json::from_str(payload)?;

// After (simd-json parses in place, so it takes a mutable byte buffer):
let event: Event = simd_json::from_slice(&mut payload_bytes)?;
```

Easy win. But one catch: `simd-json` mutates its input while parsing, so we had to tweak how we handled body buffers:
```rust
// simd-json needs an owned, mutable byte buffer it can scribble on.
// Copying the raw request bytes avoids any unsafe; simd-json validates
// UTF-8 itself, so no lossy String conversion is needed either.
let mut payload_bytes: Vec<u8> = body.to_vec();
```

Benchmarks Before & After
We re-ran the same workload after replacing serde_json.
| Metric | Before (`serde_json`) | After (`simd-json`) |
| ----------- | --------------------- | ------------------- |
| P50 Latency | 2.1ms | 0.8ms |
| P99 Latency | 9.4ms | 3.3ms |
| CPU Usage | ~88% | ~41% |
| Throughput | 38K req/s | 62K req/s |

⚡ A ~62% drop in P50 latency (~65% at P99) and a ~63% jump in throughput.
All from changing one function.
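For the curious, those percentages fall straight out of the table. A throwaway calculation, not part of the service:

```rust
// Percent change between the before/after columns in the table above.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    println!("P50 latency: {:+.0}%", pct_change(2.1, 0.8));
    println!("P99 latency: {:+.0}%", pct_change(9.4, 3.3));
    println!("Throughput:  {:+.0}%", pct_change(38_000.0, 62_000.0));
}
```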
Architecture Overview
```
      ┌────────────┐
      │   Client   │
      └────┬───────┘
           │
 ┌─────────▼─────────┐
 │  Rust HTTP Server │
 │  (Actix + Tokio)  │
 └─────────┬─────────┘
           │
 ┌─────────▼────────┐
 │  Parse JSON      │  <-- was the bottleneck!
 │  Queue to buffer │
 └─────────┬────────┘
           │
  ┌────────▼────────┐
  │   Kafka Sink    │
  └─────────────────┘
```

Why This Happens in Rust More Than You Think
Rust's abstractions are famously zero-cost — until they're not.
Here's what caught us off guard:
- `serde_json` is great, but it's not SIMD-optimized.
- Parsing JSON inside the hot request path without caching or batching = bad idea.
- We assumed Rust's safety meant it was "fast enough." But performance doesn't come for free — you have to measure.
Real Lessons Learned
- Don't trust intuition — profile with flamegraphs.
- `perf` is still the most underrated profiling tool for Rust in production.
- Safe code can still be slow. Rust lets you control the cost, but you must observe it.
- We now default to `simd-json` for any high-throughput service.
- Make profiling part of your CI or release pipeline.
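One way to make that last point concrete is a coarse perf gate: a check that fails the build if the hot path blows a generous time budget. A stdlib-only sketch; `hot_path`, the payload, and the budget are all placeholders, and in a real pipeline the workload would be the actual deserializer on a recorded production payload:

```rust
use std::time::{Duration, Instant};

// Placeholder hot path; swap in the real parse call for an actual gate.
fn hot_path(payload: &str) -> usize {
    payload.bytes().filter(|b| *b == b':').count()
}

// Returns true if `iters` runs of the hot path exceed the time budget.
fn exceeds_budget(iters: u32, budget: Duration) -> bool {
    let payload = r#"{"service":"ingest","status":"ok"}"#;
    let start = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(hot_path(payload));
    }
    start.elapsed() > budget
}

fn main() {
    // A deliberately generous budget, so the gate only trips on big regressions,
    // not on CI machine noise.
    if exceeds_budget(10_000, Duration::from_secs(1)) {
        eprintln!("perf regression: hot path over budget");
        std::process::exit(1);
    }
    println!("perf gate passed");
}
```

It's no substitute for flamegraphs, but it catches the "someone added JSON parsing to the hot path again" class of regression before it ships.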
Final Thoughts
Sometimes the biggest performance win doesn't require rewriting everything.
It just takes one command:
```bash
perf record ...
```

And the courage to look at what your code is really doing under load.
Want the Code?
Here's a minimal version of what we did:
```rust
use simd_json::OwnedValue;
use simd_json::prelude::*;

// Parses a mutable byte buffer into simd-json's owned DOM value.
fn parse_event(payload: &mut [u8]) -> Result<OwnedValue, simd_json::Error> {
    simd_json::to_owned_value(payload)
}
```

Want something more complete? Hit me up and I'll open-source the perf harness we used.