
As engineers, we like to believe abstractions protect us from low-level details. Time APIs, metrics libraries, monitoring stacks — they all look simple. You call a function, emit a value, and trust Grafana or Prometheus to do the rest. But production has a way of reminding us that abstractions leak, especially when it comes to time.

"It's Just Time in Milliseconds — What Could Go Wrong?"

Most developers default to System.currentTimeMillis() when measuring how long something takes.

It feels reasonable:

  • It returns time
  • It's easy to understand
  • It's available everywhere
  • It doesn't look dangerous

So code like this sneaks into services, interceptors, and middleware:

long start = System.currentTimeMillis();
// do some work
long duration = System.currentTimeMillis() - start;

This code compiles. It runs. It even works most of the time. Which is exactly why it's dangerous.
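To see the failure mode concretely, here is a minimal sketch with an injectable time source standing in for the system clock. The measure() helper, the LongSupplier clock, and the 5-second backward step are all hypothetical, purely to simulate what an NTP correction can do between two reads:

```java
import java.util.function.LongSupplier;

public class WallClockDemo {
    // Same shape as the snippet above, but the clock is injectable so we can
    // simulate a backward step deterministically.
    static long measure(LongSupplier clock, Runnable work) {
        long start = clock.getAsLong();
        work.run();
        return clock.getAsLong() - start; // can go negative if the clock steps back
    }

    public static void main(String[] args) {
        long[] fakeNow = {1_700_000_000_000L};
        // Each read returns "now", then steps the clock back 5 s, mimicking an
        // NTP correction landing between the two reads.
        LongSupplier steppingClock = () -> {
            long t = fakeNow[0];
            fakeNow[0] -= 5_000;
            return t;
        };
        long duration = measure(steppingClock, () -> {});
        System.out.println(duration + " ms"); // a negative "duration"
    }
}
```

With the real System.currentTimeMillis() the step is rare and invisible in tests, which is exactly why this bug survives code review.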

When Dashboards Start Showing the Impossible

At some point — usually not right after a deploy — dashboards start acting strange.

  • Latency graphs show negative values
  • Rates dip below zero
  • Histograms look distorted for a few minutes
  • Alerts fire or fail without an obvious reason

Grafana is blamed. Prometheus queries are rewritten. Scrape intervals are questioned. Very rarely does anyone look at the time source used to produce the metric.

The Hidden Assumption in currentTimeMillis()

The code above assumes one thing: that wall-clock time always moves forward. That assumption is false. currentTimeMillis() represents system clock time, not elapsed time. The system clock can change while your code is running.

So this perfectly valid-looking code:

long start = System.currentTimeMillis();
doWork();
long duration = System.currentTimeMillis() - start;

Can produce:

  • A very large spike
  • Or a negative value

Not because your logic is wrong — but because time moved underneath you.

Why the Clock Moves (Briefly)

This happens more often than people think:

  • NTP corrections adjust the clock forward or backward to fix drift.
  • Manual clock changes happen during maintenance, snapshot restores, or configuration fixes.
  • VM and container time adjustments occur during pauses, resumes, or host migrations.

None of these events notifies your application. They just change what "now" means.

nanoTime() Is Not a Better Clock — It's Not a Clock at All

This is where most explanations go wrong and, frankly, where a lot of engineers pick up the wrong mental model.

System.nanoTime() is often described as a more precise version of currentTimeMillis(). Same thing, just finer resolution. That framing is convenient — and completely incorrect. nanoTime() does not represent time in the way humans, logs, or monitoring systems understand it. It doesn't tell you when something happened, and it was never meant to.

What nanoTime() actually gives you is a monotonic counter maintained by the operating system. It only moves forward. It is not affected by NTP corrections, manual clock changes, VM pauses, or hypervisor time adjustments. It has no epoch, no time zone, and no relationship to wall-clock time. The starting point is intentionally undefined, which is why a single nanoTime() value is meaningless on its own.

This design is not a limitation — it is the feature. By completely disconnecting elapsed time from the system clock, nanoTime() gives you something that wall-clock time never promised: stability. When you subtract two nanoTime() values, the result represents how much time actually passed, regardless of what the system clock did in between. No backward jumps, no negative durations, no surprises showing up in your metrics.
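One practical consequence of "only differences are meaningful": the nanoTime() Javadoc tells you to compare two readings by subtracting and checking the sign, not by comparing the raw values, so the comparison stays correct even if the counter wraps around Long.MAX_VALUE. A small deadline sketch (the Deadline class itself is illustrative, not a standard API):

```java
public class Deadline {
    private final long deadlineNanos;

    public Deadline(long timeoutNanos) {
        this.deadlineNanos = System.nanoTime() + timeoutNanos;
    }

    public boolean expired() {
        // Correct: subtract first, then compare the difference to zero.
        // Wrong:   System.nanoTime() >= deadlineNanos  (breaks on overflow).
        return System.nanoTime() - deadlineNanos >= 0;
    }
}
```

The same subtraction discipline applies everywhere: a single nanoTime() value should never be stored, logged, or compared on its own.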

What nanoTime() Is Designed For

nanoTime() exists for one reason: measuring elapsed time safely.

This is the correct usage:

long start = System.nanoTime();
doWork();
long durationNs = System.nanoTime() - start;

Here's what you get:

  • No negative values
  • No clock jumps
  • No surprises during VM pauses
  • No broken histograms

This is what monitoring systems expect.
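Since nanoTime() returns nanoseconds and most metrics pipelines expect milliseconds or seconds, it helps to do the conversion with TimeUnit rather than hand-rolled arithmetic. A hypothetical helper (not from the article) that keeps the pattern in one place so callers can't mix clocks by accident:

```java
import java.util.concurrent.TimeUnit;

public class Stopwatch {
    // Runs the work and returns the elapsed time in milliseconds,
    // measured with the monotonic clock.
    public static long timeMillis(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```

Usage is one line — long ms = Stopwatch.timeMillis(() -> handleRequest()); — and the result is guaranteed non-negative.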

How This Impacts Metrics and Monitoring

Monitoring systems like Prometheus assume that:

  • Durations are non-negative
  • Time flows forward
  • Samples belong to consistent windows

When wall-clock time breaks these assumptions:

  • Histograms get corrupted
  • Rate calculations briefly explode or collapse
  • Dashboards show values that should be physically impossible

The monitoring stack isn't broken. It's faithfully reporting bad math.
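The fix at the instrumentation layer is to feed histograms only monotonic durations, and to record in a finally block so failures are measured too. A minimal sketch, assuming a hypothetical Histogram interface — any real metrics library (Micrometer, Prometheus clients) has an equivalent record method:

```java
// Illustrative stand-in for a real metrics histogram.
interface Histogram {
    void record(long valueMillis);
}

public class LatencyInstrumentation {
    public static void timed(Histogram latency, Runnable work) {
        long start = System.nanoTime(); // monotonic: immune to clock steps
        try {
            work.run();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            latency.record(elapsedMs); // always >= 0, so rates and buckets stay sane
        }
    }
}
```

With this shape, a clock step during the request changes nothing: the histogram only ever sees how much time actually passed.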

This Bug Is Dangerous Because It's Silent

There is no stack trace for this. No warning. No test failure. No obvious reproduction.

It often appears:

  • After long uptimes
  • In virtualized environments
  • Under load
  • Only in production

And it usually affects observability code — the worst place for bugs to hide.

The Only Rule That Actually Matters

Forget precision. Forget units. Remember this instead:

  • Wall-clock time is for timestamps.
  • Monotonic time is for durations.

Never mix them. Never assume the system clock is stable. If your metric answers "how long did this take?", wall-clock time has no business being involved.
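In Java terms, the rule maps cleanly onto two APIs: Instant.now() (or currentTimeMillis()) for timestamps, System.nanoTime() for durations. A sketch showing both used for what they are actually for:

```java
import java.time.Instant;

public class TimeRoles {
    // Wall-clock: records WHEN something happened (logs, traces, audit records).
    static Instant eventTimestamp() {
        return Instant.now();
    }

    // Monotonic: measures HOW LONG something took (metrics, timeouts).
    static long elapsedNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Instant when = eventTimestamp();
        long tookNs = elapsedNanos(() -> { /* do some work */ });
        System.out.println("event at " + when + ", took " + tookNs + " ns");
    }
}
```

Each clock answers exactly one question, and neither can safely answer the other's.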

Final Thought

Most monitoring failures aren't caused by complex distributed systems issues. They're caused by small misunderstandings that survive code review because "it looks fine". Time is one of those misunderstandings. If you want reliable dashboards, reliable alerts, and reliable incident response, you need to be intentional about where your time comes from. Because the system clock does not care about your assumptions — and your dashboards will happily reflect that.