A few weeks ago, a friend who runs a small e-commerce store asked me to look at his site. Sales had dropped 40% over the weekend, but nothing seemed wrong — the site loaded, the server was healthy, and his uptime dashboard was all green…

It took me 30 seconds to find the problem. His checkout page looked fine on the surface, but the JavaScript file powering the payment form had been deleted during a Friday deploy. The HTML loaded correctly. The server responded with 200 OK. But the form never rendered. For three days, every customer who tried to buy something saw a blank checkout…

His uptime monitor never noticed.

That experience got me thinking: how common is this? How many production websites are technically "up" but functionally broken — with nobody knowing?

So I ran an experiment.

The Audit

I picked 50 production websites, more or less at random, across several categories: SaaS applications, agency client sites, e-commerce stores, and WordPress marketing pages. Some were small businesses; others were established companies with dev teams.

I didn't run uptime checks. I ran experience checks. For each site, I validated whether the things a real visitor depends on were actually working:

  • Do the JavaScript and CSS files load, or are they returning errors?
  • Is the SSL certificate valid, or are visitors seeing security warnings?
  • Does the URL resolve cleanly, or does it bounce through a chain of redirects?
  • Is the page showing current content, or is a CDN serving a cached version from days ago?
  • Is there actual content on the page, or is the server returning a 200 OK with an empty response?

The results: 23 out of 50 sites had at least one silent failure.

Not a server crash. Not a timeout. A failure invisible to traditional monitoring but very real to the humans trying to use the site.

What I Found

The biggest culprit: SSL certificates (8 sites)

The most common failure was the most avoidable. Eight sites had SSL issues — expired certificates, incomplete certificate chains, or certificates that didn't match the domain after a migration. These sites worked fine in some browsers but threw security warnings in others, particularly on mobile.

The scale of this problem is well-documented. Keyfactor's 2024 report found that 88% of companies experienced an unplanned outage due to an expired certificate in the past two years. Microsoft Teams went down globally for three hours because someone forgot to renew a certificate. It's a solved problem that keeps happening because nobody's watching.

The sneakiest failure: stale CDN content (5 sites)

Five sites were serving outdated content to visitors. The teams had deployed fixes, but CDN edge nodes were still serving the old version. One site was showing content from three deployments ago to European visitors, while American visitors saw the current version. The team — all US-based — had no idea.

A Catchpoint study found that over 70% of "mysterious" production bugs on major e-commerce sites were actually stale cached content. The website was working perfectly. The content delivery network was just delivering yesterday's reality.

The blank page problem: missing JavaScript (4 sites)

Four sites loaded the HTML document successfully but referenced JavaScript files that no longer existed. This is extremely common with modern single-page applications where a new deploy creates new JavaScript bundles with new filenames, but cached HTML still references the old ones.

The result is either a blank white page or a page that looks half-finished — the static HTML shell renders, but nothing interactive works. Netlify's support forums have hundreds of threads about this exact issue.

The infinite loop: redirect chains (3 sites)

Three sites had redirect configurations that conflicted with each other, creating loops. The worst case: a site that redirected from HTTP to HTTPS to www to non-www and back again, creating an infinite loop that no visitor could escape. The uptime monitor followed the first redirect, got a response, and reported "all clear."

The invisible crash: WordPress white screen (2 sites)

Two WordPress sites were returning completely empty pages after plugin updates. The server returned 200 OK — technically a successful response — but the body was empty. Both had been broken for over 48 hours without anyone noticing, because the monitoring dashboard showed green.

The quiet blocker: mixed content (1 site)

One site served images and fonts over HTTP on an HTTPS page. Modern browsers silently block this. No error. No warning. The images just don't show up.
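This one scripts well, because the blocked resources are right there in the page source. Here's a minimal sketch in Python (stdlib only, run against a made-up sample page) that flags http:// subresources a browser would silently block on an HTTPS page:

```python
from html.parser import HTMLParser

# Tags whose src/href a browser fetches as a subresource
# (and blocks if it's http:// on an https page).
RESOURCE_TAGS = {"img", "script", "iframe", "audio", "video", "source", "link"}

class MixedContentFinder(HTMLParser):
    """Collect http:// subresource URLs referenced by a page."""
    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        if tag not in RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith("http://"):
                self.insecure.append((tag, value))

# Hypothetical page served over HTTPS that still references plain-HTTP assets.
sample = """
<html><head>
  <link rel="stylesheet" href="http://cdn.example.com/site.css">
</head><body>
  <img src="http://cdn.example.com/hero.jpg">
  <img src="https://cdn.example.com/logo.png">
</body></html>
"""

finder = MixedContentFinder()
finder.feed(sample)
for tag, url in finder.insecure:
    print(f"mixed content: <{tag}> loads {url}")
```

In practice you'd feed it the HTML you fetch from your own site; anything it prints is a resource your visitors silently aren't getting.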

The Common Thread

Every single one of these 23 failures passed uptime monitoring.

The server responded. The status code said 200. The dashboard said green. But the experience was broken.

This is the fundamental gap: uptime monitoring answers "is the server alive?" — a useful question, but the bare minimum. It tells you nothing about whether the application actually works for a real person.

According to industry data, 85% of website bugs are first discovered by users, not by monitoring tools. Not because monitoring doesn't exist — but because most monitoring checks the infrastructure when the failures are happening at the application layer.

The Difference Between "Up" and "Working"

After running this audit, I started thinking about what monitoring should look like. The checks that actually caught these problems were fundamentally different from ping tests:

SSL validation that checks the full certificate chain, expiry date, and domain match — not just whether a connection succeeds.

Asset integrity checks that verify every JavaScript and CSS file referenced in the HTML actually loads with the correct response.

Content fingerprinting that hashes the response body to detect when a CDN starts serving stale content after a deploy.

Redirect chain tracing that follows every hop to confirm clean resolution without loops.

Response body validation that checks whether the server is returning actual content or an empty page dressed up as a 200 OK.

Multi-region checks that catch failures specific to certain CDN edge locations, because a site can be broken in Frankfurt but perfectly fine in Virginia.

This is actually the problem I've been building Sitewatch to solve — running these checks continuously from multiple regions, alerting when something breaks at the experience layer, not just the infrastructure layer. Because that checkout bug that started this whole experiment? It could have been caught in minutes instead of three days.

What You Can Do Right Now

You don't need a tool to start checking. Open your terminal and run these against any production site:

Check your SSL: Use openssl s_client to verify the certificate expiry and chain. If the cert has expired or the chain is incomplete, mobile visitors are seeing security warnings.
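If you'd rather script it than read openssl output, Python's ssl module can run the same check. A sketch — the hostname is whatever site you want to test, and the last line is an offline sanity check of the date parsing, no network required:

```python
import socket
import ssl
from datetime import datetime, timezone

def cert_days_left(host: str, port: int = 443) -> int:
    """Connect to `host`, verify the full chain and hostname, and return
    the number of days until the certificate expires. Raises
    ssl.SSLCertVerificationError on chain or hostname problems."""
    ctx = ssl.create_default_context()  # verifies chain + hostname by default
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days

# Offline sanity check of the notAfter date parsing (no network needed):
print(ssl.cert_time_to_seconds("Jun  1 12:00:00 2030 GMT") > 0)  # → True
```

A handshake that raises instead of returning a number is itself the finding: that's the security warning your visitors see.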

Validate your assets: Fetch your page source and check that every script and stylesheet it references actually returns a 200 response. If any return 404, your page is broken.
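A sketch of that check using only the stdlib HTML parser — the page source here is a made-up example; in practice you'd fetch your real page, then request each collected URL and confirm it returns 200:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AssetCollector(HTMLParser):
    """Gather every script src and stylesheet href a page references."""
    def __init__(self, base):
        super().__init__()
        self.base, self.assets = base, []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("src"):
            self.assets.append(urljoin(self.base, a["src"]))
        elif tag == "link" and a.get("rel") == "stylesheet" and a.get("href"):
            self.assets.append(urljoin(self.base, a["href"]))

# Hypothetical page source; fetch your real one with urllib.request.urlopen.
page = """
<link rel="stylesheet" href="/assets/app.css">
<script src="/assets/app.3f2a91.js"></script>
"""
c = AssetCollector("https://example.com/")
c.feed(page)
print(c.assets)
```

From there, one request per asset tells you whether anything 404s — which is exactly the stale-bundle failure from the audit.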

Trace your redirects: Follow the full redirect chain and count the hops. More than two is suspicious. If you see the same URL twice, you have a loop.
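The loop check is just "remember every URL you've already visited." A sketch with a simulated redirect table standing in for real Location headers (the domain names are hypothetical):

```python
def trace_redirects(start, fetch, max_hops=10):
    """Follow Location targets, recording each hop; stop on a repeat (loop).
    `fetch(url)` returns the redirect target, or None for a final 200."""
    seen, chain, url = set(), [start], start
    while url is not None and len(chain) <= max_hops:
        seen.add(url)
        url = fetch(url)
        if url in seen:
            chain.append(url)
            return chain, True   # same URL twice: a loop
        if url is not None:
            chain.append(url)
    return chain, False

# Simulated server config: HTTP -> HTTPS -> www -> back to HTTPS. A loop.
redirects = {
    "http://example.com/": "https://example.com/",
    "https://example.com/": "https://www.example.com/",
    "https://www.example.com/": "https://example.com/",
}
chain, looped = trace_redirects("http://example.com/", redirects.get)
print(looped, len(chain) - 1)  # → True 3 (loop found after three hops)
```

Swapping the toy `fetch` for a real request function (one that reads the Location header without following it) turns this into the actual check.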

Fingerprint your content: Hash your homepage content before and after a deploy. If the hash doesn't change after a deploy that should have changed the page, your CDN is serving stale content.
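The fingerprint can be as simple as a truncated SHA-256 of the body. A sketch, with made-up bodies standing in for what your origin and a CDN edge actually return:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    """Short, stable hash of a response body for before/after comparison."""
    return hashlib.sha256(body).hexdigest()[:12]

# Hypothetical bodies: what the origin serves vs. what an edge node returns.
origin_body = b"<html><body>v2.1 release</body></html>"
edge_body   = b"<html><body>v2.0 release</body></html>"

# Differing fingerprints mean the edge is not serving the origin's
# current content — the stale-CDN failure from the audit.
print(fingerprint(origin_body) == fingerprint(edge_body))  # → False
```

Store the fingerprint at deploy time, then compare against fetches from different regions; a mismatch is your Frankfurt-vs-Virginia problem made visible.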

Measure your response: Check the size of the response body. If your homepage HTML is under 500 bytes, something is almost certainly returning an empty page.
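The empty-body check is a one-liner worth keeping around. A sketch — the 500-byte threshold is the same rough heuristic as above, not a standard:

```python
def looks_empty(body: bytes, min_bytes: int = 500) -> bool:
    """Flag a response body suspiciously small for a real page.
    Whitespace is stripped first, so a page of blank lines still counts."""
    return len(body.strip()) < min_bytes

print(looks_empty(b""))                                  # → True (the white screen)
print(looks_empty(b"<html>" + b"x" * 600 + b"</html>"))  # → False
```

This is the check that would have caught both WordPress white screens in the audit: 200 OK, green dashboard, zero bytes of content.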

If any of those checks fail, you just found a bug that your uptime monitoring has been happily ignoring.

23 out of 50. Nearly half.

If you're relying on a green status badge to tell you your site works, you might want to actually check.

I'm collecting more examples of these silent failures. If your monitoring ever missed something that your users caught first, I'd love to hear about it.