Site Deduplication and Deep Scan Convergence in Web Security Scanning

When your scanner runs the same target five times, the problem isn't the scanner — it's the entry-point judgment.

Background

A common pain point in internet asset discovery and security scanning is that scan tasks scale linearly with asset volume, yet a significant portion of them are duplicates. A single application may be exposed through multiple ports, subdomains, or even different domain names, and the scanner treats every entry point equally, running the full deep-scan pipeline on each one.

This article summarizes a battle-tested strategy for site deduplication and deep scan convergence. The core idea: insert a judgment layer before deep scanning to identify which entry points are duplicates or low-value, and only run deep scans on genuinely independent sites.

1. The Problem

In a typical asset discovery pipeline, the workflow looks like this:

Discover IP addresses, domain names, and their open ports
Verify web reachability for each entry point
Persist reachable entry points as site results
Feed all site results into deep scanning (URL crawling, directory brute-forcing, vulnerability scanning, etc.)

The key characteristic of this pipeline: it only checks "is it reachable?", not "is it the same site?" As long as an entry point returns an HTTP response, it's considered reachable and gets persisted as a site result for deep scanning.

This leads to several categories of duplication:

| Scenario | Example |
| --- | --- |
| Same domain, different ports | a.example.com:443 and a.example.com:8443 point to the same application |
| Same IP, different subdomains | a.example.com and b.example.com both resolve to 1.2.3.4:443 |
| Domain entry vs. direct IP access | www.a.com:80 and 1.1.1.1:80 point to the same site |
| Different domains, same resolved IP | a.example.com and b.other.com resolve to the same IP with identical content |

Once these entry points are identified as independent sites, the deep scan treats each as a separate target — even when they return the same page content.
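
To make the duplication categories concrete, here is a minimal sketch of how entry points could be bucketed into groups that might be duplicates. The `EntryPoint` type and `candidate_groups` function are hypothetical names for illustration, not part of any particular scanner; grouping only nominates candidates and proves nothing about equivalence.

```python
from collections import defaultdict
from typing import NamedTuple


class EntryPoint(NamedTuple):
    host: str  # domain name (or literal IP) used to reach the site
    ip: str    # resolved IP address
    port: int


def candidate_groups(entries):
    """Group entry points that *might* be duplicates; grouping is not proof."""
    by_host = defaultdict(list)     # same domain, different ports
    by_ip_port = defaultdict(list)  # different hosts, same IP and port
    for e in entries:
        by_host[e.host].append(e)
        by_ip_port[(e.ip, e.port)].append(e)

    groups = {}
    for host, members in by_host.items():
        if len(members) > 1:
            groups[("same_host", host)] = members
    for (ip, port), members in by_ip_port.items():
        # Only interesting when more than one distinct host shares the socket.
        if len({m.host for m in members}) > 1:
            groups[("same_ip_port", ip, port)] = members
    return groups
```

Each resulting group is only an input to later verification; whether its members are actually merged depends on the evidence collected for them.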

2. Impact Analysis

The direct consequences of duplicate sites entering deep scanning:

Amplified scan costs: Task duration, resource consumption, and queue pressure multiply
Noise pages rescanned repeatedly: Error pages, WAF block pages, and default pages generate massive amounts of dirty data
Merge risk: Naive fixes carry their own cost, since overly aggressive deduplication logic may merge distinct business entry points that should remain separate

From an asset inventory perspective, recording "which entry points ever responded" is valuable. But from a deep-scan execution perspective, the more critical question is: "which entry points are worth scanning independently?"

Core contradiction: Without a unified deduplication and convergence mechanism, valid assets, duplicate entry points, and noise pages all flow into deep scanning without being stratified.

3. Design Principles

Three fundamental principles guide the convergence strategy:

Asset inventory unchanged: All entry points are preserved for asset visibility, auditing, and traceability
Execution scope converges: What's optimized is "whether to deep scan again", not "whether to keep this asset"
Conservative convergence: Automatic convergence only happens when duplication is clearly proven or noise is clearly identified; otherwise, keep scanning

In short: not "discover fewer assets", but "run fewer wasted deep scans."

Convergence Boundaries

To control false-positive risk, convergence is only permitted when at least one of these conditions is met:

Standard redirect: HTTP-to-HTTPS 3xx redirect with a clearly resolvable final destination
Clear noise signature: Response content is a generic gateway error page, WAF block page, CDN default page, etc.
Equivalence verified: Multi-dimensional comparison of HTTP response data yields consistent results

The following scenarios do not qualify for convergence and default to independent deep scanning:

Page similarity alone — same template may serve different business logic
Same parent domain alone — subdomains may point to entirely different applications
Same resolved IP alone — a single server may host multiple independent services
Direct IP access — even if content matches the domain entry, direct IP access may bypass virtual host routing to a different backend
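
The boundary rules above can be expressed as a small gate. This is a sketch under assumed names (`Evidence`, `may_converge`); the point it illustrates is that weak signals are recorded but never consulted when deciding whether convergence is allowed.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # Strong signals: any one of these permits convergence.
    standard_redirect: bool = False
    noise_signature: bool = False
    equivalence_verified: bool = False
    # Weak signals: kept for reporting, never sufficient on their own.
    similar_page: bool = False
    same_parent_domain: bool = False
    same_resolved_ip: bool = False


def may_converge(ev: Evidence) -> bool:
    """Return True only when at least one strong signal is present."""
    return ev.standard_redirect or ev.noise_signature or ev.equivalence_verified
```

With this shape, an entry point that shares a template, parent domain, and IP with another site still defaults to an independent deep scan unless equivalence has actually been verified.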

4. Three-Layer Progressive Strategy

Before a site enters deep scanning, a unified judgment layer decides which entry points proceed and which skip redundant execution. The strategy has three layers, applied progressively:

Layer 1: Noise Filtering

For generic gateway error pages, WAF block pages, CDN default pages, reverse proxy error pages, and other noise: assets are preserved, but they do not enter deep scanning.

These pages only indicate "the entry point exists and is reachable" — they don't represent real business sites. Deep scanning them yields no value and only produces dirty data.

Additional nuances:

Explicit maintenance pages: Treated as noise directly
Default welcome pages (e.g., Welcome to nginx!, Tomcat default page, Apache test page): Not treated as hard noise, but demoted to low-information sites; if equivalence verification later shows they match another entry point semantically, execution-layer convergence is allowed
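
As a rough illustration of Layer 1, the sketch below sorts a response into hard noise, low-information, or business content. The status codes and regular expressions are illustrative assumptions only; a real deployment would maintain its own signature set.

```python
import re

# Illustrative signatures only (assumed, not an authoritative list).
NOISE_PATTERNS = [re.compile(p, re.I) for p in (
    r"502 bad gateway",
    r"504 gateway time-?out",
    r"access denied",
    r"request blocked",
    r"under maintenance",
)]
LOW_INFO_PATTERNS = [re.compile(p, re.I) for p in (
    r"welcome to nginx!",
    r"apache tomcat",
    r"apache.*test page",
)]


def classify_page(status: int, body: str) -> str:
    """Return 'noise', 'low_information', or 'business'."""
    if status in (502, 504) or any(p.search(body) for p in NOISE_PATTERNS):
        return "noise"            # asset preserved, deep scan skipped
    if any(p.search(body) for p in LOW_INFO_PATTERNS):
        return "low_information"  # demoted, still eligible for equivalence checks
    return "business"
```

Keeping default welcome pages in a separate class from hard noise mirrors the nuance above: they are not dropped outright, only demoted.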

Layer 2: Redirect Merging

For HTTP-to-HTTPS redirects where the redirect chain is clear and the final destination is unambiguous, only the side that actually serves business content proceeds to deep scanning. The redirect source is not rescanned.
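
One possible reading of "clear chain, unambiguous destination" is sketched below. It assumes the pipeline has already recorded each hop as a `(status, location)` pair with absolute URLs, and that a standard redirect means the same hostname upgraded from `http` to `https`; both the function names and that definition are assumptions.

```python
from urllib.parse import urlsplit


def is_http_to_https_upgrade(source_url: str, final_url: str) -> bool:
    """True when the final URL is the same host, upgraded from http to https."""
    src, dst = urlsplit(source_url), urlsplit(final_url)
    return (
        src.scheme == "http"
        and dst.scheme == "https"
        and src.hostname is not None
        and src.hostname == dst.hostname
    )


def scan_target(source_url: str, chain) -> str:
    """Pick the URL to deep scan given the recorded redirect hops.

    chain: list of (status, location) tuples; locations assumed absolute.
    """
    # An empty chain, a non-3xx hop, or a missing Location is not a clear redirect.
    if not chain or any(not (300 <= s < 400) or not loc for s, loc in chain):
        return source_url
    final = chain[-1][1]
    return final if is_http_to_https_upgrade(source_url, final) else source_url
```

A redirect to a different host (for example a login portal) falls through and leaves the source as its own scan target, which keeps the merge conservative.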

Layer 3: Equivalence Verification

For entry points under the same domain with multiple ports, different subdomains under the same parent domain, or different domains sharing the same IP, equivalence verification is performed. Only one representative entry point proceeds to deep scanning; the rest retain their records but skip redundant execution.

Equivalence verification does not rely on a separate fingerprinting capability. Instead, it performs a multi-dimensional comparison based on HTTP response data already collected in the pipeline. Five signals are compared, and all must match to declare two entry points the same site:

| Dimension | Description | Judgment Logic |
| --- | --- | --- |
| Response header signatures | Prioritize Last-Modified and ETag, which are precise resource-level fingerprints; Server, X-Powered-By, etc. serve as auxiliary signals | Multiple entry points returning the same Last-Modified/ETag for the same path indicates they serve the same physical resource |
| Redirect chain | Whether the final destination is identical | e.g., both redirect to the same HTTPS URL |
| Page structure features | Compare <title>, <meta description>, <meta keywords>, <meta generator> | These tags are rendered by backend templates; different business systems almost never produce identical combinations by coincidence |
| Key resource paths | Whether the paths of core JS/CSS resources referenced by the page are identical | Resource paths with hashes or version numbers can directly prove they point to the same code deployment |
| Key path access results | Whether the response status codes and content for high-presence paths like /robots.txt, /favicon.ico, /sitemap.xml are consistent | One entry returning 200 and another returning 403 indicates routing or permission differences |
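
As one way to derive the page-structure and resource-path dimensions from HTML the pipeline has already fetched, here is a minimal sketch built on Python's standard `html.parser`. The class name, and the choice to treat every `<script src>` and stylesheet `<link>` as a "core" resource, are simplifying assumptions.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    META_NAMES = ("description", "keywords", "generator")

    def __init__(self):
        super().__init__()
        self.features = {"title": ""}
        self.resources = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased; values may be None
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (a.get("name") or "").lower() in self.META_NAMES:
            self.features[a["name"].lower()] = a.get("content") or ""
        elif tag == "script" and a.get("src"):
            self.resources.append(a["src"])
        elif tag == "link" and (a.get("rel") or "").lower() == "stylesheet" and a.get("href"):
            self.resources.append(a["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.features["title"] += data.strip()


def extract(html: str):
    """Return (structure features, sorted resource paths) for one page."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return parser.features, sorted(parser.resources)
```

Sorting the resource paths makes the comparison insensitive to tag order, which is a design choice rather than something the comparison rules require.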

Decision logic:

All five dimensions match → Declared as the same site; only one representative proceeds to deep scan
Key path access results differ → Routing or permission differences exist; keep as separate sites
Other dimensions differ → Different backend systems or access surfaces; keep as separate sites
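
The decision logic can be sketched as a strict all-or-nothing comparison. The `SiteSignals` container and its field names are hypothetical; how each field is populated (for example, which path the header signature is taken from) is left to the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSignals:
    header_signature: tuple     # e.g. (ETag, Last-Modified) for the same path
    final_destination: str      # end of the redirect chain
    page_structure: tuple       # e.g. (title, description, keywords, generator)
    resource_paths: frozenset   # core JS/CSS paths referenced by the page
    key_path_results: tuple     # e.g. ((path, status, body_hash), ...)


DIMENSIONS = (
    "header_signature",
    "final_destination",
    "page_structure",
    "resource_paths",
    "key_path_results",
)


def differing_dimensions(a: SiteSignals, b: SiteSignals):
    """Return the names of dimensions that differ; empty means all five match."""
    return [d for d in DIMENSIONS if getattr(a, d) != getattr(b, d)]


def verdict(a: SiteSignals, b: SiteSignals) -> str:
    diff = differing_dimensions(a, b)
    if not diff:
        return "same site"
    if "key_path_results" in diff:
        return "separate: routing or permission difference"
    return "separate: different backend or access surface"
```

Returning the list of differing dimensions, rather than a bare boolean, also gives reviewers a reason for every refusal to merge.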

Semantic Normalization for Structured Responses

For XML/JSON structured shell pages, raw body hashing cannot be used directly. Instead, "semantic normalization" is applied: stable fields (e.g., code, message, resource, success, data) are extracted first, while dynamic fields (e.g., RequestId, TraceId, ServerTime, timestamp) are ignored. This prevents identical error pages from being misclassified as different sites due to request-level random values.
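
For the JSON case, semantic normalization might look like the sketch below: keep only the stable fields, serialize them canonically, and hash the result. The function name is an assumption, the whitelist is taken from the field names above, and nested dynamic values inside `data` are not handled here.

```python
import hashlib
import json

# Stable fields named in the text; everything else is treated as dynamic.
STABLE_FIELDS = ("code", "message", "resource", "success", "data")


def semantic_hash(body: str) -> str:
    """Hash only the stable fields of a JSON object body.

    Falls back to hashing the raw body when it is not a JSON object.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        doc = None
    if not isinstance(doc, dict):
        return hashlib.sha256(body.encode("utf-8")).hexdigest()
    stable = {k: doc[k] for k in STABLE_FIELDS if k in doc}
    # sort_keys and fixed separators give a canonical serialization.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two error responses that differ only in `RequestId` or `timestamp` then hash identically, while a different `code` or `message` still produces a different hash.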

5. Scenario Summary

| Scenario | Convergence Rule | Why Convergence May Not Apply |
| --- | --- | --- |
| Same domain, multiple ports | All five dimensions match → merge | Different ports may serve portal vs. admin backend |
| Same parent domain, different subdomains | All five dimensions match → merge | Subdomains may point to entirely different businesses |
| Different domains, same IP | Only normal business pages allowed; requires five-dimension match + same IP | Different domains may be multi-tenant, mirror sites, or white-label sites |
| Domain entry vs. direct IP access | No automatic convergence by default | Direct IP access may bypass virtual host routing to a different backend |

Representative Entry Point Selection

When merging, which entry point to keep for deep scanning follows "business first, stability first":

Prefer the entry point that actually serves business content over pure redirect sources
Prefer more standard, stable entry points (e.g., HTTPS over HTTP)
Prefer the domain entry point closest to the actual public-facing main entry
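
These preferences translate naturally into a sort key. In the sketch below, `Candidate` and `main_entry_rank` are hypothetical: the rank stands in for "closeness to the public-facing main entry" and is assumed to be supplied by the caller, since the text does not define how it is measured.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    serves_content: bool      # False for pure redirect sources
    is_https: bool
    main_entry_rank: int = 0  # hypothetical: 0 = main entry, larger = further away


def pick_representative(group):
    """Choose the entry point to deep scan from a group of equivalent candidates."""
    # Tuples compare left to right, so earlier preferences dominate later ones.
    return min(
        group,
        key=lambda c: (not c.serves_content, not c.is_https, c.main_entry_rank),
    )
```

Because `False` sorts before `True`, negating the flags puts content-serving HTTPS entries first without any explicit branching.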

6. Decision Flow

flowchart TD
    A["Enter candidate pool"] --> B{"Is it a noise page?"}
    B -->|Yes| B1["Preserve asset, skip deep scan"]
    B -->|No| C{"Is it a standard redirect?"}
    C -->|Yes| C1["Keep the business-serving side"]
    C -->|No| D{"Scenario type?"}
    D -->|Same domain, multi-port| E{"Five-dimension comparison: all match?"}
    D -->|Same parent domain, different subdomains| E
    D -->|Domain vs. direct IP access| F["No automatic convergence"]
    E -->|Yes| G["Declared as same site"]
    E -->|No| H["Keep as separate sites"]
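
The same flow can be written as a single function, shown here as a sketch. The scenario labels are made-up identifiers, and treating any unlisted scenario as "separate sites" is an assumption that follows the conservative principle rather than something the flowchart states.

```python
def decide(is_noise: bool, is_standard_redirect: bool,
           scenario: str, all_five_match: bool) -> str:
    """Walk the decision flow for one candidate entry point."""
    if is_noise:
        return "preserve asset, skip deep scan"
    if is_standard_redirect:
        return "keep the business-serving side"
    if scenario == "domain_vs_direct_ip":
        return "no automatic convergence"
    if scenario in ("same_domain_multi_port", "same_parent_different_subdomains"):
        return "same site" if all_five_match else "separate sites"
    # Assumption: anything unrecognized stays independent (conservative default).
    return "separate sites"
```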

7. Expected Benefits

Reduced duplicate scan costs: Shorter task duration, lower resource consumption, less queue pressure
Improved result quality: Fewer noise pages entering deep scanning, less dirty and duplicate data
Controlled miss-scan risk: Conservative convergence avoids false merges, ensuring real business entry points are not overlooked
Unified rule foundation: Lays the groundwork for downstream site governance, entry-point management, and scan strategy optimization

Conclusion

The core idea of site deduplication and deep scan convergence can be summarized in one sentence: insert a judgment layer between "reachable" and "worth deep scanning."

The three-layer progressive strategy — noise filtering, redirect merging, and equivalence verification — addresses three categories of problems: valueless pages, redundant redirects, and equivalent entry points. Each layer is designed to be conservative: better to over-scan than to miss something, but decisively converge when duplication is clearly proven.

The value of this strategy goes beyond saving scan resources. It makes scan results cleaner and more focused — security teams no longer see five duplicate reports for the same application, but instead see the truly independent assets that deserve attention.
