On 7 March 2017, the Apache Software Foundation disclosed CVE-2017-5638, an OGNL injection vulnerability in Struts 2 allowing unauthenticated remote code execution. A patched version shipped the next day. Two days later, Equifax's security team distributed the advisory to internal asset owners. A vulnerability scan ran on 15 March and reported clean — not because the scanner failed, but because the affected portal was not in the scanned inventory. On 13 May, an attacker entered through that same unpatched Struts component. The intrusion went undetected for seventy-six days. On 7 September 2017, Equifax disclosed the breach. The personal records of 147.9 million people were affected.

The House Oversight Committee report, published in December 2018, traces a chain of operational failures: an incomplete asset inventory, a scanner whose scope did not match deployed reality, an absent verification step that would have confirmed patch application, and legacy documentation insufficient for incident response. That report reads, deservedly, as a case study in operational failure.

There is a less-discussed dimension. The House Oversight Committee report does not establish whether Equifax made a conscious decision to defer patching the ACIS portal or simply never deliberated on it. That indistinguishability is itself the problem. Without a decision artefact — a record of who assessed the vulnerability against that specific asset, on what evidence, with what authorisation — the two states are forensically identical after the fact. A deliberate and documented deferral under defined compensating conditions is a defensible risk decision. An undocumented non-event is not. The report cannot tell the difference, and neither could Equifax when it needed to.

The pattern is not exclusive to Equifax. In December 2021, when CVE-2021-44228 (Log4Shell) reached widespread disclosure, organisations across sectors faced thousands of releases where Log4j 2.x was an embedded dependency. Many of those releases continued. In most cases, the decision to continue was not recorded as a decision. The signal was present. The trail was not.

Same gap, opposite direction: the supply chain incidents

Recent supply chain compromises exhibit a related but inverted pattern. Trojanised SolarWinds Orion updates, carrying a backdoor inserted into the build process, were distributed through 2020 and disclosed that December; downstream consumers — including federal agencies and Fortune 500 firms — accepted updates whose integrity was established by signing keys controlled by the compromised pipeline. The signature was valid. The artefact was not. Verifying the signature confirmed only that the update came from SolarWinds; it confirmed nothing about whether the release decision behind it was sound, because the consumer had no independent record of how the producer's decision had been made.

Five years later, the Shai-Hulud campaign demonstrates the same shape with newer surfaces. Compromises across PyPI (LiteLLM and others), GitHub Actions (Trivy, KICS/Checkmarx repository workflows) and CLI tooling (Bitwarden CLI) have leveraged the trust that consumers extend to maintainer pipelines. In each Shai-Hulud incident, the package or workflow ran with the maintainer's credentials and signing infrastructure. Detection at the consumer end depended on heuristics — install scripts that should not exist, build outputs diverging from expected hashes — rather than on a verifiable chain establishing what decision was made on the producer side.

Equifax in 2017 and the supply chain campaigns of 2020–2026 point to the same structural problem from different directions. Equifax shows what happens when a release continues without a producer-side decision artefact. SolarWinds and Shai-Hulud show what happens when a decision artefact exists but cannot be independently verified at the consuming end. Both are failures of the chain of custody around the release decision. Both produce damage measurable in years and billions.

"Chain of custody" is borrowed from digital forensics and law, where it refers to the documented, tamper-evident handling of evidence from seizure to courtroom. GRC practitioners will recognise the same concern under different names — change control, SDLC integrity, third-party risk management — but those terms carry procedural connotations that obscure the technical question at stake: not whether a process was followed, but whether what reached the consumer is demonstrably what the producer decided to release. The term is used here in that narrower, harder sense.

Two languages meet at the release moment

The release decision sits at a peculiar place in the software lifecycle. In every other phase, two ways of thinking can coexist without producing the same artefact. Threat modelling is engineering work; ISMS scope review is governance work; both can address the same system without their outputs ever meeting on the page. The release moment is different. A release is a transition of state — code that was internal becomes available to consumers, in production, with consequences that propagate. That transition has to be recorded in a way meaningful to engineering, and in a way meaningful to audit. Until the artefact exists, those two recordings are separate stories about the same event.

Engineering's requirement is composition. CVSS speaks to severity. EPSS speaks to exploitation likelihood. Reachability analysis tells us whether the vulnerable code path is actually invoked. Asset criticality, network exposure, compensating controls — each is a real factor in technical risk, and a useful release decision composes them rather than picking one. Engineering languages have grown around this composition for two decades. The novelty is not in computing the decision. It is in requiring the decision to be auditable.

Audit's requirement is different in kind. ISO 27001 Annex A.5.20 and A.8.32 establish that significant changes — and a release qualifies — require evidence of who authorised what, against what evidence, with what justification. NIS2 Article 21 obliges essential and important entities to maintain technical and organisational measures whose effectiveness is demonstrable on inspection. DORA Chapter III holds financial entities to similar obligations on ICT risk management. None of these frameworks specify how the decision is technically computed; what they specify is that the decision must leave a trail.

Assessments structured around frameworks like OWASP SAMM tend to surface a familiar asymmetry across mid-maturity organisations: the documents that constitute governance — policies, change procedures, incident playbooks — exist and are formally adopted, while the practices reported by the people who actually ship software do not match what those documents describe. The asymmetry is not specific to security. The literature on the policy-practice gap (Niemimaa & Niemimaa 2017; Hu, Hart & Cooke 2007; Albrechtsen & Hovden 2009) documents the same pattern across organisational settings going back decades. What makes the release moment specific is that it is the point where the asymmetry becomes externally visible. An incident makes the gap legible to regulators. An audit makes it legible to the auditor. Without a release decision artefact, the organisation has nothing to show on either occasion except documents that did not, in fact, govern the release.

Three classes of tooling, each solving a real problem

The argument that follows is not a comparison and not a critique. The tools below are mature, widely deployed, and resolve concrete needs in the contexts for which they were designed. The observation is narrower: when the release decision has to satisfy both sides of the boundary outlined above — engineering composition, audit trail, both at once — none of these classes was designed for that combined task. Reading "not designed for X" as "bad at X" misreads the argument. Each tool is good at what it does, and the combined boundary requirement is something none of them attempted.

Vulnerability gates that run in the build pipeline (Trivy with --exit-code and a severity threshold, Grype with --fail-on, Snyk's CI integrations, GitHub Advanced Security's gating) solve detection-and-stop. They consume scan output, apply a threshold, and either pass or fail the build. Their domain is fast feedback at low overhead. They are intentionally binary because pipelines cannot wait for human deliberation on every CVE. The decision they emit is "block" or "do not block", recorded as an exit code. That output is not a decision artefact in the audit sense — it is a signal to the build system. A pipeline cannot pause for fifteen minutes while a CISO inspects context, and these tools were never designed to ask it to.

Policy-as-code engines — OPA with Conftest, Kyverno, Cedar — solve declarative governance over arbitrary inputs. A policy is written in a high-level language, evaluated against structured input, and produces a decision with a reason chain. Their range is broad: image policies, IaC policies, API policies, release policies, anything that fits a declarative rule. Generality is the design choice. What policy-as-code does not provide on its own is the risk model for the specific case of vulnerabilities — the user supplies that. A useful CVE-aware OPA policy ends up importing CVSS scoring, EPSS lookup, exposure context, compensating controls. The engine decides; the model is your responsibility. The decision, once again, is a verdict, and its persistence as evidence is left to whatever surrounds the engine.

Governance, risk and compliance platforms — ServiceNow GRC, RSA Archer, OneTrust — solve the audit trail problem head-on. Workflows, approvers, evidence attachments, control mappings, regulator-ready reports. Persistence is the product. Where these platforms have historically struggled is the inverse: pipeline integration, latency, and grounding decisions in technical risk data rather than human-entered fields. A release decision in a GRC platform tends to be a workflow that humans complete — accurate, traceable, and slow. Pipeline-native they are not.

Every functioning programme combines pieces from these three classes. The handover between pieces is where things tend to go thin. Trivy says "block"; the team disagrees; an exception is granted in a Confluence page; ServiceNow records that an exception exists; six months later, no one can reconstruct what evidence justified it or whether the conditions under which it was granted still hold. The pipeline knew. The policy engine knew. The GRC system knew that something was decided. The composite — what was actually decided, on what basis, by whom, expiring when — is rarely available as a single verifiable record.

Four requirements for the release decision to function as evidence

The complaint above compresses into a small number of requirements. The release decision needs four properties to function as audit evidence and as engineering output simultaneously. Each addresses a recurring failure mode of the existing tooling combination. None of the four is novel in isolation. Their composition is what closes the gap.

Signed and versioned configuration

The configuration governing a release decision should itself be an auditable artefact, not a moving target. "Why did we decide this?" needs to resolve to "because this configuration, signed by this authority on this date, said so." A configuration that anyone with repository access can edit between two builds is not configuration in the audit sense — it is the same kind of object as the source code below it, governed by the same rules of casual modification.

The mechanisms exist. Detached signatures over the configuration file (gpg, sigstore, ed25519); trust stores with role-based signing rights; sealed configuration produced at the moment of issue and verifiable thereafter without recourse to the issuer. SLSA's attestation framework provides one model; cosign's signed manifests another. The technique is not exotic. Its absence in most release pipelines is.

The downstream consequence is concrete. When an auditor asks how a release decision was reached on 14 February, the answer is not "according to the policy that was in our repository at the time"; it is "according to configuration hash X, signed by key Y, valid between dates A and B." The hash is independently verifiable. The trust store records who held key Y on that date and whether that authority has since been revoked. The decision becomes anchored.
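
As a sketch of how that anchoring might be wired, assuming Python and the cryptography library's Ed25519 primitives (the file name, key handling and output are illustrative; a real deployment would keep the private key in a KMS or HSM and publish the public key through a trust store):

```python
# Minimal sketch: sign a configuration file on issue, verify it before
# it is allowed to govern a decision. Illustrative only; key storage,
# trust-store lookup and revocation checks are deliberately elided.
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Issuing side: sign the exact bytes of the configuration artefact.
config_bytes = open("release-policy.yaml", "rb").read()  # hypothetical file
private_key = Ed25519PrivateKey.generate()  # in practice: held in a KMS/HSM
signature = private_key.sign(config_bytes)
public_key = private_key.public_key()

# Consuming side: refuse to decide unless the signature verifies.
try:
    public_key.verify(signature, config_bytes)
except InvalidSignature:
    raise SystemExit("configuration signature invalid: refusing to decide")

# The hash is what the decision record later cites as "configuration X".
config_hash = hashlib.sha256(config_bytes).hexdigest()
print(f"configuration verified, sha256 {config_hash}")
```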

Canonicalised evidence

Evidence feeding the decision must be deterministic and independent of the source scanner. A vulnerability serialised by Trivy and the same vulnerability serialised by Grype should produce the same decision. The need is not new; it is what the SBOM and VEX standardisation efforts have been working toward for half a decade. CycloneDX VEX, OpenVEX, SPDX, SARIF — each carries part of the load. A canonical input layer that consumes any of these and emits a single internal representation lets the decision logic be tested against ground truth rather than against scanner quirks.

Canonicalisation also forces an explicit decision about what is and is not evidence. CVE identifier, base score, exploit prediction, reachability flag, component identifier, version range — first-class. Free text from scanner output — not. Fixing the schema means that arguments about whether a particular release was correctly classified are arguments about data, not about who interpreted what.
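A sketch of such a canonical layer in Python follows. The Finding fields mirror the first-class evidence listed above; the Trivy and Grype field paths are approximations of their JSON output and should be checked against the scanner versions actually in use, not treated as authoritative mappings.

```python
# Canonicalisation sketch: one internal Finding, one adapter per scanner.
# Field paths are approximate; verify against real scanner output.
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    cve_id: str                    # first-class evidence
    base_score: float              # CVSS base score
    component: str                 # affected package
    version: str                   # installed version
    reachable: bool | None = None  # None = no reachability data available

def from_trivy(vuln: dict) -> Finding:
    # Approximate shape of one entry in Results[].Vulnerabilities[].
    return Finding(
        cve_id=vuln["VulnerabilityID"],
        base_score=float(vuln.get("CVSS", {}).get("nvd", {}).get("V3Score", 0.0)),
        component=vuln["PkgName"],
        version=vuln["InstalledVersion"],
    )

def from_grype(match: dict) -> Finding:
    # Approximate shape of one entry in matches[].
    cvss = match["vulnerability"].get("cvss", [{}])
    return Finding(
        cve_id=match["vulnerability"]["id"],
        base_score=float(cvss[0].get("metrics", {}).get("baseScore", 0.0)),
        component=match["artifact"]["name"],
        version=match["artifact"]["version"],
    )
```

Identical vulnerabilities from either adapter now compare equal, which is what lets the decision logic be tested against ground truth rather than against scanner quirks.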

Deterministic and weighted decision

Given a signed configuration and canonicalised evidence, the same inputs should always produce the same decision. The decision logic should be a pure function — if it depends on time of day, network availability, or scanner version, it cannot be audited reliably. Determinism is what makes the decision reproducible six months later, when an auditor wants to verify that yes, given what was known then, the decision was the one the configuration prescribes.

Weighting is the engineering substance. A useful release decision composes several factors, not one. CVSS is one input. EPSS, where available, refines it. The reachability of the vulnerable path qualifies it. The asset's exposure profile (internet-facing or not, authenticated or not) modulates it. Compensating controls (WAF, network segmentation, runtime protection) reduce the residual. Asset criticality scales it. Each factor enters the score in a defined way; the formula is part of the signed configuration; no factor is invented at decision time.
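
To make "each factor enters the score in a defined way" concrete, a deliberately naive sketch follows; the weights, factors and blend formula are placeholders standing in for whatever the signed configuration prescribes, not a calibrated model.

```python
# Deterministic decision sketch: a pure function of evidence and signed
# configuration. Weights are illustrative placeholders, not calibrated.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    epss_weight: float = 0.5          # blend between CVSS and EPSS
    unreachable_factor: float = 0.3   # discount if path is never invoked
    internet_facing_factor: float = 1.25
    waf_factor: float = 0.8           # compensating-control discount
    block_threshold: float = 7.0

def decide(cvss: float, epss: float | None, reachable: bool,
           internet_facing: bool, waf: bool, cfg: Config) -> tuple[str, float]:
    """Same inputs, same output: no clock, no network, no scanner version."""
    score = cvss
    if epss is not None:
        # Blend severity with exploitation likelihood (EPSS scaled to 0-10).
        score = (1 - cfg.epss_weight) * score + cfg.epss_weight * epss * 10
    if not reachable:
        score *= cfg.unreachable_factor
    if internet_facing:
        score *= cfg.internet_facing_factor
    if waf:
        score *= cfg.waf_factor
    verdict = "block" if score >= cfg.block_threshold else "release"
    return verdict, round(score, 2)
```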

The empirical calibration of those weights is — to be honest about it — an open problem. CVSS-BT incorporates threat intelligence; EPSS provides probabilistic lift; the KEV catalogue identifies known-exploited cases. The literature is not silent. What is missing is a single model that integrates these inputs reliably across organisations of different sectors and sizes. A defensible release decision system documents its weights, exposes them to inspection, and treats their values as configuration revisable against evidence — rather than burying them in code.

Append-only record

Once made, the decision goes onto a record that resists later modification. Every release decision becomes a row in a log; every exception (acceptance of a vulnerability under a justification, with a defined expiry) becomes a row in a related log. The records are content-addressable, signed, chained, and exportable. Forensic integrity comes from the chain — modifying a past entry requires forging every subsequent entry's hash. The mechanism is not new. Git is itself an append-only log; transparency logs (Sigstore Rekor, Trillian) generalise the pattern; certificate transparency normalised it for an entire industry.
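
The chaining itself fits in a few lines. This sketch omits the signing and durable storage, and the entry shape is illustrative; what it shows is why altering a past entry invalidates every hash that follows it.

```python
# Hash-chained log sketch: each entry commits to its predecessor's hash.
import hashlib
import json

def entry_digest(body: dict) -> str:
    # Canonical JSON (sorted keys) makes the hash content-addressable.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append(log: list[dict], decision: dict) -> dict:
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = {"decision": decision, "prev_hash": prev_hash}
    entry = {**body, "entry_hash": entry_digest(body)}
    log.append(entry)
    return entry

def verify(log: list[dict]) -> bool:
    prev = "0" * 64
    for e in log:
        body = {"decision": e["decision"], "prev_hash": e["prev_hash"]}
        if e["prev_hash"] != prev or e["entry_hash"] != entry_digest(body):
            return False  # any tampering upstream breaks the chain here
        prev = e["entry_hash"]
    return True
```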

Novelty is not the point. The point is that release decisions, as currently practised in most organisations, do not live on such a log. They live in CI logs that rotate, exception emails that get archived, GRC tickets that close, Slack threads that age out. None of those are tamper-evident. None of them survive audit beyond the period of casual recall. An append-only record of release decisions is the difference between "we are confident we did this correctly" and "we can demonstrate, to a third party, exactly what was decided and why".

Composition, not displacement

What does this look like in practice? Not a replacement for the tooling described above, but a thin layer between scanner and audit, consuming what the existing tools already produce and emitting what those tools do not produce on their own.

A scanner — Trivy, Grype, Snyk, the GitHub-bundled scanner — runs in the pipeline and emits its native output. A canonicalisation step transforms that output into a release-decision input envelope, conforming to a defined schema. The input envelope and a signed configuration are passed to a decision component, which computes the release decision according to the documented weighting. The decision, the breakdown of how it was computed, the configuration hash, and the evidence hash are written to an append-only log. Exceptions, where granted, follow the same path with an additional signed acceptance.
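
Tied together under the assumptions of the earlier sketches (the from_trivy, Config, decide and append definitions above), the whole path from scanner output to log entry is short. The payload below is a stand-in, not real Trivy output:

```python
# End-to-end sketch composing the earlier illustrative pieces.
fake_vulnerabilities = [{          # stand-in for parsed scanner output
    "VulnerabilityID": "CVE-2021-44228",
    "CVSS": {"nvd": {"V3Score": 10.0}},
    "PkgName": "log4j-core",
    "InstalledVersion": "2.14.1",
}]

findings = [from_trivy(v) for v in fake_vulnerabilities]
cfg = Config()  # in practice: parsed from the verified, signed configuration

verdicts = [
    decide(f.base_score, epss=None, reachable=True,
           internet_facing=True, waf=False, cfg=cfg)
    for f in findings
]

audit_log: list[dict] = []
entry = append(audit_log, {
    "verdict": "block" if any(v == "block" for v, _ in verdicts) else "release",
    "config_hash": config_hash,  # from the signing sketch
    "findings": [f.cve_id for f in findings],
})
```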

The components on either side do not change. The pipeline still calls Trivy. The GRC platform still tracks compliance posture at the programme level. The policy engine, where it operates, still enforces declarative rules about non-vulnerability concerns — image provenance, IaC policy, secret scanning, license obligations. The decision artefact sits in the middle, consuming scanner output and feeding the GRC ledger.

This implies a particular operational discipline. The pipeline must call the canonicalisation step rather than consume scanner output directly. The configuration must move out of an editable YAML in the repository into a signed artefact under a trust store. The audit log must be persisted somewhere that survives the build agent. Each of those steps is incremental. None of them requires displacing existing tools. The cost is real but bounded: it is the cost of making the release decision the same kind of object as the release artefact itself.

What this gains, what remains open

For practitioners on the engineering side, the gain is leverage. Release decisions stop being moments of friction with security and compliance and become an interface with defined inputs and defined outputs. A pipeline that produces signed decisions is also a pipeline whose decisions can be tested, refactored, and improved without each change requiring a fresh negotiation. The release gate becomes part of the codebase, not part of the politics.

For auditors and risk officers, the gap between what the documentation says and what the system does becomes inspectable. Today, an auditor's job at a release governance review consists largely of correlating documents — policy says X, change record says Y, ticket says Z — and inferring whether the decision was sound. With signed decisions, the inference shrinks. The decision is the record. The configuration that governed it is the policy. The expirations on accepted exceptions are themselves enforceable by the next pipeline run. The audit becomes verification, not reconstruction.
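
Enforcement of those expirations is mechanical once exceptions are log rows rather than emails. A sketch, assuming an illustrative entry shape with an ISO 8601 expires field:

```python
# Expiry sketch: an exception suppresses a finding only until its signed
# expiry; after that, the next pipeline run re-raises the finding.
from datetime import datetime, timezone

def active_exceptions(exception_log: list[dict]) -> set[str]:
    now = datetime.now(timezone.utc)
    return {
        e["cve_id"]
        for e in exception_log
        if datetime.fromisoformat(e["expires"]) > now
    }

# An acceptance that expired in 2022 no longer suppresses anything.
log = [{"cve_id": "CVE-2021-44228", "expires": "2022-03-01T00:00:00+00:00"}]
assert active_exceptions(log) == set()
```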

What is not yet settled, honestly, includes three things.

First, empirical calibration. Recent work on CVSS supplements (CVSS-BT incorporating threat intelligence, EPSS providing probabilistic lift, the KEV catalogue tracking known-exploited cases) has improved the inputs available to a composed risk model. Integrating those inputs into a model that gives reliably useful decisions across organisations of different size and sector remains open. The right posture is to treat weights as configuration, expose them to inspection, and revise them against evidence.

Second, granularity. A release decision computed at the asset level treats all vulnerabilities in a deployed asset under the same exposure profile. Reality is finer-grained: a vulnerability in a build-time dependency has different reachable surface than one in the runtime data path. Component-level granularity makes the decision more accurate and more demanding to maintain. This is a design space, not a solved question. Any concrete implementation has to declare where it sits.

Third, adoption. Three paths to this concept are visible: extending policy-as-code engines with vulnerability-specific reasoning and signed configuration; building dedicated tooling that occupies the layer between scanner and ledger; integrating the function into platforms (GRC, ASPM) that already span engineering and audit. None has yet reached a position from which it obviously displaces the others. The terrain is still settling.

For the CTI and detection side — useful in passing — signed release decisions are the cleanest telemetry a programme will produce on its own deployments. Knowing exactly what was released, under which policy, with which exceptions still active, with which evidence at decision time, is operationally precious. Incident reconstruction stops being a forensics-on-archives exercise and starts being a query against a log. Detection engineering against the deployed surface becomes precise rather than approximate. The release decision is upstream telemetry; treating it that way pays out late but pays out reliably.

Now, some objections to this approach are worth stating plainly, because they are in part correct. A composed risk score is not a measurement — it is a heuristic. The factors that go into it (severity, exploitation likelihood, reachability, compensating controls) are real, but the specific way they are combined has no theoretical derivation and no validated ground truth. There is no empirical basis for multiplying rather than weighting, no settled evidence for the exact reduction a WAF confers, no consensus on where to place the cap on compensating effects. The literature offers no settled answer on how to weight CVSS against EPSS, or reachability against compensating controls. A policy that automates a bad calibration at scale produces systematic errors faster than any individual human could. And a cryptographic signature guarantees the integrity of the process, not the soundness of the policy that the process executed. These are real limits, not edge cases.

The distinction that matters is not between automated and human decisions, but between attributed and unattributed ones. The four requirements described above do not remove the human from the release decision. They make the human accountable for it. The scoring weights are configuration — written by a person, signed by a person, revisable by a person. An exception is a signed statement by a named actor with a justification and an expiry date, not a flag flipped in a database. Business context, strategic timing, a CEO's appetite for risk at a specific moment — these enter through the acceptance mechanism, in writing, under a key. What the system produces is a record of what the humans decided, not a substitute for the decision itself.

The alternative worth comparing against is not a perfect risk model. It is the current state: decisions taken without a trail, calibrations that exist only as institutional memory, exceptions granted in Slack threads that no one can reconstruct six months later. A documented imperfect decision is auditable, correctable, and attributable. An undocumented one is nothing. The goal here is the former, not the latter — and that goal does not require the scoring model to be perfect. It requires it to be visible.

The technique above does not require new vendors, new categories, or new acronyms. It requires that the release decision be the same kind of object as the artefact it governs — signed, versioned, deterministic, recorded. With those four in place, the decision becomes evidence rather than opinion.

References

  1. Albrechtsen, E., & Hovden, J. (2009). The information security digital divide between information security managers and users. Computers & Security, 28(6), 476–490.
  2. Apache Software Foundation. (2017). CVE-2017-5638: Struts 2 OGNL injection. https://nvd.nist.gov/vuln/detail/CVE-2017-5638
  3. CISA. (2021). Mitigating Log4Shell and Other Log4j-Related Vulnerabilities (AA21-356A).
  4. CycloneDX. (2024). Vulnerability Exploitability Exchange (VEX) specification. OWASP Foundation.
  5. European Parliament & Council. (2022). Directive (EU) 2022/2555 (NIS2): measures for a high common level of cybersecurity. OJEU L 333/80.
  6. European Parliament & Council. (2022). Regulation (EU) 2022/2554 (DORA): digital operational resilience for the financial sector. OJEU L 333/1.
  7. Hu, Q., Hart, P., & Cooke, D. (2007). The role of external and internal influences on information systems security: a neo-institutional perspective. Journal of Strategic Information Systems, 16(2), 153–172.
  8. International Organization for Standardization. (2022). ISO/IEC 27001:2022: Information security management systems.
  9. Jacobs, J., Romanosky, S., Edwards, B., Roytman, M., & Adjerid, I. (2021). Exploit Prediction Scoring System (EPSS). Digital Threats: Research and Practice.
  10. Niemimaa, E., & Niemimaa, M. (2017). Information systems security policy implementation in practice: from best practices to situated practices. European Journal of Information Systems, 26(1), 1–20.
  11. OpenVEX. (2024). OpenVEX specification v0.2.0. https://openvex.dev
  12. OWASP Foundation. (2024). Software Assurance Maturity Model (SAMM) v2.1. https://owaspsamm.org
  13. SLSA. (2024). Supply-chain Levels for Software Artifacts, v1.0. https://slsa.dev
  14. US House of Representatives, Committee on Oversight and Government Reform. (2018). The Equifax Data Breach: Majority Staff Report.