We asked the same CISO planning question five ways. The first four produced progressively better recommendations. The fifth changed the structure of the decisions themselves: initiative sequencing reversed, a single governance deliverable split into two artifacts with different audiences and deadlines, and evidence retrieval was triggered by disagreements that didn't exist until the analysis was underway.
Here is what each level produces — and where better prompting stops and governed execution begins.
The Experiment
We asked "How should an EU-based mid-market CISO prioritize?" five ways. A generic prompt. Extended reasoning. Six structured domain knowledge files. Those domain files plus pre-selected evidence from a 575-item reference pool. And a governed multi-agent deliberation using GOSTA — an open-source agentic execution architecture designed for multi-domain decision problems.
Two boundaries appeared — one where domain knowledge transforms the output, and another where governance transforms it again.
What Prompts Produce — And Where They Stop
A clean prompt returned eleven priorities for a team of two. Nothing technically wrong. Nothing operationally decisive. Extended reasoning improved sequencing but not grounding — ten priorities, better ordered, still generic.
Adding six domain knowledge files — NIS2 articles, ENISA 2025 threat data, operational capacity constraints, business continuity economics, technology architecture, supply chain risk — collapsed the list from ten to five. Budgets became realistic: €24K–€68K instead of abstract percentages. The AI stopped recommending things a two-person team cannot do.
This is accessible now without a framework: curate domain expertise into structured files, provide them as context, and recommendations become grounded rather than generic. But five good priorities, zero explicit trade-offs between them.
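A minimal sketch of that pattern, assuming a local directory of markdown domain files and any chat-completion model. The directory name, file layout, and instruction wording here are ours, not a prescribed format:

```python
from pathlib import Path

# Hypothetical layout: one structured markdown file per domain, e.g.
# nis2_articles.md, enisa_2025_threats.md, operational_capacity.md, ...
DOMAIN_DIR = Path("domain_knowledge")

def build_context(question: str) -> str:
    """Concatenate curated domain files into a single grounded prompt."""
    sections = []
    for path in sorted(DOMAIN_DIR.glob("*.md")):
        sections.append(f"## Domain: {path.stem}\n{path.read_text(encoding='utf-8')}")
    return (
        "Answer using only the domain knowledge below. "
        "Respect stated capacity and budget constraints.\n\n"
        + "\n\n".join(sections)
        + f"\n\n## Question\n{question}"
    )

prompt = build_context("How should an EU-based mid-market CISO prioritize?")
# Send `prompt` to any chat model; the grounding, not the model, does the work.
```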
Adding pre-selected evidence produced five labeled tensions, among them regulatory compliance versus operational feasibility, breadth versus depth, and vendor assessment rigor versus feasibility. Generic operational categories — trade-offs any experienced CISO would recognize without AI assistance. None change what you build or in what order. Labels, not tested conclusions.
This is the point where better prompting stops changing the decision.
What Governed Execution Adds
GOSTA uses the same domain knowledge and draws from the same evidence pool. The difference is governance: constraints defined upfront, roles separated, disagreement maintained until evidence resolves it, reversals tested, outputs traceable.
Three agents — each grounded in two domain models — independently assessed priorities, then challenged each other across four rounds. During deliberation, they retrieved sixty-seven evidence items on demand in response to specific analytical needs.
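The full session configuration is published in the repo (see below); the sketch here is our own simplified rendering of the structure just described, not GOSTA's actual schema. The third agent's name, the pairing of domain files to agents, and the constraint strings are assumptions:

```python
from dataclasses import dataclass

@dataclass
class AgentSpec:
    name: str
    domain_models: list[str]    # each agent is grounded in two domain files

@dataclass
class SessionConfig:
    agents: list[AgentSpec]
    rounds: int                 # challenge rounds after independent assessment
    evidence_pool: str          # reference pool queried on demand, not pre-selected
    constraints: list[str]      # defined upfront, before deliberation starts
    trace_outputs: bool = True  # every conclusion links back to retrieved evidence

session = SessionConfig(
    agents=[
        AgentSpec("regulatory", ["nis2_articles", "business_continuity_economics"]),
        AgentSpec("threat", ["enisa_2025_threats", "supply_chain_risk"]),
        AgentSpec("operations", ["operational_capacity", "technology_architecture"]),
    ],
    rounds=4,
    evidence_pool="reference_pool",  # the 575-item pool from the experiment
    constraints=["two-person security team", "EU mid-market budget"],
)
```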
The value became clear once the process started producing decisions the earlier levels could not.
Governance split. The regulatory and threat agents clashed over governance. One pushed NIS2 Article 20 as year one's top priority — personal liability for management bodies. The other pushed back: governance without controls is theater. The deliberation revealed that governance is two structurally different artifacts — a 10-page authorization charter (board approval, escalation protocol, 60 hours) and a 150-page compliance codification (full control mapping, year two). Different audiences, different deadlines. The charter satisfies Article 20 in months one and two. Controls execute in parallel. Every other level treated governance as one deliverable.
Dependency discovery. The second disagreement was whether vendor attestation or internal identity hardening deserved higher year-one priority. Evidence retrieval showed the two were not competing priorities but linked ones — vendor assurance is only as credible as the maturity of your own identity controls. That dependency surfaced only because two agents disagreed and retrieval was triggered by the disagreement itself.
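That inverts the pre-selected-evidence pattern: the retrieval query does not exist until two agents take conflicting positions. A hedged sketch of the trigger, with `retrieve` standing in for a query against the reference pool (the function and its signature are illustrative, not GOSTA's API):

```python
from typing import Callable

def resolve_disagreement(
    claim_a: str,
    claim_b: str,
    retrieve: Callable[[str, int], list[str]],
    top_k: int = 5,
) -> list[str]:
    """Pull evidence on demand, scoped by a live disagreement.

    The disagreement itself becomes the retrieval query, so evidence
    arrives in response to a specific analytical need rather than
    being chosen before anyone disagreed.
    """
    query = f"Evidence bearing on the conflict between: ({claim_a}) and: ({claim_b})"
    return retrieve(query, top_k)
```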
Reversal testing. The framework then tested its own conclusions. For each sequencing decision, it argued the reverse: what happens if you break this dependency? What does it cost? Three dependencies were confirmed as structural — breaking them causes measurable harm. Three were softened to operational preferences.
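The reversal test can be expressed as a small loop: for each claimed dependency, argue the inverse ordering and score the cost of breaking it. This sketch is our illustration of that logic, not GOSTA's implementation; `argue_reverse` stands in for a model call that estimates harm:

```python
from typing import Callable

def reversal_test(
    dependencies: list[tuple[str, str]],
    argue_reverse: Callable[[str, str], float],
    harm_threshold: float = 0.5,
) -> dict[tuple[str, str], str]:
    """Classify each claimed 'before -> after' dependency by arguing its reverse.

    argue_reverse(before, after) returns an estimated harm score in [0, 1]
    for doing `after` first, i.e. the cost of breaking the dependency.
    """
    verdicts: dict[tuple[str, str], str] = {}
    for before, after in dependencies:
        harm = argue_reverse(before, after)
        # Structural: reversing causes measurable harm. Anything below the
        # threshold is an operational preference, not a hard constraint.
        verdicts[(before, after)] = "structural" if harm >= harm_threshold else "preference"
    return verdicts

# Example, using the dependency surfaced above:
# reversal_test([("identity_hardening", "vendor_attestation")], model_call)
```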
Cross-domain finding. Some findings surfaced only when the domain perspectives were forced to confront each other. NIS2, GDPR, and DORA share enough structural elements that a unified control framework can materially reduce duplicated compliance overhead. That finding required multiple perspectives to interact before it became visible.
How to Tell Which Questions Need This
Not every decision needs governed execution. The distinction is domain multiplicity and constraint competition.
A question is prompt-grade when it lives within a single domain and has a knowable best answer. "What does NIS2 Article 21 require?" AI handles this well, and domain context makes it better.
A question becomes framework-grade when multiple domains with legitimate perspectives conflict, binding constraints force trade-offs, and the stakeholder needs to see which trade-offs create sequencing dependencies — grounded in traceable evidence and contested reasoning.
CISO annual planning is framework-grade. So is strategic vendor selection when regulatory, technical, financial, and operational perspectives compete. So is any organizational decision where the honest answer is "it depends on which domain you ask" and the decision-maker needs the dependencies, not just the answer.
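If a checklist helps, the triage is mechanical enough to write down. A hypothetical helper with the criteria taken directly from the definition above:

```python
def is_framework_grade(
    conflicting_domains: int,
    binding_constraints: bool,
    needs_sequencing_dependencies: bool,
) -> bool:
    """Triage a question as prompt-grade or framework-grade.

    Framework-grade: multiple domains with legitimate perspectives conflict,
    binding constraints force trade-offs, and the stakeholder needs the
    dependencies between trade-offs, not just an answer.
    """
    return (
        conflicting_domains >= 2
        and binding_constraints
        and needs_sequencing_dependencies
    )

# "What does NIS2 Article 21 require?" -> is_framework_grade(1, False, False) is False
# CISO annual planning                 -> is_framework_grade(3, True, True) is True
```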
What This Means
Domain knowledge files are undervalued. The jump from generic prompt to domain-informed prompt was the largest single improvement across all five levels. Every organization should be curating domain expertise into structured files. No framework required.
When a question crosses into framework-grade territory, the framework should not embed domain assumptions — it should govern how domain inputs interact. The governed process does not produce objective truth. It produces reasoning that is contestable, traceable, and stress-tested against reversals and evidence retrieval triggered by disagreement.
Better prompting improves recommendations. Governed execution tests whether those recommendations survive conflict, reversal, and evidence. That is the boundary.
All six domain models, the L1–L4 outputs, and the full L5 session configuration from this analysis are published at github.com/cybersoloss/GOSTA-OSS.
Cybersol governs third-party risk, compliance, and liability exposure for EU organizations under NIS2, DORA, and the CRA.