This is Part 3 (Final) of the first Zenzic Engineering series. If you missed the earlier instalments:
Part 1: Your Documentation is a Leaking Pipe
Part 2: What Happens When You Rip the Foundation Out of a Security Tool
Documentation rots silently. Broken links, orphan pages, stale placeholders — they accumulate until a user hits them in production. But the most dangerous rot isn't a 404. It's the API key a developer pasted into a code fence and forgot about.
Zenzic is an open-source documentation security framework that catches all of this. It scans Markdown and MDX files across Docusaurus, MkDocs, and vanilla documentation projects — checking links, references, orphans, assets, snippets, and, critically, leaked credentials. No build framework needed. No subprocesses. Pure static analysis.
But how do you know your security scanner actually works?
You attack it.
What Shield Is — and Why Breaking It Matters
Shield is Zenzic's credential detection layer. It scans every Markdown and MDX file in your documentation before the build runs, looking for patterns that indicate real credentials in content.
The threat model is simple: a contributor submits a PR with a code example. That example contains a real API key — copied from a local terminal session, pasted from a Slack thread, or forgotten after a debugging session. The reviewer reads the prose, not the bytes. The PR merges. The docs build. The key is now live on your documentation site, indexed by search engines.
Shield exists to catch that before it ships. If Shield can be bypassed by someone who knows how it works, it's not a scanner — it's a false guarantee.
The Attack Surface
Shield's architecture before Operation Obsidian Stress followed a straightforward pipeline:
1. Read each line of the Markdown/MDX file
2. Apply a normalization pass (strip backticks, collapse whitespace)
3. Run 9 regex patterns against the normalized line
4. Report any match as a ShieldFinding
Step 4 triggers Exit Code 2 (Shield breach) — non-bypassable, distinct from Exit Code 1 (validation failure) and Exit Code 3 (Blood Sentinel / path traversal).
The attack surface was step 2: the normalization pass. It normalized formatting noise but did not account for deliberate obfuscation.
A note on methodology: The "Red Team", "Blue Team", and "Purple Team" referenced in this article were ensembles of specialized AI agents orchestrated under a multi-team agentic architecture — simulating real-world attack vectors with surgical precision. The findings, bypass vectors, and fixes described below are real. We believe in transparency about our process.
Operation Obsidian Stress
Before releasing v0.6.1rc2, we ran a controlled security audit we called Operation Obsidian Stress. The operation was structured around three adversarial mandates:
- Red Team: Break the Shield (credential scanner), bypass the Blood Sentinel (path traversal guard), and test DoS resilience.
- Blue Team: Verify the Virtual Site Map integrity, i18n fallback chains, and compliance behavior.
- Purple Team: Audit CI/CD pipelines, performance benchmarks, and documentation accuracy.
The Red Team found four bypass vectors. All four were real. All four are now closed.
What follows is a precise accounting of how each bypass worked, why it worked, and how it was sealed.
The 4 Bypass Vectors
ZRT-006: The Invisible Characters
Severity: High — complete bypass of all regex patterns (CVSS analogy: 8.1)
Unicode defines a character category called Cf (Format characters). It includes, among others, U+200B (zero-width space), U+200C and U+200D (zero-width non-joiner and joiner), U+2060 (word joiner), U+00AD (soft hyphen), and U+FEFF (byte order mark).
These characters are invisible to human readers. They are not invisible to regex engines.
An attacker embedding a credential in documentation could write:

```
sk-abc123​def456ghi789...
```

The zero-width space sits between `123` and `def`. The regex pattern `sk-[a-zA-Z0-9]{48}` no longer finds a contiguous match. The credential leaks through the scanner in plain sight.
The fix: Strip all Unicode Cf-category characters before passing any line to the pattern matching layer.
```python
import unicodedata

def _strip_unicode_format_chars(text: str) -> str:
    return "".join(c for c in text if unicodedata.category(c) != "Cf")
```

This fix must run first in the normalization pipeline. Any subsequent step that introduces Cf characters would re-introduce the bypass.
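As a quick sanity check, here is a minimal reproduction of the bypass and the fix. The regex is an illustrative stand-in for Shield's real patterns, not the actual implementation:

```python
import re
import unicodedata

PATTERN = re.compile(r"sk-[a-zA-Z0-9]{48}")

def strip_cf(text: str) -> str:
    # Drop every Unicode Format (Cf) character, e.g. U+200B zero-width space
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

key = "sk-" + "a" * 48
poisoned = key[:10] + "\u200b" + key[10:]  # zero-width space injected mid-token

assert PATTERN.search(poisoned) is None                 # raw line evades the pattern
assert PATTERN.search(strip_cf(poisoned)) is not None   # stripping Cf restores the match
```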
ZRT-006 (Part 2): HTML Entity Obfuscation
Markdown renderers decode HTML entities. A documentation page containing:

```
sk&#45;abc123def456ghi789...
```

will render as `sk-abc123def456ghi789...` — a syntactically valid OpenAI API key. But `&#45;` (the HTML entity for the hyphen character `-`) breaks the raw regex match in the credential scanner.
This is a minimal, one-character substitution. A human proofreader reviewing the raw Markdown source would likely not notice it. The rendered output looks perfectly normal. The credential escapes.
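The substitution is easy to verify with a two-line experiment (again, the pattern is illustrative, not Shield's actual one):

```python
import html
import re

PATTERN = re.compile(r"sk-[a-zA-Z0-9]{48}")

obfuscated = "sk&#45;" + "a" * 48  # hyphen written as its HTML entity

assert PATTERN.search(obfuscated) is None                      # raw source evades the regex
assert PATTERN.search(html.unescape(obfuscated)) is not None   # decoding first restores the hit
```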
Affected credential families if left unpatched:
- `sk-...` (OpenAI): hyphen obfuscated as `&#45;`
- `sk_live_...` (Stripe): underscores obfuscated as `&#95;`
- `ghp_...` (GitHub): underscore in prefix obfuscated
The fix: Run `html.unescape()` over the line content before any pattern matching.

```python
import html

def _decode_html_entities(text: str) -> str:
    return html.unescape(text)
```

ZRT-007: Comment Interleaving
HTML comments (`<!-- ... -->`) and MDX expression comments (`{/* ... */}`) are invisible in rendered documentation output. A contributor who knows this can use them as credential-splitting noise:

```
sk-abc123<!-- this is a comment -->def456ghi789...
```

The rendered page shows the full character sequence. The scanner sees `sk-abc123`, then comment noise, then `def456ghi789...` — neither segment matches the full credential pattern.
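A sketch of why stripping comments rejoins the token (pattern and token lengths are illustrative):

```python
import re

HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
PATTERN = re.compile(r"sk-[a-zA-Z0-9]{48}")

line = "sk-" + "a" * 9 + "<!-- this is a comment -->" + "a" * 39

assert PATTERN.search(line) is None                               # the comment splits the token
assert PATTERN.search(HTML_COMMENT_RE.sub("", line)) is not None  # stripping it rejoins the match
```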
The fix: Strip all HTML comments and MDX comments before pattern matching.
```python
import re

_HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_MDX_COMMENT_RE = re.compile(r"\{/\*.*?\*/\}", re.DOTALL)

def _strip_comments(text: str) -> str:
    text = _HTML_COMMENT_RE.sub("", text)
    text = _MDX_COMMENT_RE.sub("", text)
    return text
```

ZRT-007 (Part 2): Cross-Line Token Splitting
This was the most architecturally interesting attack — not because it was technically sophisticated, but because it exploited the most fundamental assumption in the scanner's design.
Every line-by-line scanner assumes that credentials exist within a single line. An attacker who knows this can subvert the scanner with a single line break:
```
This is my API key for the staging environment: sk-abc123def456
ghi789jkl012mno345pqr678stu901vwx234yz567
```

No single line contains the complete credential pattern. The scanner processes line one and finds no match. It processes line two and finds no match. The credential leaks.
The fix: A stateful lookback buffer — a generator that maintains context between lines, creating a synthetic join zone where cross-line credentials become visible.

```python
from pathlib import Path
from typing import Iterable, Iterator

def scan_lines_with_lookback(
    lines: Iterable[tuple[int, str]],
    file_path: Path,
    buffer_width: int = 80,
) -> Iterator[ShieldFinding]:
    prev_normalized = ""
    prev_seen: set[str] = set()
    for line_no, raw_line in lines:
        seen_this_line: set[str] = set()
        normalized = _normalize_line_for_shield(raw_line)
        # Pass 1: scan the current line in isolation
        for finding in _scan_normalized_line(normalized, file_path, line_no):
            yield finding
            seen_this_line.add(finding.family)
        # Pass 2: scan the join zone between this line and the previous one
        if prev_normalized:
            join_zone = prev_normalized[-buffer_width:] + normalized[:buffer_width]
            for finding in _scan_normalized_line(join_zone, file_path, line_no):
                if finding.family not in (seen_this_line | prev_seen):
                    yield finding
        prev_normalized = normalized
        prev_seen = seen_this_line
```

The `buffer_width` parameter (defaulting to 80 characters) is a deliberately conservative choice. A credential split across a line boundary will almost always appear within the last 80 characters of one line and the first 80 of the next. A larger buffer increases the false-positive surface; a smaller one introduces missed-detection risk. Eighty characters covers the tail of a prose line that ends mid-credential.
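A stripped-down illustration of the join-zone idea, independent of Shield's internals (the pattern and line contents are made up for the demonstration):

```python
import re

PATTERN = re.compile(r"sk-[a-zA-Z0-9]{48}")
BUFFER_WIDTH = 80

lines = [
    "This is my API key for staging: sk-" + "a" * 12,
    "a" * 36 + " thanks for reviewing!",
]

# Per-line scanning misses the split credential entirely
assert all(PATTERN.search(line) is None for line in lines)

# The synthetic join zone makes the cross-line token visible
join_zone = lines[0][-BUFFER_WIDTH:] + lines[1][:BUFFER_WIDTH]
assert PATTERN.search(join_zone) is not None
```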
The lookback buffer adds measurable but acceptable overhead, and the overhead is roughly linear in corpus size: a 5,000-file documentation corpus completes in 626 ms on a mid-range runner — well within acceptable CI pipeline ranges.
The 8-Step Normalization Pipeline
After sealing all four vectors, Shield's normalization layer runs every line through a deterministic sequence of eight transformations before any regex pattern touches the content.
Order matters. Stripping Unicode Cf characters before HTML entity decoding ensures that entities cannot be obfuscated with format character injection. Stripping comments before backtick unwrapping ensures that comment-wrapped content inside code spans doesn't survive normalization. Each transformation creates the conditions under which the next transformation is safe to apply.
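Condensed, the ordering constraint looks like this. This is a sketch covering five of the eight steps; the remaining steps and the helper names are assumptions, not Zenzic's actual code:

```python
import html
import re
import unicodedata

_HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_MDX_COMMENT_RE = re.compile(r"\{/\*.*?\*/\}", re.DOTALL)

def normalize_line(line: str) -> str:
    # 1. Strip Cf characters first, so no later step can be poisoned by them
    line = "".join(c for c in line if unicodedata.category(c) != "Cf")
    # 2. Decode HTML entities (&#45; -> -, &#95; -> _) before pattern matching
    line = html.unescape(line)
    # 3. Strip HTML and MDX comments before unwrapping code spans
    line = _HTML_COMMENT_RE.sub("", line)
    line = _MDX_COMMENT_RE.sub("", line)
    # 4. Unwrap backticks, then 5. collapse whitespace
    line = line.replace("`", "")
    return " ".join(line.split())
```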
What Shield Catches
Shield scans for 9 credential families across all normalized content, including the OpenAI (`sk-...`), Stripe (`sk_live_...`), and GitHub (`ghp_...`) key formats.
A Shield breach triggers Exit Code 2 — distinct from validation failures (Exit 1) and path traversal detection (Exit 3, "Blood Sentinel"). This makes CI/CD integration unambiguous across all failure modes.
Exit Code Taxonomy
Zenzic's exit codes are non-negotiable — no configuration can suppress them:
- Exit 1: validation failure (broken links, orphans, stale placeholders)
- Exit 2: Shield breach (credential detected)
- Exit 3: Blood Sentinel (path traversal detected)
Codes 2 and 3 cannot be configured away. A CI step that can be silenced on a security failure is not a security control.
The Risk Management Dimension
The four bypass vectors found during Operation Obsidian Stress have a common property: they are not obscure edge cases. They are techniques that appear in standard lists of regex evasion methods used in adversarial content scenarios. They would be discovered by any documentation contributor with moderate knowledge of Unicode, HTML encoding, and regex mechanics.
The risk profile of an unpatched documentation scanner is not "low probability, low impact." It is "moderate probability, high impact" — because credential leaks in documentation have immediate material consequences, and because documentation pipelines receive content from the broadest possible contributor population.
This is the supply chain risk dimension that is most frequently underweighted: not the vulnerability of your infrastructure, but the vulnerability of the content processing path you expose to your contributor base.
A security tool that can be bypassed by a contributor who knows how it works is not a security tool. It is a compliance checkbox.
Beyond Security: The Full Zenzic Surface
Shield is one layer in a complete documentation quality framework:
- Link validation against a Virtual Site Map — no live server required
- Orphan detection — pages that exist but are unreachable in the navigation graph
- Snippet verification — code blocks referencing files that don't exist on disk
- Placeholder scanning — `TODO`, `FIXME`, `TBD` in published content
- Asset auditing — unused images consuming repository space, with autofix
- Reference integrity — `[broken][ref]`-style links with missing definitions
- Quality score — deterministic 0–100 metric with regression detection
All analysis is engine-agnostic: auto-detection covers MkDocs, Docusaurus v3, and Zensical, with a vanilla fallback. No plugins to install. No build to run. No subprocesses.
CI Integration
```yaml
# .github/workflows/docs.yml
- name: Zenzic Shield
  run: |
    pip install zenzic==0.6.1rc2
    zenzic shield --strict
# Exit code 2 → credential found → build fails
# Exit code 3 → path traversal → build fails
# No --ignore-shield flag exists
```

The Numbers After Operation Obsidian Stress

The Obligation of the Bastion
"The Bastion holds" is not a marketing phrase. It is an engineering commitment. It means that every identified attack path has been closed, that the closure has been verified with test coverage, and that the system's failure modes under adversarial input are bounded and known.
It does not mean that future bypass vectors don't exist. Red team exercises are not proofs of security — they are evidence of the security posture at a specific moment in time. The four vectors found during Operation Obsidian Stress were found because we looked for them systematically. Vectors we haven't enumerated may still exist.
What the Bastion commitment means is that we look — methodically, adversarially, and transparently about what we find.
Coverage Added by Operation Obsidian Stress
Before the operation: 929 passing tests. After closing all four vectors: 1,046 passing tests.
That is 117 new tests in total.
This concludes the Obsidian arc. The next series will follow Zenzic's evolution beyond security scanning — into the territory of documentation intelligence, adaptive rule engines, and what happens when a linter starts understanding the intent behind the content it analyzes.
Run It Against Your Docs
If your documentation is part of your build pipeline, it deserves the same validation rigor as your source code.
```shell
pip install --pre zenzic

# Full analysis (links + orphans + credentials + assets)
zenzic check all

# Security scan only
zenzic shield

# Let Zenzic auto-detect your engine
zenzic lint

# Or specify explicitly
zenzic lint --engine docusaurus
zenzic lint --engine mkdocs

# Quality score with regression detection
zenzic score
zenzic diff --baseline .zenzic-baseline.json
```

Run it on your repo. See what it finds — before your users do.
Resources
📦 PyPI: pypi.org/project/zenzic
💻 GitHub: github.com/PythonWoods/zenzic
📖 Docs: zenzic.dev
The Zenzic Engineering Series (Season 1)
← Part 1: Your Documentation is a Leaking Pipe
← Part 2: What Happens When You Rip the Foundation Out of a Security Tool
🏁 Part 3: We Put Our Documentation Linter Under an AI-Driven Siege. Here's the Post-Mortem. (You are here)
What's next? This concludes the first cycle of Zenzic Engineering. Follow this list to be notified when the next series of deep-dives and research papers is published.