The test that should have passed
Two months after I shipped Rudra v1.0, I sat down to write what I thought was a routine integration test.
The setup was simple. Admin A creates a project. Admin B logs in with a completely separate account and tries to delete admin A's project. The expected result: 404 Not Found. Admin B shouldn't even know that project exists.
The test failed.
Admin B got back a cheerful 200 OK. The project was gone. Admin A's realm, users, roles, webhooks, SSO configurations, all of it, deleted by a completely unrelated admin who just happened to know the realm name.
That's the moment I realized my "finished" auth platform had a bug that would have been a CVE in any commercial product.
I stared at the green passing CI pipeline on my other monitor for a long time.
What I'd actually shipped
For context in case you missed the first post: Rudra is an open-source, self-hosted auth platform built on top of Keycloak. I launched v1.0 in February 2026 as an alternative to the SaaS auth providers whose bills quietly scale into four figures. FastAPI backend, React dashboard, Python and JS SDKs, one docker compose up, MIT licensed.
The reception was kind. The GitHub stars were flattering. Engineers in my DMs were asking when v1.1 would ship.
I thought v1.1 would be features. More UI components. Magic links. Passwordless flows.
Instead, v1.1 became the release where I sat down with my own codebase and did not like what I found.
The authorization bug hiding in plain sight
Here's what was actually happening on sixteen of my endpoints.
A tenant-scoped endpoint like DELETE /api/tenants/{realm}/organizations/{slug} did the right thing at the front door. It required a valid JWT. It verified the signature. It pulled out the admin's email.
Then it did absolutely nothing with that email.
The endpoint would fetch whatever realm you asked for, find whatever organization you asked for, and delete it. There was no line of code anywhere that checked "is this admin actually the owner of this realm?"
Sixteen endpoints had this problem. Session deletion. Role assignment and removal. Organization membership. SAML provider setup. Client deletion. Webhook management. All of them were gated only by "are you logged in as some admin" and not "are you logged in as the admin who owns this."
In a platform that markets itself on multi-tenant isolation, this is the single worst bug you can ship.
The fix was a single helper:
```python
async def _get_owned_tenant(realm: str, admin: AdminUser) -> dict:
    tenant = await db.tenants.find_one({"realm": realm})
    if not tenant or tenant["owner_email"] != admin.email:
        raise HTTPException(404, "Not found")
    return tenant
```

Two things about that code are worth noting.
First, the response is 404, not 403. A 403 "you don't have permission" confirms the resource exists, which lets an attacker enumerate other tenants by sending probes. 404 keeps the existence of other admins' realms completely opaque.
Second, I wired this helper through the existing _check_feature function that was already doing plan-gating on four endpoints. Those four endpoints picked up the ownership check in the same pass. From now on, no endpoint can enforce plans without also enforcing ownership. Coupling them structurally is the only way I trust myself not to forget again.
Then I wrote the regression suite that would have caught this at v1.0. Two admins, one owns a realm, the other hits 404 on read, delete, session revoke, invitation creation, and role assignment. It runs on every pull request. If any of these ever returns 200 again, CI fails.
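The shape of those assertions, sketched against a toy in-memory stand-in for the API (the `TENANTS` store and `delete_org` function are illustrative fixtures, not Rudra's real test harness):

```python
# Toy stand-in for one tenant-scoped endpoint, showing the regression shape:
# the owner gets 200, any other admin gets 404, and nothing is deleted.
TENANTS = {"acme": {"owner_email": "a@example.com", "orgs": {"eng"}}}

def delete_org(realm: str, slug: str, admin_email: str) -> int:
    """Status code for DELETE /api/tenants/{realm}/organizations/{slug}."""
    tenant = TENANTS.get(realm)
    if tenant is None or tenant["owner_email"] != admin_email:
        return 404  # existence of the realm stays opaque to non-owners
    tenant["orgs"].discard(slug)
    return 200

# Admin B (doesn't own "acme") must see 404 and must not delete anything.
assert delete_org("acme", "eng", "b@example.com") == 404
assert "eng" in TENANTS["acme"]["orgs"]
# Admin A (the owner) succeeds.
assert delete_org("acme", "eng", "a@example.com") == 200
assert "eng" not in TENANTS["acme"]["orgs"]
```

The real suite drives the FastAPI app over HTTP, but every test reduces to this pair of assertions: owner succeeds, stranger gets an opaque 404.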
The CI that was lying to me
The reason this bug went unnoticed for two months is that my CI pipeline was green on every commit.
It was green because my CI pipeline wasn't actually testing anything.
I had two jobs named Test Python SDK and Test JavaScript SDK. They sounded reassuring. What they actually did was import the SDK module and check that certain attributes existed on it. No behavior was exercised. No authentication flow was triggered. No API endpoint was called. A catastrophic regression in literally any piece of business logic would not have been caught.
In v1.1, I renamed those jobs to Validate Python SDK imports and Validate JavaScript SDK imports. The behavior is unchanged. The name no longer pretends coverage I didn't have.
Then I wrote the test suite that should have existed at v1.0.
Twenty-three unit tests covering password hashing (including Unicode, malformed input, deterministic tampered-signature detection, expired tokens, algorithm pinning), the _required config helper, and Pydantic validator edge cases.
Fifteen integration tests covering health, auth flow, and tenant isolation, running against a real MongoDB container with mocked Keycloak. The Keycloak mocking is deliberate: a real Keycloak boot adds about two minutes per CI job, and I'm not asserting on Keycloak behavior in these tests, I'm asserting on Rudra's behavior.
Total runtime for all 38 tests: about three seconds locally. Fast enough that there is no excuse for not running them before pushing.
And critically, the Docker image build is now gated on tests passing. Before, Build Docker Images had no upstream dependency. It would build and tag an image even if every test in the repo failed. Now it sits behind Unit Tests and Integration Tests. No tests, no image, no green tick.
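In GitHub Actions terms, the gating is just a `needs:` edge; a sketch of the idea (job names and steps here are illustrative, not my actual workflow file):

```yaml
# Build only runs if both test jobs pass; a red test means no image.
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/unit
  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/integration
  build-docker-images:
    needs: [unit-tests, integration-tests]  # the new upstream dependency
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t rudra-backend ./backend
```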
The secret key that wasn't
Here's a line of code from v1.0's backend/config.py:
```python
SECRET_KEY = os.getenv("SECRET_KEY", "super-secret-key")
```

That fallback string is used to sign every JWT the platform issues. If you deployed Rudra to production and forgot to set the SECRET_KEY environment variable, the backend would boot happily, serve traffic, and sign its tokens with a string that is literally printed in my public GitHub repo.
Anyone who noticed could forge admin tokens.
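To make that concrete, here's a stdlib-only sketch of HS256 token minting. Anyone holding the fallback key can produce a signature the backend would accept; the payload fields below are illustrative:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url without padding, as JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_hs256(payload: dict, key: str) -> str:
    """Mint a JWT the way any HS256 signer does: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = hmac.new(key.encode(), f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"

# The fallback key was public, so a token like this verifies on any
# deployment that never set SECRET_KEY.
forged = sign_hs256({"sub": "attacker@example.com", "admin": True}, "super-secret-key")
assert forged.count(".") == 2
```

There is no exploit cleverness here; it's the normal signing algorithm run with a key everyone can read.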
The v1.0 docker-compose.yml made this worse. It baked admin / admin as the default Keycloak credentials. It hardcoded super-secret-key-change-in-production as SECRET_KEY. It shipped inline passwords for Postgres, Mongo, and Redis. The quick-start said docker compose up --build with zero mention of a .env file.
Meaning: a non-trivial number of early adopters were almost certainly running production-ish deployments with a signing key that was public knowledge.
The v1.1 fix is uncomfortable but correct:
```python
def _required(name: str) -> str:
    value = os.getenv(name, "").strip()
    if not value:
        sys.stderr.write(f"FATAL: {name} is required but not set.\n")
        sys.exit(1)
    return value

SECRET_KEY = _required("SECRET_KEY")
```

The backend now refuses to boot without a real SECRET_KEY. Same treatment for KEYCLOAK_ADMIN_USER, KEYCLOAK_ADMIN_PASSWORD, MONGODB_URL, and REDIS_URL. docker-compose.yml interpolates everything from ${VAR}. There's a .env.example with inline guidance on generating a real key:
```shell
SECRET_KEY=change_me_generate_a_long_random_value
# Generate with: python -c "import secrets; print(secrets.token_urlsafe(64))"
```

Yes, this breaks `docker compose up` for anyone upgrading who was relying on the defaults. That break is the feature. If the defaults worked, nobody would change them.
The dependencies that rotted in ten weeks
The other thing a missing CI check had been hiding: my pinned backend dependencies had been quietly accumulating vulnerabilities since February.
I added pip-audit --strict to CI. It immediately flagged ten published CVEs across my dependency tree:
- `starlette` had CVE-2024-47874 and CVE-2025-54121
- `python-jose` had PYSEC-2024-232 and PYSEC-2024-233
- `pyasn1` had CVE-2026-30922 (it couldn't be fixed in isolation because `python-jose` had it pinned; bumping `python-jose` to 3.5 finally unlocked `pyasn1` 0.6)
- `python-multipart` had three: CVE-2024-53981, CVE-2026-24486, and CVE-2026-40347
- `pymongo` had CVE-2024-5629
None of these are theoretical. All have published advisories.
The fix was boring: bump the versions, run the tests, pin the new versions. pip-audit --strict now runs on every PR against both backend/requirements.txt and sdk/python/requirements.txt. npm audit --audit-level=high runs against the frontend. Any new CVE at high severity or above fails the PR.
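The CI step itself is short; a sketch of the audit job (the flags are real `pip-audit` and `npm audit` options, the job layout is illustrative):

```yaml
dependency-audit:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: pip install pip-audit
    # pip-audit exits non-zero on findings; --strict also fails on
    # dependencies it couldn't audit, so nothing slips through silently.
    - run: pip-audit --strict -r backend/requirements.txt
    - run: pip-audit --strict -r sdk/python/requirements.txt
    - run: npm audit --audit-level=high --prefix frontend
```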
This is the structural fix that actually matters. I don't want to remember to audit dependencies. I want CI to block me when I don't.
The containers running as root
Rudra's v1.0 Dockerfiles both ran as root. No USER directive. No non-root user. A compromise in FastAPI or nginx would have landed an attacker with root inside the container.
The v1.1 backend runs as a dedicated rudra system user (uid 1001). The frontend image drops privileges the same way. Both images have proper HEALTHCHECK directives so docker compose ps reports meaningful status instead of just "up since X seconds." The compose file dropped the obsolete version: '3.9' key that Docker Compose V2 has been warning about for months.
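The hardening pattern, sketched as a Dockerfile (the uid matches the post; the base image, port, paths, and start command are illustrative, not Rudra's actual image):

```dockerfile
FROM python:3.12-slim

# Dedicated non-root system user; uid 1001 as in the real image.
RUN addgroup --system --gid 1001 rudra \
 && adduser --system --uid 1001 --ingroup rudra rudra

WORKDIR /app
COPY --chown=rudra:rudra . .

# Drop privileges before the process starts.
USER rudra

# Let `docker compose ps` report real health, not just "up since X seconds".
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/api/health')"

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```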
The frontend image also finally uses npm ci against a committed package-lock.json, not npm install with no lockfile. I'd been shipping whatever npm felt like resolving at build time, which is not a reproducibility story I want to be telling anyone.
The version strings that disagreed with each other
A small indignity, but one that bothered me: in v1.0.1, seven different files each declared the running version, and they didn't agree.
pyproject.toml said 1.0.1. sdk/python/setup.py said 1.0.1. But backend/main.py FastAPI title said 1.0.0. The /api/health payload said 1.0.0. frontend/package.json said 1.0.0. sdk/javascript/package.json said 1.0.0.
If you'd hit /api/health in production and opened a bug report against "v1.0.0," I'd have had no idea which version you were actually running.
All seven now say 1.1.0. All at once. Every release from here on bumps all seven in a single commit.
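This is also checkable mechanically. A sketch of a version-consistency check that could run in CI; the file contents are inlined here so the sketch is self-contained, where real use would read the seven actual files:

```python
import json
import re

def extract_versions(sources: dict) -> dict:
    """Map each 'file' to the version string it declares."""
    versions = {}
    for name, text in sources.items():
        if name.endswith(".json"):
            versions[name] = json.loads(text)["version"]
        else:
            # Matches `version = "x.y.z"` in TOML-ish files.
            versions[name] = re.search(r'version\s*=\s*"([^"]+)"', text).group(1)
    return versions

sources = {
    "pyproject.toml": 'version = "1.1.0"',
    "frontend/package.json": '{"version": "1.1.0"}',
    "sdk/javascript/package.json": '{"version": "1.1.0"}',
}
versions = extract_versions(sources)
assert len(set(versions.values())) == 1, f"version mismatch: {versions}"
```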
What v1.1 actually is
This release doesn't add a single new API endpoint. Existing SDK calls from v1.0 continue to work unchanged. From a product-feature lens, nothing happened.
What actually happened:
- 16 endpoints got the authorization check they always should have had
- 10 published CVEs got closed in the backend's dependency tree
- 38 real tests (23 unit, 15 integration) replaced an empty test suite
- The CI pipeline went from 4 jobs to 6, with the Docker build gated on tests passing
- Both Docker images now run as non-root with proper healthchecks
- `.env` became mandatory; the backend refuses to boot without one
- A `CHANGELOG.md` was created, going back to v1.0.0, following Keep a Changelog
- 17 product screenshots were captured and woven into the README, a dedicated `docs/screenshots.md`, and `docs.html`
- Version strings got aligned across all seven surfaces
15 commits. 52 files changed. 3,046 lines added, 137 removed.
It's a boring release. It's also the most important release I've shipped.
What this taught me that I didn't expect to learn
A few things have stuck with me from the last ten weeks.
Shipping v1.0 and shipping production-ready are different projects. I'd conflated them. I thought "feature complete, deploys with one command, has SDKs" was the same thing as "you should actually run this in production." It isn't. The first is table stakes. The second requires the boring stuff: threat model, ownership checks, dependency hygiene, honest tests, container hardening, secret management. None of that is visible from a feature list. All of it determines whether your thing deserves to be trusted.
A green CI badge is meaningless if the tests don't test anything. I'd been looking at my passing pipeline and feeling reassured. The badge was lying because I'd written jobs that were named like tests but didn't behave like tests. Go look at what your CI jobs actually execute. Not what they're called. What they execute.
Defaults get used. Every default in v1.0's compose file was insecure, and I knew it was insecure, and I'd told myself I'd remove it "before anyone actually used this." People actually used it. The defaults were used. Write defaults assuming production deploys depend on them, because some will.
404 is often better than 403 for multi-tenant systems. 403 confirms the resource exists. 404 reveals nothing. If you're building anything with tenant isolation, default to 404 on authorization failures.
pip-audit in CI is not optional. Not for any Python project that touches the internet. The amount of effort is one line of YAML. The downside of not having it is finding out about your CVEs from someone else.
Upgrading from v1.0.x
Four things worth knowing if you're running Rudra already:
- You need a `.env` now. Copy `.env.example`, fill in real values, especially `SECRET_KEY`. The backend refuses to boot otherwise. This is the only real breaking change.
- All existing JWTs become invalid. Because `SECRET_KEY` changes (you're moving off the public default), any tokens signed under the old key fail verification. Users have to log in again once. There's no migration path for this that doesn't involve that. Sorry.
- No API contract changes. Endpoints, request shapes, response shapes, SDK methods, all unchanged. Code written against v1.0.x keeps working.
- Scripts that relied on the missing authorization check will break. If any automation used admin A's token to act on admin B's realm, that's now a 404. This is the fix working correctly.
The upgrade itself:
```shell
git pull
cp .env.example .env
# edit .env, generate a real SECRET_KEY
docker compose down
docker compose up -d
```

What's next
The original v1.1 roadmap I'd sketched (UI components, magic links, SCIM, Terraform provider, Helm chart, audit log export) is still the plan. Now that the platform is actually a platform you should be willing to run, I can build on it without flinching.
v1.2 will focus on pre-built React components and magic link auth. The embeddable sign-in widget that moves Rudra closer to SaaS auth DX without moving away from self-hosting.
If you want that sooner rather than later, the way to help is not to wait. Open an issue about what you actually need. Submit PRs. Star the repo if you want me to know you're interested, because I weight feedback from people I can see.
Bottom line
v1.0 was the release where I shipped Rudra. v1.1 is the release where Rudra became something I'd run in production myself.
The features I was excited about in February are still the features that matter: open-source, self-hosted, no per-user fees, B2B-ready, Keycloak underneath. Nothing about the value proposition changed.
What changed is that now, when I tell you to run docker compose up, I mean it without caveats.
GitHub: github.com/subhrajit-mohanty/Rudra
Full changelog: CHANGELOG.md
License: MIT
If you're on v1.0.x, upgrade. Not for the features, but because of the bugs I just told you about.
If you haven't tried it yet, now is actually the right time.