Every developer knows the sudden, stomach-dropping panic.
You just ran git push. A few seconds later, it hits you: some personal information got leaked to the repo from your device. Once it's in your Git history, the damage is done. Scrubbing it from the cloud is a nightmare, and you're left praying a bot didn't scrape it in the five seconds it was public.
There are cloud-based scanners that promise to catch this, but they come with a massive irony: to protect your private data, you have to send your codebase to a third-party server.
That didn't sit right with me. So, I built scan_pii.
Meet scan_pii: Your Offline Security Net
scan_pii is a local-first, GPU-accelerated Python tool that aggressively scans your files (and every single commit in your Git history) for Personally Identifiable Information (PII).
It acts as a safety net that catches your mistakes before they ever leave your machine.
Instead of relying on basic regex rules that miss edge cases, scan_pii uses NVIDIA's GLiNER PII model, a zero-shot Named Entity Recognition (NER) model. Because it's CUDA-accelerated, it runs fast on your local GPU.
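To make the idea concrete, here is a minimal sketch of zero-shot PII detection with the gliner package. It is not the code inside scan_pii_logic.py: the label list and the helper names load_model and find_pii are hypothetical, and the model id is an assumption taken from the sample output further down.

```python
from typing import Any

# Hypothetical label set; the labels scan_pii actually passes are not shown in this post.
PII_LABELS = ["person", "email", "phone number", "api key"]


def load_model(model_id: str = "nvidia/gliner-pii") -> Any:
    """Load a GLiNER checkpoint and move it to the GPU when CUDA is available."""
    import torch
    from gliner import GLiNER  # pip install gliner

    model = GLiNER.from_pretrained(model_id)  # model id is an assumption
    return model.to("cuda") if torch.cuda.is_available() else model


def find_pii(model: Any, text: str, threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Return (label, matched text, score) for each entity scoring at or above threshold."""
    entities = model.predict_entities(text, PII_LABELS, threshold=threshold)
    return [(e["label"], e["text"], round(e["score"], 2)) for e in entities]


# Usage (downloads the model on first run):
#   model = load_model()
#   for hit in find_pii(model, "Contact Peter at peter@example.com"):
#       print(hit)
```

Because the labels are plain strings handed to the model at inference time, adding a new category means editing the list rather than writing a new pattern.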
Why this approach beats the alternatives:
- Contextual Detection: Zero-Shot NLP understands context, catching sensitive data that standard regex misses.
- Deep History Scans: It doesn't just check your current working directory. It checks every unique version of every file hidden in your Git history.
- Total Privacy: It runs entirely offline. Your code never leaves your machine.
- Speed: GPU optimization means you aren't waiting minutes for a pre-commit hook to finish.
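One way to reach "every unique version of every file" is to list every blob reachable from any ref and deduplicate by object hash, since identical content always shares a hash. The sketch below uses the standard git rev-list and git cat-file commands; the function names are hypothetical, and this is not necessarily how scan_pii walks history.

```python
import subprocess


def parse_blobs(typed_output: str) -> dict[str, str]:
    """Map blob hash -> first path seen, given lines of 'type hash path'."""
    blobs: dict[str, str] = {}
    for line in typed_output.splitlines():
        parts = line.split(" ", 2)  # maxsplit keeps paths with spaces intact
        if len(parts) < 2:
            continue
        kind, sha = parts[0], parts[1]
        if kind == "blob" and sha not in blobs:  # skip commits, trees, repeats
            blobs[sha] = parts[2] if len(parts) > 2 else ""
    return blobs


def list_unique_blobs(repo: str = ".") -> dict[str, str]:
    """List each distinct file version reachable from any ref in the repo."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    typed = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(rest)"],
        input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_blobs(typed)
```

Each hash returned by list_unique_blobs can then be read with git cat-file -p and passed to the detector once, no matter how many commits contain that version.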
How to Set It Up in 3 Minutes
You'll need Python 3.9+, a CUDA-capable GPU, and a pyenv virtual environment named pytorch-env (the pyenv activate command used below comes from the pyenv-virtualenv plugin).
1. Install the dependencies:
pyenv activate pytorch-env
pip install gliner torch tqdm tabulate
2. Save the core logic: Grab the scan_pii_logic.py file from the GitHub repo and drop it into your local bin.
mkdir -p ~/.local/bin
# Move scan_pii_logic.py into this folder
3. Create the global wrapper: Open your terminal and create the executable:
sudo nano /usr/local/bin/scan_pii
Paste this inside (it routes the command to your specific Python environment):
#!/bin/bash
/home/$(whoami)/.pyenv/versions/pytorch-env/bin/python ~/.local/bin/scan_pii_logic.py "$@"
Finally, make it executable:
sudo chmod +x /usr/local/bin/scan_pii
Catching Secrets in the Wild
To use it, just navigate to any local project repository and run: scan_pii .
The scanner will output a clean table detailing exactly what it found, the confidence level, and whether the leak is currently live or buried in your Git history.
SCAN COMPLETE: 8 potential leaks found!
+-----------------------------+---------+-------------------+--------+------------+
| File/Source | Type | Extracted Value | Conf | Location |
+=============================+=========+===================+========+============+
| LICENSE | person | Peter | 1 | Live |
| scan_pii_logic.py | api key | nvidia/gliner-pii | 0.98 | Live |
| scan_pii_logic.py (51a8891) | person | ain | 0.83 | History |
+-----------------------------+---------+-------------------+--------+------------+
Notice a leak in an old commit? You can grab the hash straight from the table and inspect it using git show <commit-hash>.
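If you have many History rows to check, a small helper can turn the File/Source column into a ready-made git show command. This is a sketch under two assumptions: the value in parentheses is an abbreviated commit hash, and the file path is relative to the repository root. The name git_show_command is hypothetical and not part of scan_pii.

```python
import re
from typing import Optional

# Matches "path" or "path (hash)", where hash is 7 to 40 hex characters.
_SOURCE = re.compile(r"^(?P<path>.+?)(?: \((?P<commit>[0-9a-f]{7,40})\))?$")


def git_show_command(source: str) -> Optional[list[str]]:
    """Build a `git show <commit>:<path>` command for a History row, else None."""
    match = _SOURCE.match(source.strip())
    if match is None or match.group("commit") is None:
        return None  # Live rows carry no commit hash
    return ["git", "show", f"{match.group('commit')}:{match.group('path')}"]


print(git_show_command("scan_pii_logic.py (51a8891)"))
print(git_show_command("LICENSE"))
```

The <commit>:<path> form prints the file exactly as it existed at that commit, which makes it easier to see the flagged value in context.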
The Bottom Line
Git history is forever. Even if you delete an API key from your active branch, it's still lurking in your commits. Whether you want to audit a legacy repository, ensure compliance, or just save yourself from a late-night panic attack, scan_pii gives you the peace of mind you need.
(Disclaimer: No tool is perfect. Always complement NLP scanners with strict .gitignore files, environment variables, and secure secrets management.)
Ready to secure your local workflow? Check out the full repository and source code here.