The Privacy Dilemma in the Age of AI
Today, interacting with AI feels seamless: type a question, receive an insightful answer. But for those of us working with sensitive technical documentation, confidential organizational information, or proprietary research, that convenience carries a hidden cost: data exposure.
If I can't verify where my data goes, how it's stored, or who might access it, should I share it?
Every prompt sent to a cloud-based LLM introduces risk. Even with strong privacy policies, you're ultimately trusting a third party with:
- The raw content of your documents
- Your query patterns and areas of intellectual focus
- Metadata that could reveal project timelines, system architectures, or security postures
As someone who regularly handles confidential materials and unpublished technical writing, I couldn't accept that trade-off. This isn't me being paranoid; it's a documented compliance and security concern grounded in real-world threat models.
The Case for Data Sovereignty
Trust Boundaries Matter
Retrieval-Augmented Generation (RAG) enhances LLMs by connecting them to external knowledge sources. RAG keeps your data separate from the model's weights, which helps, but the deployment architecture ultimately determines whether your information stays truly private.
Cloud AI APIs offer tremendous power, but they operate outside your control perimeter. When you upload a document to a managed RAG service, you implicitly accept that:
- You don't control data retention or deletion policies
- You can't audit how embeddings are generated, stored, or potentially reused
- You rely on the provider's security posture, not your own
For my work in cybersecurity, that lack of visibility wasn't just uncomfortable; it was a hard blocker.
Compliance, Control, and Offline Access
Beyond personal preference, there are practical scenarios where local AI is essential:
- Data residency requirements: Some projects mandate that data never leaves a specific machine or network
- Air-gapped environments: Security research often happens on isolated systems
- Long-term reproducibility: Local models and vectors ensure your RAG pipeline behaves consistently over time
- Cost predictability: No API calls means no surprise bills or rate limits
My Personal Threshold
I'm not anti-cloud. I use managed services when they make sense. But for knowledge work involving sensitive or unpublished material, my rule is simple:
"If the data shouldn't be public, it shouldn't leave my machine."
That principle drove every architectural decision in my RAG pipeline.
The Solution: A Local-First RAG Architecture
Here's the stack I assembled to keep intelligence local without sacrificing capability:
Core Components
| Component | Technology | Purpose |
|-----------|------------|---------|
| **LLM** | Ollama (Mistral) | Local inference, zero API calls |
| **Embeddings** | Ollama (nomic-embed-text) | 768-dimensional vectors, generated locally |
| **Vector Store** | PostgreSQL + PGVector | Persistent, queryable document storage |
| **Document Processing** | Docling | PDF, Word, PowerPoint, Excel, HTML conversion |
| **Interfaces** | FastAPI Web UI + CLI | Flexible access methods |
| **Agent Framework** | PydanticAI | Structured RAG orchestration |
Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│ USER INTERFACES │
│ ┌─────────────────────┐ ┌─────────────────┐ │
│ │ Web Interface │ │ CLI Interface │ │
│ │ (FastAPI + HTML) │ │ (Python async) │ │
│ └──────────┬──────────┘ └────────┬────────┘ │
└─────────────┼───────────────────────────────┼────────────┘
│ │
└─────────────┬─────────────────┘
│
┌─────────────▼───────────────────┐
│ RAG Agent Core │
│ ┌───────────────────────────┐ │
│ │ PydanticAI Agent │ │
│ │ + search_knowledge_base() │ │
│ └───────────────────────────┘ │
└─────────────┬───────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌────────▼────────┐ ┌──────▼───────┐ ┌──────▼───────┐
│ Embeddings │ │ LLM │ │ PostgreSQL │
│ (Ollama) │ │ (Ollama) │ │ + PGVector │
│ [Local] │ │ [Local] │ │ [Local] │
└─────────────────┘  └──────────────┘  └──────────────┘
```
Key Design Decisions

- **Ollama over OpenAI API**: Running Mistral locally via Ollama means no API keys, no usage tracking, and no data leaving my machine. The trade-off is slightly slower inference, but for my use case, privacy wins (see the agent sketch just after this list).
- **PGVector for Vector Storage**: PostgreSQL with the PGVector extension provides a battle-tested database with built-in vector similarity search. I control retention, backups, and access policies.
- **Document Processing with Docling**: Docling handles the messy work of converting PDFs, Office documents, and HTML into clean markdown — entirely offline.
- **Dual Interfaces**:
  - CLI for quick terminal queries and SSH access
  - Web UI for file uploads, visual browsing, and web crawling
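Here's roughly how that wiring looks in PydanticAI. Treat it as a sketch rather than the project's exact code: the system prompt and the stubbed tool body are mine, and PydanticAI's API details vary between releases.

```python
import os

from pydantic_ai import Agent, RunContext

# Point the OpenAI-compatible client at the local Ollama server
# (mirrors the environment configuration shown later in this post).
os.environ.setdefault("OPENAI_API_KEY", "ollama")
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:11434/v1")

# "openai:mistral" uses the OpenAI-compatible API, which the base URL
# above redirects to the local Mistral model. No data leaves the machine.
agent = Agent(
    "openai:mistral",
    system_prompt="Answer from the knowledge base and cite your sources.",
)

@agent.tool
async def search_knowledge_base(ctx: RunContext[None], query: str) -> str:
    """Illustrative stub: the real tool embeds the query and searches PGVector."""
    return "...top matching chunks..."
```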
What This Enables
✅ Query my internal documentation without leaving my network
✅ Process sensitive security reports with full auditability
✅ Transcribe audio recordings (via local Whisper) and search them
✅ Crawl documentation sites and keep everything local
✅ Maintain conversation context across sessions
✅ Get source citations for every response
Zero egress. Zero rate limits. Zero third-party visibility.
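The upload capability, for instance, is plain FastAPI. A minimal sketch of what such a route might look like; the path and response shape are my own choices, not the project's:

```python
import os

from fastapi import FastAPI, UploadFile

app = FastAPI()
os.makedirs("uploads", exist_ok=True)

@app.post("/upload")
async def upload(file: UploadFile) -> dict[str, str]:
    """Persist the file locally; ingestion (Docling + embedding) happens next."""
    path = os.path.join("uploads", file.filename or "upload.bin")
    with open(path, "wb") as f:
        f.write(await file.read())
    return {"status": "queued", "path": path}
```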
Is Local RAG Right for You?
Before diving into implementation, here's a quick self-assessment:
1. Local RAG is a good fit if:
- You handle sensitive, confidential, or unpublished data
- You have compliance or data residency requirements
- You want full control over your AI stack
- You're comfortable managing your own infrastructure
2. Cloud APIs may suffice if:
- You're prototyping or learning
- Your data is already public
- You prioritize convenience over control
- You need state-of-the-art model capabilities
For my work in cybersecurity, local RAG wasn't just preferable. It was necessary.
Implementation Highlights
Here's a look at the core implementation details that make local RAG possible:
Database Schema
The PGVector schema stores documents and their embeddings:
```sql
-- Chunks table with 768-dimensional embeddings
CREATE TABLE chunks (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
content TEXT,
embedding vector(768),
chunk_index INTEGER
);
-- Vector similarity search function
CREATE FUNCTION match_chunks(
query_embedding vector(768),
match_count INTEGER DEFAULT 5
) RETURNS TABLE(...)
```
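On the query path, the question is embedded locally and compared against stored vectors. Here's a minimal sketch of that lookup, assuming the `ollama` and `psycopg` packages; the cosine-distance operator `<=>` and the connection string are illustrative choices, not the project's exact code:

```python
import ollama
import psycopg

def search_chunks(query: str, k: int = 5) -> list[tuple[str, float]]:
    """Embed the query with nomic-embed-text and return the k nearest chunks."""
    # 768-dimensional embedding, generated entirely locally.
    emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    vector_literal = "[" + ",".join(str(x) for x in emb) + "]"

    with psycopg.connect("postgresql://user:password@localhost:5432/rag_db") as conn:
        # <=> is PGVector's cosine-distance operator; smaller means closer.
        return conn.execute(
            """
            SELECT content, embedding <=> %s::vector AS distance
            FROM chunks
            ORDER BY distance
            LIMIT %s
            """,
            (vector_literal, k),
        ).fetchall()
```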
Environment Configuration

```bash
# Point OpenAI-compatible clients to Ollama
OPENAI_API_KEY=ollama
OPENAI_BASE_URL=http://localhost:11434/v1
# Local models
LLM_CHOICE=mistral
EMBEDDING_MODEL=nomic-embed-text
# Local PostgreSQL
DATABASE_URL=postgresql://user:password@localhost:5432/rag_db
```
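With those variables set, anything built on the official `openai` client talks to Ollama instead of the cloud. A quick sanity check:

```python
from openai import OpenAI

# Picks up OPENAI_API_KEY and OPENAI_BASE_URL from the environment,
# so this request goes to http://localhost:11434/v1 (Ollama), not the cloud.
client = OpenAI()

resp = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Say 'local inference works'."}],
)
print(resp.choices[0].message.content)
```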
Performance Optimizations

- Database connection pooling (min_size=2, max_size=10)
- Embedding cache for frequently searched queries
- Token-by-token streaming for immediate feedback
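To make the first two concrete, here's a hedged sketch: a bounded `asyncpg` pool matching the settings above, plus a naive in-memory embedding cache. The helper names are mine, not the project's.

```python
import asyncpg
import ollama

# Bounded pool: connections are reused rather than opened per query.
async def make_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        "postgresql://user:password@localhost:5432/rag_db",
        min_size=2,
        max_size=10,
    )

# Naive cache: repeated queries skip the embedding model entirely.
_cache: dict[str, list[float]] = {}

def embed_cached(text: str) -> list[float]:
    if text not in _cache:
        _cache[text] = ollama.embeddings(
            model="nomic-embed-text", prompt=text
        )["embedding"]
    return _cache[text]
```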
Example Workflow
Here's what querying the RAG agent looks like in practice:
```
$ uv run python cli.py
> "What are the authentication requirements for the client portal?"
Searching knowledge base... [3 matches found]
Based on the security assessment documentation, the client portal requires:
1. Multi-factor authentication (MFA) for all user accounts
2. Session timeout after 15 minutes of inactivity
3. Password complexity: minimum 14 characters with special characters
Sources:
- security-assessment-2025.pdf (page 12)
- client-portal-requirements.md (section 3.2)
```

All of this happens locally — no data leaves the machine.
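The loop behind a session like that can be small. A minimal sketch, assuming an agent wired up as in the earlier example (streaming and citation formatting left out for brevity):

```python
import asyncio

from pydantic_ai import Agent

agent = Agent("openai:mistral")  # routed to local Ollama as configured above

async def main() -> None:
    while True:
        question = input("> ").strip()
        if question in {"exit", "quit"}:
            break
        # run() resolves tool calls (e.g. search_knowledge_base) before answering.
        result = await agent.run(question)
        print(result.output)  # `.data` in older PydanticAI releases

if __name__ == "__main__":
    asyncio.run(main())
```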
Trade-Offs and Limitations
Building a local RAG pipeline isn't without compromises. Several trade-offs are worth weighing:
- Hardware requirements: Need sufficient RAM and CPU (GPU recommended for larger models)
- Model capability: Local models may not match GPT-4 level reasoning
- Setup complexity: More initial configuration than API-based solutions
- Maintenance: You're responsible for updates, backups, and monitoring
For my threat model and use case, these trade-offs are acceptable. Your mileage may vary.
Getting Started
If you want to replicate this setup, the project is available on GitHub:
- Repository: rag-agent
- Documentation: https://n4igme.github.io/randscript/rag-agent/
Key prerequisites:
- Python 3.10+
- PostgreSQL with PGVector extension
- Ollama installed locally
- System libraries for audio processing (`libopus`, `opusfile`)
Helpful resources:
- Docling — Document processing pipeline (see the quickstart sketch below)
- PydanticAI — Agent framework
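For a feel of the ingestion step, Docling's basic flow looks like this; the filename is illustrative:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Handles PDF, Word, PowerPoint, Excel, and HTML inputs, entirely offline.
result = converter.convert("security-assessment-2025.pdf")
markdown = result.document.export_to_markdown()

# From here the markdown is chunked, embedded, and stored in PGVector.
print(markdown[:500])
```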
Final Thoughts
Building a local RAG pipeline taught me that privacy-preserving AI is entirely achievable today. The tools exist — Ollama, PGVector, Docling, PydanticAI — and they're mature enough for production use.
The question isn't whether local AI is capable. It's whether you're willing to invest the effort to keep your data under your control.
For me, that answer was clear: If the data shouldn't be public, it shouldn't leave my machine.