The Privacy Dilemma in the Age of AI
Today, interacting with AI feels seamless: type a question, receive an insightful answer. But for those of us working with sensitive technical documentation, confidential organizational information, or proprietary research, that convenience carries a hidden cost: data exposure.
If I can't verify where my data goes, how it's stored, or who might access it, should I share it?
Every prompt sent to a cloud-based LLM introduces risk. Even with strong privacy policies, you're ultimately trusting a third party with:
- The raw content of your documents
- Your query patterns and areas of intellectual focus
- Metadata that could reveal project timelines, system architectures, or security postures
As someone who regularly handles confidential materials and unpublished technical writing, I couldn't accept that trade-off. This isn't me being paranoid; it's a documented compliance and security concern grounded in real-world threat models.
The Case for Data Sovereignty
Trust Boundaries Matter
Retrieval-Augmented Generation (RAG) enhances LLMs by connecting them to external knowledge sources. RAG keeps your data separate from the model's weights, which helps, but the deployment architecture ultimately determines whether your information stays truly private.
Cloud AI APIs offer tremendous power, but they operate outside your control perimeter. When you upload a document to a managed RAG service, you implicitly accept that:
- You don't control data retention or deletion policies
- You can't audit how embeddings are generated, stored, or potentially reused
- You rely on the provider's security posture, not your own
For my work in cybersecurity, that lack of visibility wasn't just uncomfortable; it was a hard blocker.
Compliance, Control, and Offline Access
Beyond personal preference, there are practical scenarios where local AI is essential:
- Data residency requirements: Some projects mandate that data never leaves a specific machine or network
- Air-gapped environments: Security research often happens on isolated systems
- Long-term reproducibility: Local models and vectors ensure your RAG pipeline behaves consistently over time
- Cost predictability: No API calls means no surprise bills or rate limits
My Personal Threshold
I'm not anti-cloud. I use managed services when they make sense. But for knowledge work involving sensitive or unpublished material, my rule is simple:
"If the data shouldn't be public, it shouldn't leave my machine."
That principle drove every architectural decision in my RAG pipeline.
The Solution: A Local-First RAG Architecture
Here's the stack I assembled to keep intelligence local without sacrificing capability:
Core Components
| Component | Technology | Purpose |
|-----------|------------|---------|
| **LLM** | Ollama (Mistral) | Local inference, zero API calls |
| **Embeddings** | Ollama (nomic-embed-text) | 768-dimensional vectors, generated locally |
| **Vector Store** | PostgreSQL + PGVector | Persistent, queryable document storage |
| **Document Processing** | Docling | PDF, Word, PowerPoint, Excel, HTML conversion |
| **Interfaces** | FastAPI Web UI + CLI | Flexible access methods |
| **Agent Framework** | PydanticAI | Structured RAG orchestration |
Architecture Overview

```
┌─────────────────────────────────────────────────────────┐
│ USER INTERFACES │
│ ┌─────────────────────┐ ┌─────────────────┐ │
│ │ Web Interface │ │ CLI Interface │ │
│ │ (FastAPI + HTML) │ │ (Python async) │ │
│ └──────────┬──────────┘ └────────┬────────┘ │
└─────────────┼───────────────────────────────┼────────────┘
│ │
└─────────────┬─────────────────┘
│
┌─────────────▼───────────────────┐
│ RAG Agent Core │
│ ┌───────────────────────────┐ │
│ │ PydanticAI Agent │ │
│ │ + search_knowledge_base() │ │
│ └───────────────────────────┘ │
└─────────────┬───────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌────────▼────────┐ ┌──────▼───────┐ ┌──────▼───────┐
│ Embeddings │ │ LLM │ │ PostgreSQL │
│ (Ollama) │ │ (Ollama) │ │ + PGVector │
│ [Local] │ │ [Local] │ │ [Local] │
└─────────────────┘  └──────────────┘  └──────────────┘
```
Key Design Decisions

- **Ollama over OpenAI API**: Running Mistral locally via Ollama means no API keys, no usage tracking, and no data leaving my machine. The trade-off is slightly slower inference, but for my use case, privacy wins (see the agent sketch just after this list).
- **PGVector for Vector Storage**: PostgreSQL with the PGVector extension provides a battle-tested database with built-in vector similarity search. I control retention, backups, and access policies.
- **Document Processing with Docling**: Docling handles the messy work of converting PDFs, Office documents, and HTML into clean markdown — entirely offline.
- **Dual Interfaces**:
  - CLI for quick terminal queries and SSH access
  - Web UI for file uploads, visual browsing, and web crawling
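Here's roughly how that wiring looks in PydanticAI. Treat it as a sketch rather than the project's exact code: the system prompt and the stubbed tool body are mine, and PydanticAI's API details vary between releases.

```python
import os

from pydantic_ai import Agent, RunContext

# Point the OpenAI-compatible client at the local Ollama server
# (mirrors the environment configuration shown later in this post).
os.environ.setdefault("OPENAI_API_KEY", "ollama")
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:11434/v1")

# "openai:mistral" uses the OpenAI-compatible API, which the base URL
# above redirects to the local Mistral model. No data leaves the machine.
agent = Agent(
    "openai:mistral",
    system_prompt="Answer from the knowledge base and cite your sources.",
)

@agent.tool
async def search_knowledge_base(ctx: RunContext[None], query: str) -> str:
    """Illustrative stub: the real tool embeds the query and searches PGVector."""
    return "...top matching chunks..."
```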
What This Enables
✅ Query my internal documentation without leaving my network
✅ Process sensitive security reports with full auditability
✅ Transcribe audio recordings (via local Whisper) and search them
✅ Crawl documentation sites and keep everything local
✅ Maintain conversation context across sessions
✅ Get source citations for every response
Zero egress. Zero rate limits. Zero third-party visibility.
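The upload capability, for instance, is plain FastAPI. A minimal sketch of what such a route might look like; the path and response shape are my own choices, not the project's:

```python
import os

from fastapi import FastAPI, UploadFile

app = FastAPI()
os.makedirs("uploads", exist_ok=True)

@app.post("/upload")
async def upload(file: UploadFile) -> dict[str, str]:
    """Persist the file locally; ingestion (Docling + embedding) happens next."""
    path = os.path.join("uploads", file.filename or "upload.bin")
    with open(path, "wb") as f:
        f.write(await file.read())
    return {"status": "queued", "path": path}
```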
Is Local RAG Right for You?
Before diving into implementation, here's a quick self-assessment:
1. Local RAG is a good fit if:
- You handle sensitive, confidential, or unpublished data
- You have compliance or data residency requirements
- You want full control over your AI stack
- You're comfortable managing your own infrastructure
2. Cloud APIs may suffice if:
- You're prototyping or learning
- Your data is already public
- You prioritize convenience over control
- You need state-of-the-art model capabilities
For my work in cybersecurity, local RAG wasn't just preferable. It was necessary.
Implementation Highlights
Here's a look at the core implementation details that make local RAG possible:
Database Schema
The PGVector schema stores documents and their embeddings:
```sql
-- Chunks table with 768-dimensional embeddings
CREATE TABLE chunks (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
content TEXT,
embedding vector(768),
chunk_index INTEGER
);
-- Vector similarity search function
CREATE FUNCTION match_chunks(
query_embedding vector(768),
match_count INTEGER DEFAULT 5
) RETURNS TABLE(...)
```
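On the query path, the question is embedded locally and compared against stored vectors. Here's a minimal sketch of that lookup, assuming the `ollama` and `psycopg` packages; the cosine-distance operator `<=>` and the connection string are illustrative choices, not the project's exact code:

```python
import ollama
import psycopg

def search_chunks(query: str, k: int = 5) -> list[tuple[str, float]]:
    """Embed the query with nomic-embed-text and return the k nearest chunks."""
    # 768-dimensional embedding, generated entirely locally.
    emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
    vector_literal = "[" + ",".join(str(x) for x in emb) + "]"

    with psycopg.connect("postgresql://user:password@localhost:5432/rag_db") as conn:
        # <=> is PGVector's cosine-distance operator; smaller means closer.
        return conn.execute(
            """
            SELECT content, embedding <=> %s::vector AS distance
            FROM chunks
            ORDER BY distance
            LIMIT %s
            """,
            (vector_literal, k),
        ).fetchall()
```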
Environment Configuration

```bash
# Point OpenAI-compatible clients to Ollama
OPENAI_API_KEY=ollama
OPENAI_BASE_URL=http://localhost:11434/v1
# Local models
LLM_CHOICE=mistral
EMBEDDING_MODEL=nomic-embed-text
# Local PostgreSQL
DATABASE_URL=postgresql://user:password@localhost:5432/rag_db
```
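With those variables set, anything built on the official `openai` client talks to Ollama instead of the cloud. A quick sanity check:

```python
from openai import OpenAI

# Picks up OPENAI_API_KEY and OPENAI_BASE_URL from the environment,
# so this request goes to http://localhost:11434/v1 (Ollama), not the cloud.
client = OpenAI()

resp = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Say 'local inference works'."}],
)
print(resp.choices[0].message.content)
```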
Performance Optimizations

- Database connection pooling (min_size=2, max_size=10)
- Embedding cache for frequently searched queries
- Token-by-token streaming for immediate feedback
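To make the first two concrete, here's a hedged sketch: a bounded `asyncpg` pool matching the settings above, plus a naive in-memory embedding cache. The helper names are mine, not the project's.

```python
import asyncpg
import ollama

# Bounded pool: connections are reused rather than opened per query.
async def make_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        "postgresql://user:password@localhost:5432/rag_db",
        min_size=2,
        max_size=10,
    )

# Naive cache: repeated queries skip the embedding model entirely.
_cache: dict[str, list[float]] = {}

def embed_cached(text: str) -> list[float]:
    if text not in _cache:
        _cache[text] = ollama.embeddings(
            model="nomic-embed-text", prompt=text
        )["embedding"]
    return _cache[text]
```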
Example Workflow
Here's what querying the RAG agent looks like in practice:
```
$ uv run python cli.py
> "What are the authentication requirements for the client portal?"
Searching knowledge base... [3 matches found]
Based on the security assessment documentation, the client portal requires:
1. Multi-factor authentication (MFA) for all user accounts
2. Session timeout after 15 minutes of inactivity
3. Password complexity: minimum 14 characters with special characters
Sources:
- security-assessment-2025.pdf (page 12)
- client-portal-requirements.md (section 3.2)
```

All of this happens locally — no data leaves the machine.
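The loop behind a session like that can be small. A minimal sketch, assuming an agent wired up as in the earlier example (streaming and citation formatting left out for brevity):

```python
import asyncio

from pydantic_ai import Agent

agent = Agent("openai:mistral")  # routed to local Ollama as configured above

async def main() -> None:
    while True:
        question = input("> ").strip()
        if question in {"exit", "quit"}:
            break
        # run() resolves tool calls (e.g. search_knowledge_base) before answering.
        result = await agent.run(question)
        print(result.output)  # `.data` in older PydanticAI releases

if __name__ == "__main__":
    asyncio.run(main())
```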
Trade-Offs and Limitations
Building a local RAG pipeline isn't without compromises. Several trade-offs are worth weighing:
- Hardware requirements: Need sufficient RAM and CPU (GPU recommended for larger models)
- Model capability: Local models may not match GPT-4 level reasoning
- Setup complexity: More initial configuration than API-based solutions
- Maintenance: You're responsible for updates, backups, and monitoring
For my threat model and use case, these trade-offs are acceptable. Your mileage may vary.
Getting Started
If you want to replicate this setup, the project is available on GitHub:
- Repository: rag-agent
- Documentation: https://n4igme.github.io/randscript/rag-agent/
Key prerequisites:
- Python 3.10+
- PostgreSQL with PGVector extension
- Ollama installed locally
- System libraries for audio processing (`libopus`, `opusfile`)
Helpful resources:
- Docling — Document processing pipeline (see the quickstart sketch below)
- PydanticAI — Agent framework
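For a feel of the ingestion step, Docling's basic flow looks like this; the filename is illustrative:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Handles PDF, Word, PowerPoint, Excel, and HTML inputs, entirely offline.
result = converter.convert("security-assessment-2025.pdf")
markdown = result.document.export_to_markdown()

# From here the markdown is chunked, embedded, and stored in PGVector.
print(markdown[:500])
```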
Final Thoughts
Building a local RAG pipeline taught me that privacy-preserving AI is entirely achievable today. The tools exist — Ollama, PGVector, Docling, PydanticAI — and they're mature enough for production use.
The question isn't whether local AI is capable. It's whether you're willing to invest the effort to keep your data under your control.
For me, that answer was clear: If the data shouldn't be public, it shouldn't leave my machine.