Use Postgres and pgvector for production-grade retrieval and RAG. Learn schema design, indexes, and architecture patterns that scale on a startup budget.

The first time someone suggested pgvector to me, I shrugged.

"Cute for prototypes. We'll use a real vector database in production."

A year later, we were still arguing about which vector database to standardize on… while our poor little Postgres cluster quietly powered most of the real workloads.

Here's the thing no one wants to admit in vendor decks: Postgres + pgvector is not a toy.

If you treat it like a first-class citizen — good schema, sane indexes, thoughtful architecture — it will happily serve millions of embeddings, sub-100ms queries, and RAG workloads on a very normal hardware bill.

Let's walk through how.

Why Postgres + pgvector Is More Serious Than It Looks

Many teams dismiss pgvector because they've seen it used like this:

  • single table, no index, 200k rows
  • cosine distance on CPU with no tuning
  • random VACUUM schedule

Of course it looks slow. That's not pgvector's fault; that's "demo-driven development."

In reality, Postgres gives you three unfair advantages:

  1. Mature reliability story — backups, replicas, monitoring… your SRE team already knows the playbook.
  2. Transactional semantics — you can keep embeddings in sync with the data they represent.
  3. Shared infra — you avoid spinning a whole new cluster and operational surface area just for vectors.

pgvector layers vector types and indexes on top of that. Treat it like any other Postgres extension: learn its knobs, design for your workload, and it will behave.

The Minimal Production Schema

Let's ground this in something concrete.

Imagine you're building retrieval for support articles and past tickets. A pragmatic schema:

CREATE TABLE documents (
  id           BIGSERIAL PRIMARY KEY,
  external_id  TEXT,
  source       TEXT,              -- 'kb', 'ticket', 'email'
  title        TEXT,
  body         TEXT,
  metadata     JSONB,
  created_at   TIMESTAMPTZ DEFAULT now()
);

-- chunked text for retrieval
CREATE TABLE document_chunks (
  id           BIGSERIAL PRIMARY KEY,
  document_id  BIGINT REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index  INT,
  content      TEXT,
  embedding    VECTOR(1536),      -- match your model dimension
  created_at   TIMESTAMPTZ DEFAULT now()
);

Two core ideas here:

  • Separate documents from chunks. You search chunks but display documents.
  • Store embeddings alongside the chunk. No extra "embeddings" microservice needed.
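This is also where the transactional advantage from earlier stops being abstract: when a document changes, you can swap out its chunks and embeddings atomically. A minimal sketch, assuming your pipeline re-chunks the whole document on update ($1 is the document id; $2 to $4 are per-chunk values your app binds):

-- Readers never observe a half-updated document:
-- old and new chunks swap in a single transaction.
BEGIN;
DELETE FROM document_chunks WHERE document_id = $1;
INSERT INTO document_chunks (document_id, chunk_index, content, embedding)
VALUES ($1, $2, $3, $4);  -- repeat per chunk
COMMIT;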

Then you add pgvector indexes. For example, an ivfflat index on cosine distance:

-- Enable extension once per database
CREATE EXTENSION IF NOT EXISTS vector;

-- Choose an index type: ivfflat is a good default
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- tweak based on dataset size

Yes, this is ready for production… as long as you understand what you just built.
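One caveat before you call it done: ivfflat derives its cluster centroids from the rows present at build time, so create the index after your backfill, not before. pgvector's docs suggest starting with lists around rows/1000 up to about a million rows, and sqrt(rows) beyond that. A sketch, assuming roughly one million chunks:

-- Build after loading data; CONCURRENTLY avoids blocking writes.
CREATE INDEX CONCURRENTLY idx_chunks_embedding
ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);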

How the Query Path Actually Works

A typical retrieval query looks like this:

-- $1 is the query embedding (vector)
SELECT  c.document_id,
        c.chunk_index,
        c.content,
        1 - (c.embedding <=> $1) AS similarity
FROM    document_chunks c
JOIN    documents d ON d.id = c.document_id
WHERE   d.source = 'kb'
ORDER BY c.embedding <=> $1
LIMIT 10;

A few details worth noting:

  • <-> is L2 distance and <=> is cosine distance; lower is closer. Order by the operator that matches your index's operator class (vector_cosine_ops means <=>), or Postgres will fall back to a sequential scan.
  • For cosine, 1 - distance gives a similarity score that's easier to rank and threshold.
  • Narrowing by filters (source, created_at, metadata->>'lang') is where Postgres shines; note the join, since source lives on documents in our schema. Normal B-tree indexes still work.
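A quick sketch of supporting indexes for those filters (the names are illustrative; adjust to your workload):

-- B-tree indexes for the join and common filters
CREATE INDEX idx_documents_source ON documents (source);
CREATE INDEX idx_chunks_doc_id    ON document_chunks (document_id);

-- Expression index for a JSONB key you filter on often
CREATE INDEX idx_documents_lang   ON documents ((metadata->>'lang'));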

From your app, you rarely hit this query directly. You wrap it in a retrieval service:

User query
   │
   ▼
+----------------+        +-------------------+
|  Embed Query   | -----> |  Postgres + pgvec |
+----------------+        +-------------------+
   │                        ▲
   ▼                        │
Rerank / format     document_chunks + documents
   │
   ▼
Return top-N passages

That's it. No separate vector DB, no extra persistence layer.
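In practice, each request the service sends to Postgres is small. A sketch, assuming $1 is the freshly embedded query; SET LOCAL scopes the tuning knob to a single transaction (more on probes in the tuning section below):

BEGIN;
SET LOCAL ivfflat.probes = 10;  -- per-request recall/latency trade-off
SELECT c.content,
       1 - (c.embedding <=> $1) AS similarity
FROM   document_chunks c
ORDER BY c.embedding <=> $1
LIMIT 10;
COMMIT;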

Architecture: "One Database, Two Faces"

A simple (and surprisingly scalable) pattern for startups is:

              +----------------+
              |  Application   |
              |  (API, UI)     |
              +--------+-------+
                       |
                       v
          +------------+-------------+
          | Retrieval / RAG Service  |
          +------------+-------------+
                       |
          +------------+-------------+
          |     Postgres Cluster     |
          |  OLTP + pgvector schema  |
          +--------------------------+

For heavier workloads, you evolve into:

Writes                           Reads
+------------------+    WAL     +--------------------+
|  Primary DB      | ---------> |  Read Replica(s)   |
|  (OLTP, ingest)  |            |  (Retrieval-heavy) |
+------------------+            +--------------------+
         ^                                ^
         |                                |
    ETL / streaming                  RAG / search

You offload vector-heavy queries to read replicas while keeping transactional updates on the primary. Postgres has had this pattern nailed for years; pgvector just rides along.

Performance Tuning Without Losing Your Weekend

You don't need a PhD to make pgvector fast, but you do need a bit of tuning.

1. Choose the right index type

  • ivfflat – a good default for large collections: approximate search, with a configurable recall-vs-speed trade-off.
  • hnsw (pgvector 0.5.0+) – great recall, often excellent latency, heavier index builds and writes (see the sketch below).
  • No index – acceptable only for under ~100k rows or offline jobs.
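If your pgvector is 0.5.0 or newer, trying hnsw is a one-liner. A sketch using pgvector's documented defaults:

CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);  -- pgvector defaults

SET hnsw.ef_search = 40;  -- query-time recall knob (default is 40)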

Rule of thumb: if your p95 query time is over 150 ms with ivfflat, consider:

  • increasing lists (more, smaller buckets, so each probe scans fewer rows)
  • if recall drops as a result, raising the probes parameter at query time:

SET ivfflat.probes = 10;  -- higher = better recall, slower queries

2. Keep embeddings dense, not massive

Don't store 4,096-dim embeddings "just because the model offered them." For many workloads, 768 or 1,536 dimensions are fine and significantly cheaper to compute and store.

3. Watch IO, not just CPU

Large embeddings mean big rows. Use:

  • TOAST-aware schema: keep text columns reasonable, offload raw HTML elsewhere.
  • Fillfactor/partitioning: especially if you're frequently deleting/refreshing chunks.

If your bottleneck becomes disk IO, a surprisingly effective trick is separate tablespaces (fast disk for document_chunks, cheaper disk for everything else).
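A hedged sketch of that trick; the mount path is illustrative, and note that most managed Postgres offerings won't let you create tablespaces:

CREATE TABLESPACE fast_nvme LOCATION '/mnt/nvme/pg';
-- Rewrites the table under an exclusive lock; schedule accordingly.
ALTER TABLE document_chunks SET TABLESPACE fast_nvme;
-- Indexes don't move with the table; relocate them explicitly:
ALTER INDEX idx_chunks_embedding SET TABLESPACE fast_nvme;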

A Quick Case Study: Startup Support Search

A small B2B SaaS team I know had:

  • ~3M support email threads
  • ~50k knowledge base articles
  • ~3 support engineers drowning in copy-paste

They considered a dedicated vector DB, but every option added new infra and new failure modes. Instead they:

  1. Added pgvector to their existing managed Postgres.
  2. Nightly ETL to chunk and embed new tickets + articles.
  3. A small retrieval API backing:
  • "similar tickets" in their support tool
  • AI-generated reply suggestions
  • a customer-facing semantic search bar

Hardware: one beefy primary, two read replicas. Results:

  • Median retrieval latency: ~70–90 ms.
  • Support handle time dropped because agents started from better context.
  • Total infra bill lower than the cost of most standalone managed vector DBs at that scale.

Did they eventually outgrow it? Maybe. But it carried them through two years of meaningful AI features with zero "we lost the vector cluster" incidents.

When Postgres + pgvector Is the Wrong Tool

Let's be honest: sometimes you do need a dedicated vector database. Reasons might include:

  • Billions of embeddings with strict low-latency SLAs.
  • Multi-tenant, cross-region deployments where per-tenant isolation is more important than reuse.
  • You want built-in hybrid search, document ingestion pipelines, and analytics features out of the box.

The point isn't "never use a vector DB." It's: don't start there if you already run Postgres and your problem fits comfortably within its envelope.

Getting Started Without Regretting It Later

If you're pgvector-curious, here's a practical starting checklist:

  1. Enable pgvector in a dev database and create a document_chunks table with real data.
  2. Run a backfill of embeddings for ~100k–500k chunks.
  3. Add an ivfflat index and measure (see the sketch after this list):
  • p50/p95 query latency
  • memory and CPU impact
  4. Wire a minimal retrieval API, put it behind a feature flag, and dogfood internally.
  5. Only after that, decide whether you truly need separate vector infrastructure.
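For step 3, a quick way to sanity-check both latency and index usage is to borrow an existing row's embedding as the query vector, so this sketch runs as-is:

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM   document_chunks
ORDER BY embedding <=> (SELECT embedding FROM document_chunks LIMIT 1)
LIMIT 10;

If the plan shows a sequential scan instead of an index scan, revisit the index's operator class and your lists/probes settings.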

You'll learn more from a week of realistic pgvector experiments than from a month of reading vendor benchmarks.

Closing: The Boring Stack That Wins

In a hype cycle, it's tempting to collect new databases like Pokémon.

But the stack that quietly wins is often the boring one:

  • Postgres
  • pgvector
  • A small retrieval service
  • Good indexes
  • Clear observability

If you're a startup or a team on a budget, you can ship serious retrieval and RAG features today using the database you already know how to run.

And if you are running Postgres + pgvector in production, I'd love to hear your war stories — good, bad, or "we accidentally stored 10x more dimensions than we needed."

👇 Drop a comment with your setup, follow for more pragmatic AI infra pieces, or share this with the teammate currently writing an RFP for three different vector databases.