I hit a wall trying to find one specific detail buried inside hundreds of PDFs. Even with filenames like Report_Final_v2_NEW_Latest.pdf, search tools were useless. They matched keywords, not meaning. And when you're trying to:

  • Extract insights from dense research papers
  • Summarize legal contracts
  • Compare product specs across versions

…you need more than keyword matching. You need understanding.

So I built something smarter: an AI assistant that reads, reasons, and responds. Here's how I did it, step by step.

1. Structuring the Project Like a Scalable System

Before writing any code, I mapped out the architecture. That decision helped me scale from 10 documents to 10,000 without chaos.

ai-doc-assistant/
├── ingest/
│   ├── extract_text.py
│   └── extract_images.py
├── process/
│   ├── chunk_text.py
│   └── embed_chunks.py
├── index/
│   └── vector_store.py
├── backend/
│   ├── qa_chain.py
│   └── server.py
└── interface/
    └── ui.py

Each module had one job. Ingest → Process → Index → Answer → Display. Simple, clean, and easy to maintain.

2. Extracting Text with PyMuPDF

Text is the foundation. I used PyMuPDF to extract it page by page, keeping metadata intact.

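import fitz  # PyMuPDF

def extract_text(file_path):
    """Return one record per page, keeping the file name and page number."""
    pages = []
    with fitz.open(file_path) as doc:  # closes the file when done
        for i, page in enumerate(doc):
            pages.append({
                "file": file_path,
                "page": i + 1,  # 1-based page number for citations
                "text": page.get_text(),
            })
    return pages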

This gave me full control over file names, page numbers, and selective inclusion.
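
Selective inclusion can be as simple as filtering the page records before they reach the chunker. Here is a minimal sketch, assuming the record shape returned by extract_text above. The helper name keep_useful_pages and the min_chars threshold are hypothetical, not part of the original project.

```python
def keep_useful_pages(pages, min_chars=20):
    """Drop pages whose extracted text is empty or too short to be useful.

    `pages` is a list of dicts shaped like the output of extract_text:
    {"file": ..., "page": ..., "text": ...}. The threshold is an assumption.
    """
    return [p for p in pages if len(p["text"].strip()) >= min_chars]


pages = [
    {"file": "manual.pdf", "page": 1, "text": "Warranty covers parts and labor for two years."},
    {"file": "manual.pdf", "page": 2, "text": "   \n"},  # e.g. a scanned page with no text layer
]
kept = keep_useful_pages(pages)
print([p["page"] for p in kept])
```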

3. Extracting Diagrams for Visual Context

Technical docs are full of diagrams that carry meaning. I extracted every embedded image and stored them for later embedding.

import os
import fitz  # PyMuPDF

def extract_images(pdf_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_index in range(len(doc)):
        images = doc[page_index].get_images(full=True)
        for img_index, img in enumerate(images):
            xref = img[0]  # first item of each entry is the image's xref
            base_image = doc.extract_image(xref)
            # Embedded images are not always PNG, so use the reported extension
            ext = base_image["ext"]
            image_filename = os.path.join(output_dir, f"{page_index}_{img_index}.{ext}")
            with open(image_filename, "wb") as img_file:
                img_file.write(base_image["image"])

Later, I embedded these using CLIP and stored them alongside the text.

4. Chunking Text for RAG-Friendly Context

LLMs have limited context windows, so they can't take in long documents whole. I split the text into overlapping chunks to preserve context.

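def chunk_text(text, size=500, overlap=100):
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # each chunk starts `step` characters after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]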

This ensured coherent answers, even across page boundaries.
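
One way to get chunks that span page boundaries while still knowing where each chunk came from is to join the page texts first and record the offset where each page starts. This is a sketch under assumptions, not the original implementation: chunk_pages is a hypothetical helper, and it assumes page records shaped like those from step 2.

```python
from bisect import bisect_right


def chunk_pages(pages, size=500, overlap=100):
    """Chunk the joined text of all pages, tagging each chunk with its start page."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")

    # Record the character offset at which each page begins in the joined text.
    starts, offset = [], 0
    for p in pages:
        starts.append(offset)
        offset += len(p["text"])
    full_text = "".join(p["text"] for p in pages)

    chunks = []
    for i in range(0, len(full_text), size - overlap):
        # The last page whose start offset is <= i contains position i.
        page = pages[bisect_right(starts, i) - 1]
        chunks.append({"file": page["file"], "page": page["page"],
                       "text": full_text[i:i + size]})
    return chunks


pages = [
    {"file": "spec.pdf", "page": 1, "text": "a" * 450},
    {"file": "spec.pdf", "page": 2, "text": "b" * 450},
]
chunks = chunk_pages(pages)
# The step is 500 - 100 = 400, so chunks start at offsets 0, 400, and 800.
print([(c["page"], len(c["text"])) for c in chunks])
```

With the defaults, the second chunk starts on page 1 and runs into page 2, so a sentence split by a page break still lands inside a single chunk.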

5. Embedding Text with Sentence Transformers

I converted chunks into semantic vectors using sentence-transformers.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def embed_chunks(chunks):
    return model.encode(chunks)

Now I could retrieve meaning, not just keywords.

6. Fast Vector Search with FAISS

I used FAISS to store and search embeddings efficiently.

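import faiss
import numpy as np

def build_index(embeddings):
    # FAISS expects a contiguous float32 array of shape (n_vectors, dim)
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)  # exact (brute-force) L2 search
    index.add(embeddings)
    return index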

This let me search across 10,000+ documents in under a second.
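
A FAISS index only returns row positions and distances, so the chunk metadata has to live in a parallel list kept in the same order the vectors were added. The sketch below shows that lookup step in isolation. hits_to_records is a hypothetical helper. It assumes `distances` and `ids` are the first rows of the two arrays returned by `index.search(query, k)`, and it relies on FAISS padding missing results with -1.

```python
def hits_to_records(distances, ids, records):
    """Map one row of FAISS search output back to chunk records.

    `records[i]` must describe the vector added to the index at position i.
    """
    hits = []
    for dist, idx in zip(distances, ids):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors are found
            continue
        hits.append({**records[int(idx)], "distance": float(dist)})
    return hits


records = [
    {"file": "manual.pdf", "page": 3, "text": "Hold the power button to reboot."},
    {"file": "manual.pdf", "page": 7, "text": "The warranty lasts two years."},
]
# Stand-in values for D[0] and I[0] from `D, I = index.search(query, 3)`.
hits = hits_to_records([0.12, 0.87, 3.4e38], [1, 0, -1], records)
print([(h["page"], h["distance"]) for h in hits])
```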

7. Multimodal Search with CLIP

To make the assistant truly multimodal, I embedded diagrams using CLIP.

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path):
    image = Image.open(image_path).convert("RGB")  # normalize the mode for the processor
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_features = clip_model.get_image_features(**inputs)
    return image_features.squeeze().numpy()

Now it could answer queries like "show me the reboot diagram."
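
For a text query to find a diagram, the query has to be embedded with CLIP's text encoder (in transformers, `CLIPModel.get_text_features`), not with the MiniLM model from step 5. The two models produce vectors of different sizes (384 vs. 512 dimensions) in unrelated spaces, so image vectors need their own index. Once both sides are CLIP vectors, ranking reduces to cosine similarity. Below is a minimal pure-Python sketch with toy 3-dimensional vectors standing in for real embeddings; rank_images and the file names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(query_vec, image_vecs, top_k=3):
    """Return (path, score) pairs for the images most similar to the query."""
    scored = [(path, cosine(query_vec, vec)) for path, vec in image_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy vectors; real CLIP ViT-B/32 embeddings have 512 dimensions.
image_vecs = {
    "images/4_0.png": [0.9, 0.1, 0.0],
    "images/9_1.png": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.1]  # stand-in for the embedded query text
best = rank_images(query_vec, image_vecs, top_k=1)
print(best[0][0])
```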

8. Retrieval + Reasoning with LangChain

Once I retrieved the top documents, I passed them to an LLM using LangChain.

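from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.llms import Ollama
from langchain.embeddings import HuggingFaceEmbeddings
# LangChain needs its own embeddings wrapper, not the raw SentenceTransformer
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# load_local returns a vector store; as_retriever() adapts it for the chain
vector_store = FAISS.load_local("index", embeddings)
retriever = vector_store.as_retriever()
llm = Ollama(model="mistral")
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)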

Query it:

qa.run("What's the warranty coverage of Product X?")

Boom, human-like answers with citations.

9. Citing Sources for Trust

Every answer included traceable references.

def format_response(answer, docs):
    refs = "\n".join([f"{doc.metadata['file']} (p{doc.metadata['page']})" for doc in docs])
    return f"{answer}\n\nSources:\n{refs}"

It felt more like a lawyer than a chatbot.
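
When several retrieved chunks come from the same page, the naive list repeats that reference. Here is a small sketch of a deduplicating variant, assuming each retrieved document carries `file` and `page` in its metadata as in the function above. Doc is a stand-in for LangChain's document class so the example runs on its own, and format_response_unique is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document with metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_response_unique(answer, docs):
    """Append each (file, page) source once, in retrieval order."""
    seen, refs = set(), []
    for doc in docs:
        key = (doc.metadata["file"], doc.metadata["page"])
        if key not in seen:
            seen.add(key)
            refs.append(f"{key[0]} (p{key[1]})")
    return f"{answer}\n\nSources:\n" + "\n".join(refs)


docs = [
    Doc("Covers parts for two years.", {"file": "warranty.pdf", "page": 2}),
    Doc("Labor is covered for one year.", {"file": "warranty.pdf", "page": 2}),
    Doc("Exclusions are listed here.", {"file": "warranty.pdf", "page": 5}),
]
text = format_response_unique("Two years for parts.", docs)
print(text)
```

As far as I know, in the LangChain versions that expose these import paths, the chain only hands back the retrieved documents when it is built with `return_source_documents=True` and called as `qa({"query": ...})` instead of `qa.run(...)`. Treat that as something to verify against the version you have installed.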

10. Local API with FastAPI

I wrapped the whole pipeline in a FastAPI backend.

from fastapi import FastAPI
from pydantic import BaseModel
class Query(BaseModel):
    question: str
app = FastAPI()
@app.post("/ask")
def ask(query: Query):
    response = qa.run(query.question)
    return {"answer": response}

Now it could plug into anything: Slack, web apps, even voice.
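
Any client that can send JSON over HTTP can call the endpoint. Here is a sketch using only the standard library, assuming the API is served at http://localhost:8000. The URL and both helper names are assumptions.

```python
import json
import urllib.request


def build_ask_request(question, base_url="http://localhost:8000"):
    """Build a POST request for the /ask endpoint with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000"):
    """Send the question and return the "answer" field of the response."""
    with urllib.request.urlopen(build_ask_request(question, base_url)) as resp:
        return json.loads(resp.read())["answer"]


# Building the request does not need a running server; calling ask() does.
req = build_ask_request("What's the warranty coverage of Product X?")
print(req.get_method(), req.full_url)
```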

11. Web UI with Gradio

To demo it, I built a simple Gradio interface.

import gradio as gr
def answer_question(q):
    return qa.run(q)
gr.Interface(fn=answer_question, inputs="text", outputs="text", title="AI PDF Assistant").launch()

Clients asked real questions and the bot nailed it.

12. Live File Uploads

Users could drop new PDFs into the UI, and they were instantly processed.

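import os
from fastapi import UploadFile

@app.post("/upload")
async def upload_pdf(file: UploadFile):
    # Keep only the base name so a crafted filename can't escape ./docs
    save_path = os.path.join("./docs", os.path.basename(file.filename))
    with open(save_path, "wb") as f:
        f.write(await file.read())
    # Extract, chunk, embed, and update FAISS index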

New docs = new knowledge. No manual updates.
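
The last comment in the handler hides the part that does the work. Here is one way to wire it, sketched with the pipeline stages passed in as plain callables so the example runs without any of the libraries. ingest_document and the toy stages are hypothetical; in the real service the final step would also add the new vectors to the FAISS index.

```python
def ingest_document(path, extract, chunk, embed, records, vectors):
    """Run extract -> chunk -> embed for one file and append the results.

    `records` and `vectors` are parallel lists: records[i] describes vectors[i].
    Returns the number of chunks added.
    """
    added = 0
    for page in extract(path):
        for piece in chunk(page["text"]):
            records.append({"file": page["file"], "page": page["page"], "text": piece})
            vectors.append(embed(piece))
            added += 1
    return added


# Toy stages standing in for extract_text, chunk_text, and the embedding model.
def fake_extract(path):
    return [{"file": path, "page": 1, "text": "abcdefghij"}]


def fake_chunk(text):
    return [text[i:i + 4] for i in range(0, len(text), 4)]


def fake_embed(piece):
    return [float(len(piece))]


records, vectors = [], []
count = ingest_document("docs/new.pdf", fake_extract, fake_chunk, fake_embed, records, vectors)
print(count, [r["text"] for r in records])
```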

13. Local Deployment with Docker + Ollama

I containerized everything for local deployment.

FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "backend.server:app", "--host", "0.0.0.0", "--port", "8000"]

Now it runs on any laptop, with no cloud dependency.

14. Long-Term Memory with SQLite

I added memory to track interactions.

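import sqlite3
def save_interaction(question, answer):
    conn = sqlite3.connect("history.db")
    try:
        # Create the table on first use so the insert works on a fresh database
        conn.execute("CREATE TABLE IF NOT EXISTS log (q TEXT, a TEXT)")
        conn.execute("INSERT INTO log (q, a) VALUES (?, ?)", (question, answer))
        conn.commit()
    finally:
        conn.close()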

Later, I used this to improve responses and retrain the model.
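
Reading the history back is the other half of this. Here is a self-contained sketch using an in-memory database so it can be run as-is. The `log` table with columns `q` and `a` follows the insert statement above, while passing the connection in and the recent_interactions helper are assumptions.

```python
import sqlite3


def init_db(conn):
    """Create the log table if it does not exist yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS log (q TEXT, a TEXT)")


def save_interaction(conn, question, answer):
    """Store one question/answer pair."""
    conn.execute("INSERT INTO log (q, a) VALUES (?, ?)", (question, answer))
    conn.commit()


def recent_interactions(conn, limit=5):
    """Return the most recent (question, answer) pairs, newest first."""
    # rowid is SQLite's implicit, increasing row identifier for ordinary tables.
    rows = conn.execute("SELECT q, a FROM log ORDER BY rowid DESC LIMIT ?", (limit,))
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
init_db(conn)
save_interaction(conn, "What is the warranty?", "Two years.")
save_interaction(conn, "How do I reboot?", "Hold the power button.")
history = recent_interactions(conn, limit=1)
print(history)
conn.close()
```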

Final Shot: From Search to Understanding

This wasn't just search. It was reasoning across documents, diagrams, and metadata.

It could:

  • Compare product specs
  • Summarize manuals
  • Find visuals from paragraphs
  • Deliver personalized answers

I didn't just build a bot. I built an AI teammate.
