I hit a wall trying to find one specific detail buried inside hundreds of PDFs.
Even with filenames like Report_Final_v2_NEW_Latest.pdf, search tools were useless. They matched keywords, not meaning. And when you're trying to:
- Extract insights from dense research papers
- Summarize legal contracts
- Compare product specs across versions
…you need more than keyword matching. You need understanding.
So I built something smarter: an AI assistant that reads, reasons, and responds. Here's how I did it, step by step.
1. Structuring the Project Like a Scalable System
Before writing any code, I mapped out the architecture. That decision helped me scale from 10 documents to 10,000 without chaos.
ai-doc-assistant/
├── ingest/
│   ├── extract_text.py
│   └── extract_images.py
├── process/
│   ├── chunk_text.py
│   └── embed_chunks.py
├── index/
│   └── vector_store.py
├── backend/
│   ├── qa_chain.py
│   └── server.py
└── interface/
    └── ui.py

Each module had one job. Ingest → Process → Index → Answer → Display. Simple, clean, and easy to maintain.
2. Extracting Text with PyMuPDF
Text is the foundation. I used PyMuPDF to extract it page by page, keeping metadata intact.
import fitz  # PyMuPDF

def extract_text(file_path):
    # Collect the text of every page together with its source metadata
    doc = fitz.open(file_path)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text()
        pages.append({
            "file": file_path,
            "page": i + 1,  # 1-based page number
            "text": text
        })
    return pages

This gave me full control: file names, page numbers, and selective inclusion.
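The "selective inclusion" mentioned above can be sketched as a filter over the page records. This is a minimal illustration rather than the code used in the project: `select_pages` and its `min_chars` threshold are hypothetical, and the sample records are made up to match the shape returned by `extract_text`.

```python
def select_pages(pages, min_chars=20):
    """Keep page records whose text, after stripping surrounding whitespace,
    has at least min_chars characters.

    `pages` is a list of dicts shaped like the output of extract_text:
    {"file": ..., "page": ..., "text": ...}. The threshold is an assumption.
    """
    return [p for p in pages if len(p["text"].strip()) >= min_chars]


# Made-up records for illustration only
sample = [
    {"file": "manual.pdf", "page": 1, "text": "Warranty terms apply for two years."},
    {"file": "manual.pdf", "page": 2, "text": "   \n"},  # blank or image-only page
]
kept = select_pages(sample)
print([p["page"] for p in kept])
```

Because every record carries its file and page, filtering at this stage does not lose the information needed for citations later.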
3. Extracting Diagrams for Visual Context
Technical docs are full of diagrams that carry meaning. I extracted every embedded image and stored them for later embedding.
import os
import fitz  # PyMuPDF

def extract_images(pdf_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_index in range(len(doc)):
        images = doc[page_index].get_images(full=True)
        for img_index, img in enumerate(images):
            xref = img[0]  # first item of each entry is the image's xref
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            # Use the image's real format instead of assuming PNG
            ext = base_image["ext"]
            image_filename = f"{output_dir}/{page_index}_{img_index}.{ext}"
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)

Later, I embedded these using CLIP and stored them alongside the text.
4. Chunking Text for RAG-Friendly Context
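LLMs have a limited context window, so they can't take in full documents at once. I split the text into chunks with overlap to preserve context.

def chunk_text(text, size=500, overlap=100):
    # size and overlap are measured in characters
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for i in range(0, len(text), step):
        chunk = text[i:i + size]
        chunks.append(chunk)
    return chunks

This helped keep answers coherent, even across page boundaries.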
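A small worked example makes the overlap visible. The function below mirrors the chunking logic from this step and is repeated only so the snippet runs on its own; the tiny `size` and `overlap` values are chosen for readability, not as recommended settings.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into character windows of `size`, each starting
    `size - overlap` characters after the previous one."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [text[i:i + size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghijklmnopqrst", size=8, overlap=3)
for c in chunks:
    print(c)
# abcdefgh
# fghijklm
# klmnopqr
# pqrst
```

Each chunk starts with the last three characters of the previous one. With realistic settings, that repetition is what lets a short sentence cut at a boundary still appear whole in the next chunk. The final chunk is shorter than `size`, which is worth keeping in mind if very short tails hurt retrieval.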
5. Embedding Text with Sentence Transformers
I converted chunks into semantic vectors using sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_chunks(chunks):
    # Returns one embedding vector per chunk
    return model.encode(chunks)

Now I could retrieve by meaning, not just keywords.
6. Fast Vector Search with FAISS
I used FAISS to store and search embeddings efficiently.
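import faiss
import numpy as np

def build_index(embeddings):
    # FAISS expects a contiguous float32 array of shape (n, dim)
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)  # exact, brute-force L2 search
    index.add(embeddings)
    return index

This let me search across 10,000+ documents in under a second.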
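To show what a flat L2 index does conceptually, here is a plain-Python sketch of exhaustive nearest-neighbor search by squared Euclidean distance. It is not FAISS and has none of its speed; `search_flat_l2` and the toy vectors are hypothetical, made up for illustration.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_flat_l2(vectors, query, k=2):
    """Compare the query against every stored vector and return the k closest.

    Returns a list of (distance, position) pairs, nearest first.
    """
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
results = search_flat_l2(stored, [1.0, 0.0, 0.0], k=2)
print(results)
```

The positions returned correspond to the order in which vectors were added, which is how a search hit gets mapped back to its chunk text and page metadata.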
7. Multimodal Search with CLIP
To make the assistant truly multimodal, I embedded diagrams using CLIP.
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_path):
    # Convert to RGB so grayscale or palette images are handled consistently
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        image_features = clip_model.get_image_features(**inputs)
    return image_features.squeeze().numpy()

Now it could answer queries like "show me the reboot diagram."
8. Retrieval + Reasoning with LangChain
Once I retrieved the top documents, I passed them to an LLM using LangChain.
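from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import Ollama

# Newer LangChain releases move these integrations to langchain_community
# load_local needs a LangChain embeddings object, not a raw SentenceTransformer
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# "index" is a folder written by save_local(); as_retriever() adapts the store for the chain
retriever = FAISS.load_local("index", embeddings).as_retriever()
llm = Ollama(model="mistral")
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

Query it:

qa.run("What's the warranty coverage of Product X?")

Boom, human-like answers. Citations come next.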
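Conceptually, a retrieval QA chain of this kind places the retrieved chunks into the prompt ahead of the question. The sketch below shows that idea in plain Python; `build_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_prompt(question, chunks):
    """Assemble one prompt string from retrieved chunks and the user question."""
    # Number the chunks so the model (and the reader) can tell them apart
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved chunks for illustration
prompt = build_prompt(
    "What's the warranty coverage of Product X?",
    ["Product X is covered for 24 months.", "Coverage excludes water damage."],
)
print(prompt)
```

Seeing the prompt this way also explains why chunk size matters: everything retrieved has to fit in the model's context window alongside the question.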
9. Citing Sources for Trust
Every answer included traceable references.
def format_response(answer, docs):
    # Each doc is expected to carry the "file" and "page" metadata set at extraction time
    refs = "\n".join([f"{doc.metadata['file']} (p{doc.metadata['page']})" for doc in docs])
    return f"{answer}\n\nSources:\n{refs}"

It felt more like a lawyer than a chatbot.
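A quick way to try the formatter without a retriever is to pass it stand-in documents. In the sketch below, `FakeDoc` is a hypothetical stand-in for the retrieved document objects (anything with a `metadata` dict works), and the de-duplication of repeated references is an addition, not part of the original function.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document with file/page metadata."""
    metadata: dict = field(default_factory=dict)


def format_response(answer, docs):
    """Append a de-duplicated source list, preserving retrieval order."""
    refs = []
    for doc in docs:
        ref = f"{doc.metadata['file']} (p{doc.metadata['page']})"
        if ref not in refs:  # two chunks from the same page are cited once
            refs.append(ref)
    return f"{answer}\n\nSources:\n" + "\n".join(refs)


docs = [
    FakeDoc({"file": "manual.pdf", "page": 12}),
    FakeDoc({"file": "manual.pdf", "page": 12}),
    FakeDoc({"file": "specs.pdf", "page": 3}),
]
formatted = format_response("Coverage lasts 24 months.", docs)
print(formatted)
```

De-duplicating matters with overlapping chunks, since two neighboring chunks from the same page are often retrieved together.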
10. Local API with FastAPI
I wrapped the whole pipeline in a FastAPI backend.
from fastapi import FastAPI
from pydantic import BaseModel

class Query(BaseModel):
    question: str

app = FastAPI()

@app.post("/ask")
def ask(query: Query):
    # qa is the RetrievalQA chain built in step 8
    response = qa.run(query.question)
    return {"answer": response}

Now it could plug into anything: Slack, web apps, even voice.
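For reference, a client call to this endpoint could look like the sketch below, using only the standard library. The base URL `http://localhost:8000` is an assumption based on the port used in the deployment step, and `build_ask_request` is a hypothetical helper. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_ask_request(question, base_url="http://localhost:8000"):
    """Build a POST request for /ask with the JSON body the Query model expects."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ask_request("What's the warranty coverage of Product X?")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["answer"])
```

Any integration, whether a Slack bot or a web front end, only needs to produce this same JSON body with a `question` field.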
11. Web UI with Gradio
To demo it, I built a simple Gradio interface.
import gradio as gr

def answer_question(q):
    return qa.run(q)

gr.Interface(fn=answer_question, inputs="text", outputs="text", title="AI PDF Assistant").launch()

Clients asked real questions, and the bot nailed it.
12. Live File Uploads
Users could drop new PDFs into the UI, and they were instantly processed.
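import os
from fastapi import UploadFile

@app.post("/upload")
async def upload_pdf(file: UploadFile):
    # Keep only the base name so a crafted filename cannot escape ./docs
    filename = os.path.basename(file.filename)
    save_path = f"./docs/{filename}"
    with open(save_path, "wb") as f:
        f.write(await file.read())
    # Extract, chunk, embed, and update FAISS index
    return {"saved": save_path}

New docs = new knowledge. No manual updates.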
13. Local Deployment with Docker + Ollama
I containerized everything for local deployment.
FROM python:3.10
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["uvicorn", "backend.server:app", "--host", "0.0.0.0", "--port", "8000"]

Now it runs on any laptop, with no cloud dependency.
14. Long-Term Memory with SQLite
I added memory to track interactions.
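import sqlite3

def save_interaction(question, answer):
    conn = sqlite3.connect("history.db")
    # Create the log table on first use so the INSERT cannot fail
    conn.execute("CREATE TABLE IF NOT EXISTS log (q TEXT, a TEXT)")
    conn.execute("INSERT INTO log (q, a) VALUES (?, ?)", (question, answer))
    conn.commit()
    conn.close()

Later, I used this to improve responses and retrain the model.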
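Using the log later means reading it back. The sketch below is one way to do that, under a few assumptions: it uses an in-memory database so it runs anywhere, it passes the connection in explicitly instead of opening `history.db`, and `recent_interactions` is a hypothetical helper.

```python
import sqlite3


def save_interaction(conn, question, answer):
    """Insert one question/answer pair, creating the table if needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS log (q TEXT, a TEXT)")
    conn.execute("INSERT INTO log (q, a) VALUES (?, ?)", (question, answer))
    conn.commit()


def recent_interactions(conn, limit=5):
    """Return the most recent (question, answer) pairs, newest first."""
    cur = conn.execute(
        "SELECT q, a FROM log ORDER BY rowid DESC LIMIT ?", (limit,)
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
save_interaction(conn, "What is covered?", "Parts and labor.")
save_interaction(conn, "For how long?", "24 months.")
history = recent_interactions(conn, limit=1)
print(history)
conn.close()
```

Ordering by `rowid` works here because rows are only ever appended; a timestamp column would be a more explicit choice if rows are ever deleted or rewritten.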
Final Shot: From Search to Understanding
This wasn't just search. It was reasoning across documents, diagrams, and metadata.
It could:
- Compare product specs
- Summarize manuals
- Find visuals from paragraphs
- Deliver personalized answers
I didn't just build a bot. I built an AI teammate.