From Files to AI-Ready Data (Upload & Text Extraction)
Yesterday I realized something uncomfortable:
My AI app was smart… but it knew nothing about my data.
It couldn't read PDFs. It couldn't answer questions about documents. It was basically… disconnected.
Today's Mission
Take a file → convert it into text → make it usable for AI.

Because:
AI does NOT understand files ❌
AI understands TEXT ✅

Step 1 — File Upload System
We implemented a backend API:
import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

// Expects upload middleware (e.g. multer) to populate req.file
export const uploadFile = async (req, res) => {
  try {
    const file = req.file;
    if (!file) {
      return res.status(400).json({ message: "No file uploaded" });
    }

    let extractedText = "";

    if (file.mimetype === "application/pdf") {
      // pdfjs expects a Uint8Array, not a Node Buffer
      const loadingTask = pdfjsLib.getDocument({
        data: new Uint8Array(file.buffer),
        useWorkerFetch: false,
        isEvalSupported: false,
        useSystemFonts: true,
      });
      const pdf = await loadingTask.promise;

      // Extract text page by page
      let textContent = "";
      for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();
        const pageText = content.items
          .map((item) => item.str)
          .join(" ");
        textContent += pageText + "\n";
      }
      extractedText = textContent;
    } else {
      // Plain-text files: decode the buffer directly
      extractedText = file.buffer.toString("utf-8");
    }

    res.status(200).json({
      message: "File processed successfully",
      text: extractedText,
    });
  } catch (error) {
    console.error("PDF ERROR:", error);
    res.status(500).json({
      message: "Error processing file",
    });
  }
};

What this controller does
Receives file → converts buffer → extracts text → returns usable content

Step 2 — The Real Work: Text Extraction
This is the most important part:
PDF → binary → text

This step was NOT smooth.
❌ Attempt 1 — pdf-parse
Problem:
- ESM vs CommonJS conflict
- import/export issues
- require not allowed

👉 Result:
Failed ❌

❌ Attempt 2 — Workarounds
Tried:
- default import fixes
- dynamic imports
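For reference, the dynamic-import shape we tried looks like this (illustrated here with a built-in module, since `pdf-parse` itself kept failing):

```javascript
// A dynamic import() returns a promise and works from both CommonJS and ESM,
// unlike the static `import ... from` syntax, which is ESM-only.
const mod = await import("node:path");

// Depending on how a package is bundled, what you want may live on
// mod.default or on the namespace object itself - this mismatch is the
// source of "does not provide a default export" errors.
const path = mod.default ?? mod;
console.log(path.join("uploads", "doc.pdf"));
```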
👉 New errors:
"does not provide default export"
"require is not defined"

❌ Attempt 3 — More Errors
Even when partially working:
- Worker issues
- Binary format issues
- Node compatibility issues

Key Learning
Not all libraries play nicely with modern Node.js (ESM).

Final Solution — pdfjs-dist
We switched approach:
Use pdfjs-dist (Mozilla's PDF engine)

Why it worked
+----------------------+--------------------------------------+
| Feature | Benefit |
+----------------------+--------------------------------------+
| ESM support | Works with modern Node |
| Stable | Used in browsers |
| Page-level parsing | Better control |
+----------------------+--------------------------------------+

⚠️ But even here… challenges 😄
❌ Error 1 — Module not found
pdf.js → changed to pdf.mjs

❌ Error 2 — Buffer issue
Expected Uint8Array, got Buffer

👉 Fix:
new Uint8Array(file.buffer)

❌ Error 3 — Worker crash
"Only URLs with file/data/node supported"

👉 Cause:
Worker was trying to load from a CDN ❌

👉 Fix:
Disable the worker in the Node environment ✅

Final Working Flow
PDF → Uint8Array → pdfjs → extract text → done ✅

🧩 Frontend (Basic Attachment UI)
I also added a simple UI:
- Attach file 📎
- Show file name
- Remove file option
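The attach-and-upload flow can be sketched in plain browser JavaScript. The `/api/upload` endpoint and the `file` field name are assumptions for this sketch; they just need to match what the backend route expects:

```javascript
// Minimal attachment flow: pick a file, send it, keep the extracted text.
// "/api/upload" and the "file" field name are assumptions for this sketch.
let extractedText = "";

async function handleAttach(file) {
  const formData = new FormData();
  formData.append("file", file); // must match the server's expected field name

  const res = await fetch("/api/upload", { method: "POST", body: formData });
  const data = await res.json();

  // Keep the extracted text around for later steps
  // (it is not yet fed into AI responses).
  extractedText = data.text;
  console.log(`Attached: ${file.name}`);
}
```

The real app uses its own components for showing the file name and the remove option; this only illustrates the data path.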
⚠️ Important Note
This UI is currently just for display purposes.

👉 Meaning:
File is uploaded + text extracted
BUT not yet used in AI responses ❌

Why this matters
Right now we have:
Data ingestion layer ✅

But not yet:
Intelligence layer ❌

🚀 Architecture So Far
User uploads file
↓
Backend extracts text
↓
Frontend stores text
↓
Ready for next step ✅

🧠 Key Learnings
- Files are just containers
- Text is what AI understands
- Understood how messy real-world integrations can be 😄
🔗 Live Demo & Code
👉 Live App: https://ai-chat-app-learning.netlify.app
👉 GitHub Repo: https://github.com/RohitKuwar/ai-chat-app/tree/feature/file-upload-api
🚀 What's Next
We now have raw text…
But AI still can't use it effectively.
Tomorrow I'll cover: Chunking — breaking large text into small, meaningful pieces
This is where RAG actually begins.
I'll be posting here daily as I learn. Let's grow together and stay ahead in the AI era 🚀
Happy learning ✌️