From Files to AI-Ready Data (Upload & Text Extraction)

Yesterday I realized something uncomfortable:

My AI app was smart… but it knew nothing about my data.

It couldn't read PDFs. It couldn't answer questions about documents. It was basically… disconnected.

Today's Mission

Take a file → convert it into text → make it usable for AI

Because:

AI does NOT understand files ❌  
AI understands TEXT ✅

Step 1 — File Upload System

We implemented a backend upload endpoint:


import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

export const uploadFile = async (req, res) => {
  try {
    // `req.file` is populated upstream by an upload middleware
    // such as multer with memory storage (so `file.buffer` exists)
    const file = req.file;

    if (!file) {
      return res.status(400).json({ message: "No file uploaded" });
    }

    let extractedText = "";

    if (file.mimetype === "application/pdf") {
      const loadingTask = pdfjsLib.getDocument({
        data: new Uint8Array(file.buffer),
        useWorkerFetch: false,
        isEvalSupported: false,
        useSystemFonts: true,
      });

      const pdf = await loadingTask.promise;

      let textContent = "";

      for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const content = await page.getTextContent();

        const pageText = content.items
          .map(item => item.str)
          .join(" ");

        textContent += pageText + "\n";
      }

      extractedText = textContent;
    } else {
      // Non-PDF files: treat the raw bytes as UTF-8 text
      extractedText = file.buffer.toString("utf8");
    }

    res.status(200).json({
      message: "File processed successfully",
      text: extractedText,
    });

  } catch (error) {
    console.error("PDF ERROR:", error);
    res.status(500).json({
      message: "Error processing file",
    });
  }
};

What this controller does

Receives file → converts buffer → extracts text → returns usable content
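That same flow can be pulled out into a small reusable helper. A sketch under my own naming (`extractText` is hypothetical, not part of the app), with pdfjs-dist loaded lazily so non-PDF files never touch it:

```javascript
// Sketch: route a raw upload buffer to the right extraction strategy.
// `extractText` is a hypothetical helper name, not part of the app yet.
async function extractText(buffer, mimetype) {
  if (mimetype === "application/pdf") {
    // Lazy import: only PDF uploads pay the cost of loading pdfjs-dist.
    const pdfjsLib = await import("pdfjs-dist/legacy/build/pdf.mjs");
    const pdf = await pdfjsLib.getDocument({
      data: new Uint8Array(buffer), // pdfjs wants a plain Uint8Array
      useWorkerFetch: false,
      isEvalSupported: false,
      useSystemFonts: true,
    }).promise;

    let text = "";
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const content = await page.getTextContent();
      text += content.items.map((item) => item.str).join(" ") + "\n";
    }
    return text;
  }

  // Plain text, markdown, CSV, etc. — the bytes already are the text.
  return buffer.toString("utf8");
}
```

The controller then shrinks to "call helper, return JSON", which also makes the extraction logic testable on its own.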

Step 2 — The Real Work: Text Extraction

This is the most important part:

PDF → binary → text

This step was NOT smooth.

❌ Attempt 1 — pdf-parse

Problem:
- pdf-parse ships as CommonJS, but our project uses ESM
- import/export mismatches everywhere
- require is not available inside ES modules

👉 Result:

Failed ❌

❌ Attempt 2 — Workarounds

Tried:

  • default import fixes
  • dynamic imports

👉 New errors:

"does not provide default export"
"require is not defined"
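For context, the interop dance we were attempting looks like this: a CommonJS module loaded via dynamic `import()` exposes its exports under `default`. Sketched with a built-in module so it actually runs (pdf-parse itself kept failing for us):

```javascript
// CommonJS modules loaded via dynamic import() expose module.exports
// as the `default` property of the resulting namespace object.
async function loadCjs(name) {
  const ns = await import(name);
  // Fall back to the namespace itself for genuine ESM packages.
  return ns.default ?? ns;
}

// Demonstrated with Node's built-in (CommonJS) "path" module:
loadCjs("node:path").then((path) => {
  console.log(typeof path.join); // "function"
});
```

This pattern works for well-behaved CJS packages — pdf-parse's packaging quirks were the real blocker, not the pattern.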

❌ Attempt 3 — More Errors

Even when partially working:

Worker issues  
Binary format issues  
Node compatibility issues

Key Learning

Not all libraries play nicely with modern Node.js (ESM)

Final Solution — pdfjs-dist

We switched approach:

Use pdfjs-dist (Mozilla PDF engine)

Why it worked

+----------------------+--------------------------------------+
| Feature              | Benefit                              |
+----------------------+--------------------------------------+
| ESM support          | Works with modern Node               |
| Battle-tested        | Same engine as Firefox's PDF viewer   |
| Page-level parsing   | Better control                       |
+----------------------+--------------------------------------+

⚠️ But even here… challenges 😄

❌ Error 1 — Module not found

The ESM entry point is pdf.mjs, not pdf.js — so the import became "pdfjs-dist/legacy/build/pdf.mjs"

❌ Error 2 — Buffer issue

Expected Uint8Array, got Buffer

👉 Fix:

new Uint8Array(file.buffer)
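Why the wrapping matters: recent pdfjs-dist versions insist on a plain Uint8Array rather than a Node Buffer, and there's a classic Node gotcha to dodge while converting:

```javascript
// A Node Buffer is a Uint8Array subclass, but pdfjs-dist (recent
// versions) wants a plain Uint8Array, so we copy the bytes into one.
const nodeBuffer = Buffer.from("%PDF-1.7 example bytes");

// Correct: copies exactly nodeBuffer.length bytes.
const safe = new Uint8Array(nodeBuffer);

// Gotcha: small Buffers share a pooled ArrayBuffer, so wrapping
// nodeBuffer.buffer directly can expose thousands of unrelated bytes.
const risky = new Uint8Array(nodeBuffer.buffer);

console.log(safe.length === nodeBuffer.length); // true
console.log(risky.length >= safe.length);       // true (often 8192)
```

So `new Uint8Array(file.buffer)` (copy the Buffer itself) is the safe form — never `new Uint8Array(file.buffer.buffer)`.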

❌ Error 3 — Worker crash

"Only URLs with file/data/node supported"

👉 Cause:

Worker was trying to load from CDN ❌

👉 Fix:

Skip the external worker in Node — pdfjs falls back to an in-process "fake worker", which is fine server-side ✅

Final Working Flow

PDF → Uint8Array → pdfjs → extract text → done ✅

🧩 Frontend (Basic Attachment UI)

I also added a simple UI:

  • Attach file 📎
  • Show file name
  • Remove file option

⚠️ Important Note

This UI is currently just for display purposes

👉 Meaning:

File is uploaded + text extracted  
BUT not yet used in AI responses ❌

Why this matters

Right now we have:

Data ingestion layer ✅

But not yet:

Intelligence layer ❌

🚀 Architecture So Far

User uploads file
        ↓
Backend extracts text
        ↓
Frontend stores text
        ↓
Ready for next step ✅
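On the frontend side, the upload itself is a plain multipart POST. A sketch — the endpoint path and the "file" field name are assumptions and must match whatever upload middleware the server uses:

```javascript
// Wrap the chosen file in multipart form data for the upload endpoint.
function buildUpload(file) {
  const formData = new FormData();
  // Field name is an assumption — it must match the server's middleware
  formData.append("file", file);
  return formData;
}

// POST it and keep the extracted text for the (future) RAG pipeline.
async function uploadAndExtract(file, endpoint = "/api/upload") {
  const res = await fetch(endpoint, { method: "POST", body: buildUpload(file) });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  const { text } = await res.json(); // { message, text } from the controller
  return text;
}
```

For now the returned text just sits in frontend state — wiring it into AI responses is the next layer.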

🧠 Key Learnings

  • Files are just containers
  • Text is what AI understands
  • Real-world integrations are messier than the docs suggest 😄

🔗 Live Demo & Code

👉 Live App: https://ai-chat-app-learning.netlify.app

👉 GitHub Repo: https://github.com/RohitKuwar/ai-chat-app/tree/feature/file-upload-api

🚀 What's Next

We now have raw text…

But AI still can't use it effectively.

Tomorrow I'll cover: Chunking — breaking large text into small, meaningful pieces

This is where RAG actually begins.

I'll be posting here daily as I learn. Let's grow together and stay ahead in the AI era 🚀

Happy learning ✌️