A silent failure broke our document pipeline for days. Nobody noticed. Here's how AI found the problem, traced its origin, and helped us recover everything.
This is the second article in a series about building traceability systems for industrial operations. In the first piece, I talked about the one skill AI can't replace — finding the pain. Today, I want to tell you about something AI is surprisingly good at: playing detective when things go wrong in production.
The System That Was Already Working
Quick context: we're building an ERP system for a German recycling conglomerate that processes thousands of tons of non-ferrous metals. One of the workflows we built handles credit notes — the supplier sends a PDF invoice by email, our automation pipeline picks it up, an AI model reads and transcribes the document, and a credit note is created in the ERP with the PDF attached.
It was working. It had been tested. It was in the "done" column.
Until it wasn't.
How We Found Out (By Accident)
We're in the middle of user acceptance testing. Real users, real data, real workflows. Features have been accumulating — the system is getting more capable every week. One of those features is particularly interesting: when a credit note is imported, it can reference materials that don't yet exist in the database. The system creates them automatically, but marks the credit note as incomplete until a human reviews it.
So a user goes to review one of these incomplete credit notes. They want to verify that the AI transcribed the original PDF correctly. They click to open the attachment.
Nothing. The PDF isn't there.
They try another credit note. Same thing. And another. Some have attachments that appear to be there but are corrupted — click and nothing renders.
That's when the call came in.
The AI Detective
Here's where it gets interesting. Instead of manually digging through logs, deployment histories, and N8N workflow executions, we brought in AI.
The task was simple: figure out what happened, when it started, and how bad it is.
The AI worked almost like a detective on a case:
Establishing the timeline. It analyzed the credit notes in the database and identified the exact date when attachments stopped being properly saved. Not approximately — exactly. Every credit note before that date had a valid PDF. Every one after was either missing the attachment or had a corrupted file.
Correlating with deployments. That date matched something: our migration to N8N 2.0. The platform upgrade had silently broken the PDF serialization in our email processing workflow. The documents were being fetched but not properly encoded when passed to the ERP.
Pinpointing the failure. The AI didn't just say "something broke in N8N." It identified the specific node in the workflow where the serialization was failing. That precision meant we didn't waste hours debugging the entire pipeline — we went straight to the point of failure and fixed it.
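The timeline step above can be sketched as a simple integrity scan. This is a minimal illustration, not the actual query: the `credit_notes` records, the dates, and the `%PDF` magic-byte check stand in for whatever the AI actually ran against the ERP database.

```python
import base64
from datetime import date

# Hypothetical records: (creation date, base64-encoded attachment or None),
# standing in for rows queried from the ERP database.
credit_notes = [
    (date(2025, 1, 10), base64.b64encode(b"%PDF-1.7 ...").decode()),
    (date(2025, 1, 12), base64.b64encode(b"%PDF-1.7 ...").decode()),
    (date(2025, 1, 15), None),                                       # missing
    (date(2025, 1, 16), base64.b64encode(b"\x00garbage").decode()),  # corrupted
]

def is_valid_pdf(attachment_b64):
    """An attachment is valid if it decodes and starts with the PDF magic bytes."""
    if not attachment_b64:
        return False
    try:
        return base64.b64decode(attachment_b64).startswith(b"%PDF")
    except (ValueError, TypeError):
        return False

# The earliest date with a broken attachment marks the start of the incident.
broken_dates = [d for d, pdf in credit_notes if not is_valid_pdf(pdf)]
incident_start = min(broken_dates) if broken_dates else None
print(incident_start)  # 2025-01-15
```

The same scan that finds the cutoff date also gives you the blast radius for free: every record in `broken_dates` is a candidate for reprocessing.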
The Fix Was Only Half the Story
Great, we found the bug and patched it. New credit notes would process correctly going forward.
But what about the 200+ credit notes that were already broken?
The short answer: we could recover them. The long answer came with a significant asterisk.
Recovery meant reprocessing every affected credit note in bulk — going back to the original emails, re-extracting the PDFs through the corrected workflow, and reattaching them to the right records. But mass reprocessing is never clean. Some attachments would end up duplicated. Each recovered note needed validation. The margin for introducing new errors while fixing old ones was real.
We worked through this methodically, using MCP (Model Context Protocol) connections to both Odoo and N8N. The AI helped us scope the reprocessing, identify edge cases, and validate results. Step by step, credit note by credit note when necessary.
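A minimal sketch of the per-note recovery logic described above. The hooks `fetch_pdf_from_email` and `attach` are hypothetical stand-ins for the corrected N8N workflow and the ERP attachment API; hashing the PDF bytes is one way to skip the duplicate attachments that mass reprocessing tends to create.

```python
import base64
import hashlib

def reattach(note, fetch_pdf_from_email, attach):
    """Reprocess one broken credit note; return a status string.

    `note` is a dict with 'email_id' and 'attachments' (existing base64
    attachments); the two callables are hypothetical hooks into the fixed
    pipeline and the ERP.
    """
    pdf = fetch_pdf_from_email(note["email_id"])
    if not pdf.startswith(b"%PDF"):
        return "needs-review"          # re-extraction still broken: flag for a human
    digest = hashlib.sha256(pdf).hexdigest()
    existing = {hashlib.sha256(base64.b64decode(a)).hexdigest()
                for a in note["attachments"]}
    if digest in existing:
        return "duplicate-skipped"     # already attached: avoid double entries
    attach(note, base64.b64encode(pdf).decode())
    return "recovered"
```

Running every affected note through a function like this, and only auto-attaching on the `"recovered"` path, keeps the margin for new errors small: anything ambiguous falls out as `"needs-review"` instead of silently overwriting data.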
By the end of the day, the database was clean. Every credit note had its correct PDF attached. The books were in order.
The Bitter Lesson
Let's start with what went wrong, because it matters more than what went right.
A critical failure ran undetected for days. The pipeline was silently producing broken output, and nobody noticed until a user happened to click on the wrong attachment. That's not a technology failure — that's a monitoring failure.
We had no alerts for attachment integrity. No automated checks comparing "emails processed" against "valid PDFs attached." No dashboard showing the health of this specific workflow. The system was technically running — N8N reported no errors, the ERP showed credit notes being created — but the output was garbage.
Mental note: monitoring is not optional. If a workflow produces output that humans depend on, there needs to be a way to verify that output is correct — not just that the workflow executed.
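One way such a check might look, as a sketch: compare the number of emails processed against the number of credit notes carrying a structurally valid PDF, and alert on any gap. The function and the counts are illustrative, not our actual monitoring code.

```python
def attachment_health(emails_processed, attachments):
    """Return (valid, missing_or_broken) counts for a batch of credit notes.

    `attachments` is a list of raw attachment bytes (or None); in a real
    check both inputs would come from the mailbox and the ERP, not literals.
    """
    valid = sum(1 for pdf in attachments if pdf and pdf.startswith(b"%PDF"))
    return valid, emails_processed - valid

valid, broken = attachment_health(3, [b"%PDF-1.7 ok", None, b"\x00junk"])
if broken:
    print(f"ALERT: {broken} of {valid + broken} credit notes lack a valid PDF")
```

A check like this, run on a schedule, would have caught our incident on day one: N8N reporting "no errors" means nothing when the test is "did the workflow run" rather than "is the output correct".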
The Sweet Lesson
Every day, AI proves itself a better ally in ways I didn't expect. I had faith in its ability to write code, generate modules, process documents. But detective work? Correlating dates with deployments, tracing failures through multi-system pipelines, scoping the blast radius of a silent bug?
That surprised me.
The MCP connections were the real superpower here. Having AI that can simultaneously query the ERP database, inspect N8N workflow executions, and cross-reference the results — that's not just convenient; it's a fundamentally different way of debugging. Instead of a developer jumping between three browser tabs and a terminal, the AI sees everything at once and draws connections a human might miss.
There's a new development in this space — MCP Apps — that promises to take this even further. But that's a story for another day.
The Takeaway
Production systems fail silently. That's not news. What's new is having an AI that can reconstruct the crime scene after the fact — identify when things broke, why they broke, and what the damage is — in a fraction of the time it would take a human.
But the real takeaway is simpler: don't wait for a user to find your bugs. Build the monitoring first. Because the best detective is the one you never need to call.
Next in this series: how MCP connections are changing the way we interact with complex systems — and why the integration layer is becoming more important than the systems themselves.
— Omar