Teaching an AI Agent to Know Your Microservices: Markdown, RAG, and Graph Knowledge Bases

A beginner-friendly guide to knowledge bases for AI agents — and a head-to-head test of three approaches: plain Markdown, Graph-RAG, and a Knowledge Graph behind MCP tools.

TL;DR

Before writing code, a good AI agent should answer one question about a feature: what does this change touch? To answer it, the agent needs a knowledge base — its memory of your services. I built that knowledge base three different ways over the same two services and measured them. A knowledge graph the agent can query beat plain markdown docs and even semantic search (RAG), because "what depends on what" is a graph question.

The question every change starts with

You work on a platform with lots of microservices. A feature lands — "add a loyalty tier to orders" — and before any code, someone has to answer the scary question in distributed systems:

What does this actually touch?

Which services, which APIs, which events, which databases. Miss something and you ship a change to one service that quietly breaks another one three hops away.

More and more, we ask an AI agent to answer this — to read the feature and write a short spec first (this is called Spec-Driven Development: spec before code, a human approves, then you build). It's a great workflow. But it has one dependency that decides everything.

What's a "knowledge base," exactly?

If you've heard "knowledge base" in AI talks and just nodded — here's the plain version.

A knowledge base (KB) is the set of facts a system looks up to make a decision. For an agent working on your platform, the KB is everything it knows about your services: their APIs, the events they publish and consume, what they call, what data they own. Think of it as the agent's memory of your architecture.

The agent wasn't trained on your private services, so it can't reason about them from thin air. It consults the KB — like you'd open a wiki. No KB, no reasoning. A weak KB, weak reasoning.

Why the knowledge base is the real bottleneck

Here's the catch that makes the KB the most important piece:

The agent's spec is only as good as what it knows. Ask "what does this loyalty feature touch?" and if the knowledge base can't reveal that changing an event breaks a downstream listener, the spec won't mention it — and the workflow confidently ships a break. In short: the knowledge base caps the spec. So the real question isn't "is the agent smart?" — it's "is our knowledge base the right shape for the agent to find impact?"

For most teams, it isn't.

Where it breaks today

Most teams document services as a folder of README.md files — one per service, written in prose:

"order-service creates orders and publishes an order.created event. It calls inventory-service to reserve stock."

Great for humans. So we point the agent at all of them and ask for the impact of a feature. And it gives a confident, incomplete answer.

A concrete example. The feature: add a reservedQty field to the order.created event. The agent reads order-service.md, sees "publishes order.created," and says "order-service is affected." ✅ But — who consumes that event? That fact is on a different page the agent never opened, so it never flags that the consumer (inventory-service) must change too. That's exactly the thing the spec needed to catch, and it's gone.

This isn't the agent being dumb. It's the docs being the wrong shape. Prose has no "who-depends-on-me" links to follow. Every "who consumes this event?" or "who calls this API?" is invisible from the page you happen to read first. Pile on drift (docs rot) and size (40 long pages won't fit the prompt), and prose simply doesn't scale to an AI reader.

The fix: think in graphs, not pages

The shift is to picture your services not as documents but as a small graph of who-talks-to-whom: services, topics, and APIs are the nodes, and "publishes", "consumes", and "calls" are the edges.

Now "who breaks if order.created changes?" is a one-line answer: follow the consumes edge → inventory-service. "Who calls the reserve API?" — follow the edge the other way. The relationships the prose hid are now first-class. Impact becomes a lookup, not a guess.
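
A minimal Java sketch of that picture, assuming a tiny in-memory edge list. The class name ServiceGraph, the method names, and the "api:inventory-service/reserve" node name are illustrative assumptions, not the project's actual model; only the "topic:order.created" naming comes from the article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory graph: nodes are plain strings, edges are labelled pairs.
class ServiceGraph {
    record Edge(String from, String label, String to) {}

    private final List<Edge> edges = new ArrayList<>();

    void add(String from, String label, String to) {
        edges.add(new Edge(from, label, to));
    }

    // Follow edges backwards: who points at `target` with this label?
    List<String> sourcesOf(String label, String target) {
        List<String> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.label().equals(label) && e.to().equals(target)) {
                result.add(e.from());
            }
        }
        return result;
    }

    // The two services from the article, wired up as edges (node names are assumed).
    static ServiceGraph demo() {
        ServiceGraph g = new ServiceGraph();
        g.add("order-service", "publishes", "topic:order.created");
        g.add("inventory-service", "consumes", "topic:order.created");
        g.add("order-service", "calls", "api:inventory-service/reserve");
        return g;
    }
}
```

With this shape, ServiceGraph.demo().sourcesOf("consumes", "topic:order.created") walks the consumes edge backwards and returns inventory-service, which is the fact the prose page never surfaced.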

One catalog, three ways to use it

There's more than one way to hand an agent that graph. To compare them fairly, I built all three knowledge bases from the same source — scan the services once into a structured catalog (APIs, events, dependencies), then expose that one catalog three ways:

Because the underlying data is identical, whichever wins, we know it's the way the agent reads it that won — not better docs.
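
To make "structured catalog" concrete, here is one shape it could take in Java. This is a sketch, not the project's schema: the record name, the field names, and the endpoint paths ("POST /orders", "POST /reserve") are placeholders I made up for illustration. Only the service names, the order.created topic, and the order-to-inventory call come from the article.

```java
import java.util.List;

// Hypothetical catalog shape: one entry per service, produced by a single scan.
record ServiceEntry(
        String name,
        List<String> apis,        // endpoints the service exposes (placeholder paths)
        List<String> publishes,   // topics it produces
        List<String> consumes,    // topics it listens to
        List<String> calls) {}    // services it calls

class Catalog {
    // Hand-written stand-in for what a scanner might emit for the two demo services.
    static List<ServiceEntry> demo() {
        return List.of(
            new ServiceEntry("order-service",
                List.of("POST /orders"),
                List.of("order.created"),
                List.of(),
                List.of("inventory-service")),
            new ServiceEntry("inventory-service",
                List.of("POST /reserve"),
                List.of(),
                List.of("order.created"),
                List.of()));
    }
}
```

Each of the three knowledge bases can then be derived from the same list, which is what keeps the comparison about access pattern rather than content.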

1. Markdown docs — generated per-service pages (so they never drift). The agent just reads prose. Simple, but it can't follow "who-consumes-this."
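
A rough sketch of what "generated per-service pages" could mean, assuming a simplified catalog entry. The MarkdownPages class, the Service record, and the section titles are hypothetical; the real generator may lay pages out differently.

```java
import java.util.List;

class MarkdownPages {
    // Hypothetical slice of a catalog entry; field names are assumptions.
    record Service(String name, List<String> publishes, List<String> consumes, List<String> calls) {}

    // Render one per-service page from its catalog entry.
    static String render(Service s) {
        StringBuilder md = new StringBuilder();
        md.append("# ").append(s.name()).append("\n\n");
        appendSection(md, "Publishes", s.publishes());
        appendSection(md, "Consumes", s.consumes());
        appendSection(md, "Calls", s.calls());
        return md.toString();
    }

    private static void appendSection(StringBuilder md, String title, List<String> items) {
        if (items.isEmpty()) {
            return; // skip empty sections
        }
        md.append("## ").append(title).append("\n");
        for (String item : items) {
            md.append("- ").append(item).append("\n");
        }
        md.append("\n");
    }
}
```

Rendering order-service this way lists order.created under Publishes, but the page says nothing about who consumes it. That fact lives only on the inventory-service page, which is the blind spot described above.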

2. Graph-RAG — RAG (Retrieval-Augmented Generation) is the standard trick for giving an LLM new knowledge: turn each fact into a vector ("embedding") and fetch the ones closest to the question — like semantic search into the prompt. Graph-RAG adds one move: after retrieving, follow the graph one hop to pull in neighbours (the caller, the event's consumers). Bonus for Java teams: with Spring AI you can embed locally, no API key.
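
The two steps (retrieve by similarity, then hop once) can be sketched in plain Java. Everything here is a toy under stated assumptions: the GraphRagSketch class is hypothetical, the vectors are hand-written stand-ins for real embeddings, and the neighbour map is supplied by the caller rather than loaded from a graph store. It does not use Spring AI.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class GraphRagSketch {
    // A fact about one graph node, with a toy embedding (real ones come from an embedding model).
    record Fact(String node, String text, double[] vector) {}

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Step 1: plain RAG. Keep the k facts closest to the question vector.
    static List<Fact> retrieve(List<Fact> facts, double[] question, int k) {
        List<Fact> sorted = new ArrayList<>(facts);
        sorted.sort(Comparator.comparingDouble((Fact f) -> -cosine(f.vector(), question)));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    // Step 2: the graph hop. Add every direct neighbour of each retrieved node.
    static Set<String> expand(List<Fact> hits, Map<String, List<String>> neighbours) {
        Set<String> nodes = new LinkedHashSet<>();
        for (Fact hit : hits) {
            nodes.add(hit.node());
            nodes.addAll(neighbours.getOrDefault(hit.node(), List.of()));
        }
        return nodes;
    }
}
```

Note that expand pulls in all neighbours of a hit, relevant or not. That is what recovers hidden consumers, and it is also how an unaffected neighbour can end up in the prompt.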

3. Graph + MCP — skip retrieval; let the agent query the graph directly. MCP (Model Context Protocol) is just a standard for "tools an AI can call." With Spring AI a tool is one annotation:

@Tool(description = "Who is affected if this API / topic / service changes?")
public Object get_dependents(String ref) { ... }   // ref = "topic:order.created"

The agent calls it and gets an exact, repeatable answer — impact is computed, not guessed:

get_dependents("topic:order.created")
  → producers: [order-service], consumers: [inventory-service]
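
The tool body is elided above, so here is one guess at what it might do, written as plain Java so it runs without Spring. The DependentsLookup class, the two lookup maps, and the decision to handle only "topic:" refs are my assumptions; in the real project the method would carry the @Tool annotation and read from the catalog.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DependentsLookup {
    // Hypothetical catalog slices: topic name -> services, built once from the scan.
    private final Map<String, List<String>> producers;
    private final Map<String, List<String>> consumers;

    DependentsLookup(Map<String, List<String>> producers, Map<String, List<String>> consumers) {
        this.producers = producers;
        this.consumers = consumers;
    }

    // Only "topic:" refs are handled in this sketch; API and service refs would need their own branches.
    Map<String, List<String>> getDependents(String ref) {
        if (!ref.startsWith("topic:")) {
            throw new IllegalArgumentException("unsupported ref: " + ref);
        }
        String topic = ref.substring("topic:".length());
        Map<String, List<String>> result = new LinkedHashMap<>(); // keeps producers before consumers
        result.put("producers", producers.getOrDefault(topic, List.of()));
        result.put("consumers", consumers.getOrDefault(topic, List.of()));
        return result;
    }
}
```

Because the answer is a map lookup rather than a generation step, the same ref returns the same result on every call, which is where the repeatability comes from.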

The test: same question, no peeking, judged fairly

Two rules made the comparison trustworthy:

The agent only sees the knowledge base — never the source code. (If it could read the code, we'd be testing the model, not the KB.)
An AI judge compares the answers pairwise, against the feature itself, the way you'd compare two pull requests against a ticket. I avoided hand-writing an "answer key" on purpose — I built one of the approaches, and my key could be biased. The judge also has guardrails: a hedged answer ("might affect…") counts as a miss, and longer isn't better.

The judge scored five things: completeness (did it find all the affected pieces?), scope (did it avoid false alarms?), exactness of the API details, repeatability, and readability of the spec.
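
A pairwise comparison like this needs only a small harness around the judge. The sketch below is an assumption about how such a harness could look, not the project's code: the PairwiseEval class, the Judge interface, and the idea of counting wins per approach are all hypothetical, and the judge itself (an LLM call in practice) is left as an interface to plug in.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PairwiseEval {
    // Stand-in for the AI judge: returns the label of the better answer for one pair.
    interface Judge {
        String pickWinner(String feature, String labelA, String answerA, String labelB, String answerB);
    }

    // Compare every pair of approaches once and count wins per approach.
    static Map<String, Integer> tally(String feature, Map<String, String> answers, Judge judge) {
        List<String> labels = List.copyOf(answers.keySet());
        Map<String, Integer> wins = new LinkedHashMap<>();
        for (String label : labels) {
            wins.put(label, 0);
        }
        for (int i = 0; i < labels.size(); i++) {
            for (int j = i + 1; j < labels.size(); j++) {
                String a = labels.get(i);
                String b = labels.get(j);
                String winner = judge.pickWinner(feature, a, answers.get(a), b, answers.get(b));
                wins.merge(winner, 1, Integer::sum);
            }
        }
        return wins;
    }
}
```

Passing the feature text into every comparison keeps the judge anchored to the ticket rather than to a hand-written answer key.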

The results

The graph-and-tools approach won, clearly. And the two runners-up failed in opposite ways — that's the interesting bit:

Markdown is cautious but blind. It never invents impact (perfect on false alarms), but it missed things — it saw the event's producer and never asked who consumes it. Safe, but incomplete.
Graph-RAG is eager. The graph hop found the hidden consumers and callers Markdown missed — but on small, single-service changes it sometimes dragged in a neighbour that wasn't affected. It finds more, but raises false alarms.
Graph + MCP is both complete and precise, because it follows the real graph instead of reading prose or guessing from similarity.

In one line: Markdown misses, RAG over-reaches, the graph computes.

I'd expect the gap to grow with scale, though I only tested two services: with so few services there are barely any hidden links to miss, so the demo probably understates the winner's lead. At 100+ services, prose should fall further behind.

You don't have to pick just one

The honest answer is to layer them — they're all generated from the same catalog, so they're different front doors, not rivals:

Graph + MCP as the engine for "what does this touch?"
Markdown for humans to read (generated, so it never drifts).
Graph-RAG when the question is fuzzy and you want recall over exactness.

Takeaways

In Spec-Driven Development, the knowledge base caps the spec. Fix the KB before blaming the agent.
"What does this touch?" is a graph question — answer it with a graph, not by reading or embedding paragraphs.
MCP turns a graph lookup into a tool the agent calls — in Spring AI, a single @Tool annotation.
Build the KB from code, not by hand — generated views don't drift, and you get a fair comparison.
Measure your knowledge base; don't assume it. Completeness, false-alarm rate, and repeatability are all scorable — in an afternoon.

The full project — two Spring Boot services, all three knowledge bases, the test features, and the AI judge — is on GitHub: github.com/ganesh/sdd-knowledgebase-evaluation. Clone it and check my numbers.
