Build a Self-Correcting AI Agent with LangGraph and Ollama

A fully local Generator → Critic → Adjudicator loop: three Ollama models, a structured JSON rubric, conditional routing, and a quality bar that decides when the loop ends.

In this article, I will show you how to build a self-correcting agent loop that runs entirely on your own machine. The agent takes a Python source file and produces Markdown API documentation for it. But instead of asking a model to write the documentation once and hoping for the best, we will set up a cycle: one model drafts, a second model criticizes the draft against the source code, and the draft goes back for revision until it clears a quality bar. A third model then does a final pass to resolve any leftover contradictions.

This pattern is usually called reflexion. The idea is simple: LLMs tend to be better at spotting flaws in a piece of text than at producing flawless text in one shot. So, you split the work into roles. A detail that matters here is that each role is served by a different local model. If the same model both writes and grades the text, it tends to rubber-stamp its own output. Using a separate critic model reduces that risk.

In my earlier articles, I built local agents from scratch, without a framework, and I said at the time that tools like LangGraph make more sense once you have built the thin version yourself. So, this time I am reaching for LangGraph on purpose, because a revision loop with conditional exits is exactly the kind of cyclic workflow it was designed for. The models are served by Ollama. I am using qwen3.5:9b as the generator, deepseek-coder:6.7b as the critic, and llama3:8b as the adjudicator, but the roles are configurable through a small JSON file, so it is perfectly possible to swap in whatever you have pulled locally.

The result is a single console command, reflexion, that reads a source file, runs the loop, prints a scorecard for every round, and writes the polished Markdown to disk. The whole pipeline is one Python file of around 360 lines.

Here are the seven parts we will go through:

Part 1: The workflow state and the critic's evaluation schema
Part 2: The Generator node
Part 3: The Critic node
Part 4: The router that decides whether to loop or exit
Part 5: The Adjudicator node
Part 6: Wiring up the graph
Part 7: The runtime, configuration, and output cleanup

So, let us get started.

Installation

The full code is on GitHub at local-reflexion-agent. Everything lives in a single module, src/reflexion/main.py, exposed as a console script via pyproject.toml.

First, clone the repo and install it in editable mode:

git clone https://github.com/jfjensen/local-reflexion-agent.git
cd local-reflexion-agent
pip install -e .

This pulls in langgraph, langchain-core, langchain-community, langchain-ollama, and pydantic, and registers one console script called reflexion.

Next, make sure Ollama is running locally and that you have pulled the three models referenced in config.json:

ollama pull qwen3.5:9b
ollama pull deepseek-coder:6.7b
ollama pull llama3:8b

In your case, you may want to use different models, and that is fine. The config.json at the repo root is where the roles are assigned, and we will come back to it in Part 7.

The repo ships with three small sample source files in input/, so you can run the pipeline immediately:

reflexion --input input/source_function_a.py --output output/result.md --config config.json

So, with the install out of the way, let us go through the parts.

Part 1: The workflow state and the critic's evaluation schema

First, we need two data structures. One describes what the critic must return, and one describes the memory that flows through the graph.

The critic's output is the more interesting of the two. A local 6.7B model, left to its own devices, will happily reply with a paragraph of prose when you ask it for a score. We do not want prose, we want numbers and a single actionable fix. So, we define a Pydantic schema and force the model to fill it in:

from pydantic import BaseModel, Field

class DocEvaluationSchema(BaseModel):
    """Pydantic schema used to force a structured JSON schema constraint out of our local critic model.
    Deliberately contains only judgment fields: the overall average and the pass/fail decision are computed in code."""

    technical_accuracy: int = Field(description="Score from 0-100 on accuracy.")
    completeness: int = Field(description="Score from 0-100 covering parameter definitions.")
    single_critical_fix: str = Field(description="The single absolute highest priority correction required.")

A few things to note:

The single_critical_fix field is the heart of the loop. Instead of asking the critic for a list of everything wrong, we ask for the one highest-priority correction. A small generator model handles "fix this one thing" far better than a wall of mixed feedback.
What is deliberately not in the schema matters as much as what is. An earlier version of this schema also asked the model for an overall percentage and a pass/fail boolean. That turned out to be a mistake, and I will come back to how it failed in Part 3. The short version: never ask a model to do arithmetic or a threshold comparison that Python can do in one line. The schema now contains only the judgments a model is actually needed for, and the critic node computes the average and the pass/fail decision in code.
The Field descriptions are not decoration. They are passed to the model as part of the JSON schema constraint, so they are effectively micro-prompts.

Then, the state. LangGraph passes a state object from node to node, and each node returns a partial update to it. We define it as a TypedDict:

from typing import Annotated, List, Optional, TypedDict

def append_scores(old: List[int], new: List[int]) -> List[int]:
    return old + new

class AgentWorkflowState(TypedDict):
    """Memory tracking state for the heterogeneous local graph layout."""
    source_code: str
    generated_markdown: str
    current_score: int
    score_history: Annotated[List[int], append_scores]
    required_fix: Optional[str]
    iteration_count: int
    is_approved: bool
    final_adjudicated_markdown: Optional[str]

The one non-obvious line is score_history. The Annotated[List[int], append_scores] type tells LangGraph to merge updates to this field using the append_scores reducer instead of overwriting it. So, every time the critic returns a score, it gets appended to the history rather than replacing it. At the end of a run you get the full score curve across all revision rounds, which turns out to be the most informative single line the pipeline prints.
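To make the reducer behaviour concrete, here is a small stand-alone sketch of the merge rule. It does not use LangGraph: merge_update and REDUCERS are hypothetical stand-ins for what the framework does when a node returns a partial update, under the assumption that fields without a reducer are simply overwritten.

```python
from typing import Any, Callable, Dict, List


def append_scores(old: List[int], new: List[int]) -> List[int]:
    return old + new


# Hypothetical registry: field name -> reducer. Fields not listed are overwritten.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {"score_history": append_scores}


def merge_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified model of merging a node's partial update into the state."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)  # combine old and new
        else:
            merged[key] = value  # plain overwrite
    return merged


state: Dict[str, Any] = {"current_score": 0, "score_history": []}
for score in (78, 91, 97):
    # A critic round reports the score twice: once to overwrite, once to append.
    state = merge_update(state, {"current_score": score, "score_history": [score]})

print(state["current_score"])  # 97
print(state["score_history"])  # [78, 91, 97]
```

The overwritten field only remembers the last round, while the reduced field keeps the whole curve.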

Part 2: The Generator node

Now, the first node. In LangGraph, a node is just a function that takes the state and returns a dictionary of updates. The Generator has two modes: on the first pass it drafts documentation straight from the source code, and on every later pass it revises its previous draft to address the critic's fix.

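# Imports used by the node excerpts (ChatOllama is assumed to come from langchain-ollama).
# MODEL_REGISTRY is loaded from config.json, see Part 7.
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama

def content_generator_node(state: AgentWorkflowState) -> dict: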
    """The Generator node: Focuses entirely on drafting/editing documentation text layout."""
    current_iteration = state.get("iteration_count", 0)
    target_model = MODEL_REGISTRY["GENERATOR"]

    # ... console logging trimmed, see the repo for the full version
    llm = ChatOllama(model=target_model, temperature=0.7)
    if not state.get("generated_markdown"):
        system_prompt = "You are an expert technical writer. Convert raw code snippets into clean, comprehensive API markdown documentation."
        user_prompt = f"Generate Markdown documentation for the following source code:\n\n{state['source_code']}"
    else:
        system_prompt = "You are a precise editor rewriting technical documentation to fix a structural flaw."
        user_prompt = (
            f"Your previous documentation draft was scored {state['current_score']}/100.\n"
            f"You must completely rewrite the documentation to fix this specific issue: {state['required_fix']}\n\n"
            f"Original Code Context:\n{state['source_code']}\n\n"
            f"Previous Draft:\n{state['generated_markdown']}"
        )
    messages = [SystemMessage(content=system_prompt), HumanMessage(content=user_prompt)]
    response = llm.invoke(messages)
    return {
        "generated_markdown": response.content,
        "iteration_count": current_iteration + 1
    }

Here is a step-by-step description of the above code:

Mode selection:

If generated_markdown is empty, this is round one, and the model is prompted as a technical writer working from the raw source code.
If a draft already exists, the model is prompted as an editor instead. It gets the previous score, the single critical fix from the critic, the original source code, and its own previous draft.

The revision prompt:

Telling the model its previous score is a small trick, but it works. A concrete "you scored 78/100" pushes the model to actually change things rather than lightly rephrase.
The instruction says to completely rewrite the documentation to fix one specific issue. Narrow instructions like this are what small local models follow well.

Temperature:

The generator runs at temperature=0.7, so the drafts have some variety across revision rounds. The critic, as you will see next, runs at zero.

The return value:

The node returns only the fields it changed. LangGraph merges this into the state, and iteration_count goes up by one per generation round.

Part 3: The Critic node

Next, the critic. This is a different model entirely, deepseek-coder:6.7b by default, and getting its prompt right took me more attempts than any other part of this build.

def content_critic_node(state: AgentWorkflowState) -> dict:
    """The Critic node: Cross-checks generator output using a different model with strict JSON formatting."""
    target_model = MODEL_REGISTRY.get("CRITIC")

    llm = ChatOllama(model=target_model, temperature=0.0)
    structured_llm = llm.with_structured_output(DocEvaluationSchema)

    system_prompt = (
        f"You are a meticulous QA inspector evaluating API documentation against its source code. "
        f"You are grading the DOCUMENTATION, not the source code. Never deduct points because the source code "
        f"itself could be designed better, and never ask for the source code to be changed. "
        f"Score by deduction: start from 100 and subtract points only for concrete errors you can verify by "
        f"pointing at specific documentation text that contradicts the source code. "
        f"Check that every 'raise' statement in the source is documented with its exact exception type and its "
        f"exact quoted error message: a missing, denied, or misquoted exception is a major deduction. "
        f"Recompute every numeric value shown in the documentation's examples against what the source code would "
        f"actually return: a wrong number is a major deduction. "
        f"If you find no verifiable errors, score 100. "
        f"The single_critical_fix must point at the documentation text to change, not at the source code."
    )

    user_prompt = (
        f"Evaluate the following markdown documentation relative to its raw source code.\n\n"
        f"Raw Source Code:\n{state['source_code']}\n\n"
        f"Generated Markdown Document:\n{state['generated_markdown']}"
    )

    evaluation: DocEvaluationSchema = structured_llm.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ])

    # The model provides the judgment scores; the arithmetic and the pass/fail decision are done here in code.
    overall_score = round((evaluation.technical_accuracy + evaluation.completeness) / 2)
    is_approved = overall_score >= QUALITY_SCORE_BAR
    required_fix = sanitize_critical_fix(evaluation.single_critical_fix)

    return {
        "current_score": overall_score,
        "score_history": [overall_score],
        "required_fix": required_fix,
        "is_approved": is_approved
    }

Here is a step-by-step description of the above code:

Structured output:

llm.with_structured_output(DocEvaluationSchema) is the important line. It constrains the model so that what comes back is a validated DocEvaluationSchema object, not free text. No JSON parsing, no regex, no retry-on-malformed-output logic.
The critic runs at temperature=0.0. A grader should be deterministic, or as close to it as a local model gets.

The prompt, and how it got this way:

This prompt went through three versions, and the history is worth telling because it is the most transferable lesson in this article. My first version politely asked the model to analyze the documentation and score it. The result was polite scores in the high 80s and 90s for documents with invented error messages in them, and the loop exited too early.
So, version two told the model to be "an exceptionally harsh QA inspector" and to "tank the score aggressively" when it found problems. On the next test run the critic scored a substantially correct draft 0 out of 100, four rounds in a row, and its justification for round one was flatly wrong about the source code. A small critic model does not calibrate, it follows tone. Told to be generous, it rubber-stamps. Told to be harsh, it floors everything.
The version above steers by procedure instead of tone. Score by deduction, starting from 100. Subtract only for errors you can verify by pointing at specific documentation text. If nothing is verifiably wrong, the score is 100. Anchoring the mechanics of scoring gives the model far less room to express a mood.
The prompt also states outright that the critic is grading the documentation, not the function. This exists because in several runs the "critical fix" was a critique of the source code's design, missing validation, weak cryptography, and so on, which the generator can do nothing about, since its job is to describe the code as it is.
The two verification checks, exact exception types with quoted messages and recomputed example numbers, are named explicitly because those were the two error classes that slipped past vaguer prompts in my test runs. Naming the specific checks you care about works far better with a small critic model than a general plea for accuracy.

The arithmetic happens in code, not in the model:

This is the lesson I mentioned back in Part 1, and I learned it from a run I could not quite believe. My earlier version asked the model itself for an overall percentage and a pass/fail boolean. In one run, the critic returned technical accuracy 100 and completeness 100 with a "net average" of exactly 95, which happened to be the pass bar it had been told about, not any average of its own scores. In another run, it scored a draft 97 against a bar of 95 four times in a row and set the pass flag to false every time, so the loop burned all four attempts on drafts that had already passed. The router trusted the boolean, and its log cheerfully reported that 97 was below 95.
So, the model now reports only the two sub-scores, and these three lines of Python do the rest: the average, the comparison against QUALITY_SCORE_BAR, and the decision the router will act on. A threshold comparison is not a judgment call, and nothing that is not a judgment call should be delegated to the model.
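The arithmetic is small enough to check in isolation. The sketch below wraps the same two operations in a hypothetical helper, score_and_decide, and uses the default bar of 95. The sample sub-scores are made up for illustration.

```python
QUALITY_SCORE_BAR = 95  # same default as config.json


def score_and_decide(technical_accuracy: int, completeness: int,
                     bar: int = QUALITY_SCORE_BAR):
    """Hypothetical helper: average the two sub-scores and compare against the bar."""
    overall = round((technical_accuracy + completeness) / 2)
    return overall, overall >= bar


print(score_and_decide(100, 100))  # (100, True)
print(score_and_decide(97, 97))    # (97, True)
print(score_and_decide(94, 95))    # (94, False): 94.5 rounds to the even neighbour, 94
print(score_and_decide(95, 96))    # (96, True): 95.5 rounds to the even neighbour, 96
```

One thing to be aware of is that Python's round() rounds halves to the nearest even integer, so an average of 94.5 lands on 94 and misses a bar of 95. If that edge matters to you, comparing the unrounded average against the bar is one way to sidestep it.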

The sanitize_critical_fix guard:

In that same four-round run, the critic's feedback in one round degenerated into a few hundred repeated ../ tokens, which then got injected verbatim into the generator's next revision prompt as "the issue to fix". Small models under structured-output constraints do this occasionally.
The guard checks whether the fix string is empty or degenerate. Degenerate here means one of two things: the whole string is built from at most four distinct characters, or it has more than five words drawn from at most two distinct ones. If so, the guard replaces it with a generic instruction to re-verify the documentation against the source. Not clever, but it stops garbage from steering a whole revision round.

The guard itself is small enough to show in full:

def sanitize_critical_fix(fix: Optional[str]) -> str:
    """Guards against degenerate critic feedback (empty strings or runaway token repetition)."""
    stripped = (fix or "").strip()
    words = stripped.split()
    is_degenerate = (
        not stripped
        or len(set(stripped)) <= 4
        or (len(words) > 5 and len(set(words)) <= 2)
    )
    if is_degenerate:
        return "Re-verify the documentation line by line against the source code and correct the weakest section."
    return stripped
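A quick way to sanity-check the guard is to feed it the failure case described above next to an ordinary fix. The function body below is copied from the listing so the snippet runs on its own. FALLBACK is just a name I gave the generic instruction, and the sample inputs are made up for illustration.

```python
from typing import Optional

FALLBACK = ("Re-verify the documentation line by line against the source code "
            "and correct the weakest section.")


def sanitize_critical_fix(fix: Optional[str]) -> str:
    """Guards against degenerate critic feedback (empty strings or runaway token repetition)."""
    stripped = (fix or "").strip()
    words = stripped.split()
    is_degenerate = (
        not stripped
        or len(set(stripped)) <= 4
        or (len(words) > 5 and len(set(words)) <= 2)
    )
    if is_degenerate:
        return FALLBACK
    return stripped


runaway = "../" * 300                      # only two distinct characters
looping = "error message " * 3             # six words, two distinct ones
real_fix = "  Document the ValueError raised when amount is negative.  "

print(sanitize_critical_fix(runaway) == FALLBACK)   # True
print(sanitize_critical_fix(looping) == FALLBACK)   # True
print(sanitize_critical_fix(None) == FALLBACK)      # True
print(sanitize_critical_fix(real_fix))              # the fix, with outer whitespace removed
```

The character rule is blunt: any fix made of four or fewer distinct characters is treated as garbage, however legitimate it is. For feedback written in full sentences that is unlikely to bite.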

One last detail on the return value: score_history goes back as a one-element list, and the reducer from Part 1 appends it to the existing history, so the score curve accumulates across rounds.

The node also prints a scorecard to the console for every round, with the technical accuracy, the completeness, the net score against the pass bar, and the critic's stated feedback. I trimmed those print statements here, but they are in the repo, and watching the scorecard change across rounds is honestly half the fun of running this thing.

Part 4: The router that decides whether to loop or exit

So, we now have a generator and a critic, and we need something to decide what happens after each critique. In LangGraph this is a routing function: it inspects the state and returns the name of the next node.

from typing import Literal

def execution_router(state: AgentWorkflowState) -> Literal["content_generator", "content_adjudicator"]:
    """Monitors workflow states and determines whether to repeat, loop, or step out to adjudication."""

    if state["is_approved"]:
        return "content_adjudicator"

    if state["iteration_count"] >= MAX_ATTEMPTS:
        return "content_adjudicator"

    return "content_generator"

A few things to note:

There are exactly two ways out of the loop. Either the critic approved the draft, or we hit the hard cap on attempts. In both cases the draft goes to the Adjudicator. In every other case it goes back to the Generator.
The hard cap matters more than it looks. A 95-point quality bar with a harsh critic means some drafts never pass. Without MAX_ATTEMPTS, the loop would keep burning Ollama calls until LangGraph's own recursion limit aborted the run with an error. Four attempts is somewhat arbitrary, but it is good enough for our purposes, and it is configurable.
The Literal return type is not just documentation. It matches the routing map we will define when wiring the graph in Part 6.
As with the other nodes, I trimmed the console prints from this excerpt. In the repo, the router logs the attempt count and the score history before every decision.
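Because the router is a pure function of the state, its behaviour can be pinned down with a small truth table. The sketch below restates the router so it runs on its own. RouterState is a hypothetical cut-down state holding only the two fields the router reads, and MAX_ATTEMPTS is hard-coded to the default of 4.

```python
from typing import Literal, TypedDict

MAX_ATTEMPTS = 4  # same default as config.json


class RouterState(TypedDict):
    """Hypothetical cut-down state: only the fields the router inspects."""
    is_approved: bool
    iteration_count: int


def execution_router(state: RouterState) -> Literal["content_generator", "content_adjudicator"]:
    if state["is_approved"]:
        return "content_adjudicator"
    if state["iteration_count"] >= MAX_ATTEMPTS:
        return "content_adjudicator"
    return "content_generator"


cases = [
    ({"is_approved": True, "iteration_count": 1}, "content_adjudicator"),   # passed early
    ({"is_approved": False, "iteration_count": 1}, "content_generator"),    # revise
    ({"is_approved": False, "iteration_count": 3}, "content_generator"),    # one attempt left
    ({"is_approved": False, "iteration_count": 4}, "content_adjudicator"),  # hard cap reached
    ({"is_approved": True, "iteration_count": 4}, "content_adjudicator"),   # passed on the last try
]
for state, expected in cases:
    assert execution_router(state) == expected
print("all routing cases behave as described")
```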

Part 5: The Adjudicator node

Then, the final role. Whatever exits the loop, whether it passed or timed out, gets one last pass from a third model. The Adjudicator's job is not to add content. It is to resolve contradictions that crept in across revision rounds and to strip out anything conversational.

def content_adjudicator_node(state: AgentWorkflowState) -> dict:
    """The Adjudicator node: Runs outside the loop to resolve structural contradictions and finalize text."""
    target_model = MODEL_REGISTRY["ADJUDICATOR"]

    llm = ChatOllama(model=target_model, temperature=0.2)

    system_prompt = (
        "You are a senior technical documentation editor giving the final word on an API document "
        "that has been through multiple revision rounds. Your job is to read the raw source code, "
        "look at the latest draft, eliminate any conflicting statements, clean up semantic prose, and output the polished final markdown document. "
        "Do not include any preamble, comments, conversational pleasantries, or explanations; return ONLY the markdown documentation."
    )

    user_prompt = (
        f"Review and finalize this documentation template.\n\n"
        f"Raw Source Reference Code:\n{state['source_code']}\n\n"
        f"Latest Revision Draft:\n{state['generated_markdown']}\n\n"
        f"Last Remaining Critic Concern:\n\"{state['required_fix']}\""
    )

    response = llm.invoke([SystemMessage(content=system_prompt), HumanMessage(content=user_prompt)])

    return {"final_adjudicated_markdown": response.content}

Here is a step-by-step description of the above code:

Why this node exists at all:

After a few "completely rewrite to fix X" rounds, a draft can end up with internal contradictions. One round adds a claim, a later rewrite contradicts it without removing it. The Adjudicator gets the source code, the latest draft, and the critic's last remaining concern, and is told to eliminate conflicting statements.
It also serves as the exit path for drafts that never cleared the bar. Rather than shipping the raw fourth attempt, we at least get a harmonized version of it.

Temperature:

The Adjudicator runs at temperature=0.2. Low, because this is an editing job, but not zero, because it still has to rewrite prose.

The instruction to return only Markdown:

Local models love to open with "Certainly! Here is your documentation:" and close with "I hope this helps." The prompt forbids it. As we will see in Part 7, the prompt alone is not enough, and there is a post-processing step to catch what slips through.

Part 6: Wiring up the graph

Now we assemble the pieces. This is where LangGraph earns its keep, because the whole topology, including the cycle, is declared in a handful of lines:

from langgraph.graph import END, StateGraph

builder = StateGraph(AgentWorkflowState)

builder.add_node("content_generator", content_generator_node)
builder.add_node("content_critic", content_critic_node)
builder.add_node("content_adjudicator", content_adjudicator_node)

builder.set_entry_point("content_generator")
builder.add_edge("content_generator", "content_critic")

builder.add_conditional_edges(
    "content_critic",
    execution_router,
    {
        "content_generator": "content_generator",
        "content_adjudicator": "content_adjudicator"
    }
)

builder.add_edge("content_adjudicator", END)
agent_pipeline = builder.compile()

Here is a step-by-step description of the above code:

Nodes and the entry point:

The three node functions are registered under names, and the Generator is set as the entry point.
The Generator always hands off to the Critic. That edge is unconditional.

The conditional edge:

add_conditional_edges attaches the router from Part 4 to the Critic. After every critique, the router runs, and its return value is looked up in the mapping to find the next node. This mapping is the cycle: content_critic can route back to content_generator, which flows to content_critic again.

Compilation:

builder.compile() returns a runnable pipeline. From here on, the whole loop is a single agent_pipeline.invoke(initial_state) call.

If you have ever tried to write this loop by hand, with a while loop, manual state threading, and an attempt counter, you will recognize how much bookkeeping just disappeared. The graph declaration is also self-documenting in a way the hand-rolled loop never was.
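For comparison, here is roughly what the hand-rolled version looks like. This is an illustrative sketch, not code from the repo: the three model calls are replaced by stubs, and the critic replays a scripted list of scores so the control flow can be run without Ollama. The function name run_loop_by_hand and the scripted scores are made up.

```python
from typing import Dict, List

QUALITY_SCORE_BAR = 95
MAX_ATTEMPTS = 4


def run_loop_by_hand(source_code: str, scripted_scores: List[int]) -> Dict[str, object]:
    """Hand-rolled Generator -> Critic -> router cycle with stubbed model calls."""
    scores = iter(scripted_scores)
    draft = ""
    score_history: List[int] = []
    iteration_count = 0
    is_approved = False

    while True:
        # Generator stub: a real node would call the generator model here.
        iteration_count += 1
        draft = f"draft v{iteration_count} for {len(source_code)} chars of source"

        # Critic stub: replays a scripted score instead of grading the draft.
        current_score = next(scores)
        score_history.append(current_score)
        is_approved = current_score >= QUALITY_SCORE_BAR

        # Router: the same two exits as execution_router.
        if is_approved or iteration_count >= MAX_ATTEMPTS:
            break

    # Adjudicator stub: a real node would harmonize the final draft here.
    return {
        "final": draft,
        "iteration_count": iteration_count,
        "is_approved": is_approved,
        "score_history": score_history,
    }


passing = run_loop_by_hand("def f(): ...", [80, 90, 97])
capped = run_loop_by_hand("def f(): ...", [60, 70, 80, 85, 99])
print(passing["iteration_count"], passing["score_history"])  # 3 [80, 90, 97]
print(capped["iteration_count"], capped["is_approved"])      # 4 False
```

Every variable threaded through that while loop corresponds to a field in AgentWorkflowState, and the if at the bottom is the router.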

Part 7: The runtime, configuration, and output cleanup

Finally, the run() function that the reflexion console script points to. It does four things: parse arguments, load configuration, run the pipeline, and clean up the output. I will not reproduce all of it here, because a lot of it is argument parsing you can read in the repo, but two pieces are worth showing.

First, the configuration loader. The model roles and loop parameters come from config.json:

{
  "model_registry": {
    "GENERATOR": "qwen3.5:9b",
    "CRITIC": "deepseek-coder:6.7b",
    "ADJUDICATOR": "llama3:8b"
  },
  "quality_score_bar": 95,
  "max_attempts": 4
}

If the file is missing or unreadable, the loader falls back to these same defaults, so the pipeline always starts. The defaults for the input, output, and config paths are resolved relative to the installed package, which with an editable install means the repo root. So, running plain reflexion from anywhere will pick up the repo's config.json and input/ folder, and you can override any of it with --input, --output, and --config.
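The loader itself is not reproduced in this article, so the following is only a minimal sketch of the fallback behaviour described above, not the repo's implementation. The names load_config and DEFAULT_CONFIG are hypothetical, and merging a partial file over the defaults is my assumption.

```python
import json
from pathlib import Path
from typing import Any, Dict

DEFAULT_CONFIG: Dict[str, Any] = {
    "model_registry": {
        "GENERATOR": "qwen3.5:9b",
        "CRITIC": "deepseek-coder:6.7b",
        "ADJUDICATOR": "llama3:8b",
    },
    "quality_score_bar": 95,
    "max_attempts": 4,
}


def load_config(path: str) -> Dict[str, Any]:
    """Hypothetical loader: use the file's settings, or the defaults if it cannot be read."""
    config = json.loads(json.dumps(DEFAULT_CONFIG))  # cheap deep copy of the defaults
    try:
        loaded = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, unreadable file, or invalid JSON
        return config
    if not isinstance(loaded, dict):
        return config
    # Assumption: keys present in the file override defaults, missing keys keep them.
    registry = loaded.get("model_registry")
    if isinstance(registry, dict):
        config["model_registry"].update(registry)
    for key in ("quality_score_bar", "max_attempts"):
        if key in loaded:
            config[key] = loaded[key]
    return config


fallback = load_config("does-not-exist.json")
print(fallback["max_attempts"])               # 4
print(fallback["model_registry"]["CRITIC"])   # deepseek-coder:6.7b
```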

Second, the output cleanup. Working with local models is what made this part necessary:

raw_final_text = final_output.get("final_adjudicated_markdown", "")
clean_final_markdown = raw_final_text

# 1. Clean Preamble (Conversational text at the top)
if "#" in raw_final_text:
    parts = raw_final_text.split("#", 1)
    if len(parts[0].strip()) > 0 and len(parts[0].strip()) < 200:
        clean_final_markdown = "#" + parts[1]

# 2. Clean Postscript (Conversational and meta review text at the bottom)
meta_prefixes = (
    "note:", "note that", "i hope", "this concludes", "in conclusion",
    "no formatting errors", "no contradictions", "no errors",
    "let me know", "feel free", "please let me know"
)
meta_keywords = ("errors or contradictions", "contradictions were found", "errors were found")

lines = clean_final_markdown.rstrip().split("\n")
while lines:
    last_line = lines[-1].strip()
    if not last_line:
        lines.pop()
        continue
    lowered = last_line.lower()
    # Real content usually carries Markdown structure; meta chatter does not.
    is_structural = lowered.startswith(("#", "|", "-", "*", ">", "`")) or lowered[0].isdigit()
    is_meta = (not is_structural) and (
        lowered.startswith(meta_prefixes) or any(k in lowered for k in meta_keywords)
    )
    if not is_meta:
        break
    lines.pop()
clean_final_markdown = "\n".join(lines).rstrip() + "\n"

A few things to note:

The preamble check. If there is text before the first # heading and it is shorter than 200 characters, it is almost certainly a "Here is the finalized document:" opener, and it gets cut. If it is longer than 200 characters, we leave it alone, because then it might be real content.
The postscript sweep. The bottom of the document is walked upwards, line by line. A line is chopped if it looks like model chatter rather than content: a "Note:" or "I hope" sign-off, or a leftover review verdict like "No formatting errors or contradictions were found.", which one of my test runs left sitting at the end of an otherwise decent document. Because it is a loop rather than a single check, several stacked sign-off lines are removed in one pass.
The structural guard. Any line that starts with a Markdown marker (a heading, a table row, a bullet, a blockquote, or a code fence) or with a digit is never touched. So a table row that happens to contain the words "no errors" survives, and only bare prose lines are candidates for removal.
This is string surgery, not intelligence, and it will not catch everything. But it catches every pattern I have actually seen the models produce, and it costs nothing.
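To see these three rules act on a concrete input, the same logic can be packaged as a function and run on a small sample. The function name clean_markdown and the sample document (including the filter_metrics heading) are made up for this illustration.

```python
META_PREFIXES = (
    "note:", "note that", "i hope", "this concludes", "in conclusion",
    "no formatting errors", "no contradictions", "no errors",
    "let me know", "feel free", "please let me know",
)
META_KEYWORDS = ("errors or contradictions", "contradictions were found", "errors were found")
STRUCTURAL_MARKERS = ("#", "|", "-", "*", ">", "`")


def clean_markdown(raw: str) -> str:
    """Apply the preamble check, postscript sweep, and structural guard."""
    text = raw
    if "#" in raw:
        head, rest = raw.split("#", 1)
        # Cut only a short, non-empty opener before the first '#'.
        if 0 < len(head.strip()) < 200:
            text = "#" + rest

    lines = text.rstrip().split("\n")
    while lines:
        lowered = lines[-1].strip().lower()
        if lowered:
            structural = lowered.startswith(STRUCTURAL_MARKERS) or lowered[0].isdigit()
            meta = lowered.startswith(META_PREFIXES) or any(k in lowered for k in META_KEYWORDS)
            if structural or not meta:
                break  # first line of real content, stop sweeping
        lines.pop()  # blank line or meta chatter
    return "\n".join(lines).rstrip() + "\n"


sample = (
    "Here is the finalized document:\n\n"
    "# filter_metrics\n\n"
    "| Exception | Message |\n"
    "|---|---|\n"
    "| TypeError | no errors are swallowed |\n\n"
    "No formatting errors or contradictions were found.\n"
    "I hope this helps!\n"
)
print(clean_markdown(sample))
```

The opener and both trailing lines are dropped, while the table row containing "no errors" is kept because it starts with a pipe.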

After cleanup, the final Markdown is written to the output path, and the console prints a run summary with the full score history and the total number of loop turns.

Trying it out

The repo ships three sample inputs: a metrics-filtering function, an API signature check, and a compound interest calculator. So, to document the first one:

reflexion --input input/source_function_a.py --output output/result.md --config config.json

You then get to watch the loop work. The Generator announces its round, the Critic prints its scorecard, and the Router explains whether it is looping back or exiting.

The output/ folder in the repo contains the results of my own runs on all three inputs, so you can see what the pipeline produces without running anything: proper parameter tables, return-value descriptions, and raised-exception tables with exact quoted error messages.

Two of the three output files contain a small error in one of their examples, of exactly the kind discussed below: a stated output that does not match what the code returns, and an example that raises instead of returning. I would rather commit honest exhibits than doctored ones, so consider spotting them an exercise.

When it goes wrong

Now, some honesty about where this goes wrong, because it does. These are failure modes I observed across my own runs. One run documented the wrong exception type with a completely invented error message, while the source raises a TypeError with a specific string.

Another run went the other way and claimed no explicit exception is raised at all. One document included a version history table with an invented "0.1 initial release" entry and a date of "TBD", none of which exists anywhere in the source. One example printed a yield ratio of roughly 0.86 where the real answer is 0.667. One otherwise decent document ended with the critic's own verdict, "No formatting errors or contradictions were found.", sitting there as the last line.

One run scored 97 against a pass bar of 95 four rounds in a row while insisting the draft was not ready, with one round's feedback collapsing into a few hundred repeated ../ tokens. And after I overcorrected the critic prompt toward harshness, the same critic scored a substantially correct draft 0 out of 100 four rounds in a row, justifying it with a claim about the source code that was simply false. The critic is not a fixed judge, it is not a calculator, and it will amplify whatever mood you put in its prompt.

Luckily, the observed failures cluster, and each cluster got a targeted mitigation in the code. The content failures, wrong exceptions and wrong arithmetic in the docs, are addressed by the critic prompt's named verification checks. The calibration failures, polite rubber-stamps at one extreme and floored zeros at the other, are addressed by the deduction-based scoring procedure in the prompt, and by moving the average and the threshold decision out of the model and into the critic node's Python.
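As a rough illustration of what taking the arithmetic away from the model means, the sketch below averages the critic's sub-scores and makes the pass decision in plain Python. The function name decide and the rubric field names are hypothetical, not the schema from Part 1.

```python
def decide(sub_scores: dict[str, int], quality_score_bar: int = 95) -> tuple[float, bool]:
    """Average the critic's sub-scores and make the pass decision in code."""
    # Clamp each value so a malformed score cannot push the average out of range.
    clamped = [max(0, min(100, int(score))) for score in sub_scores.values()]
    average = sum(clamped) / len(clamped) if clamped else 0.0
    return average, average >= quality_score_bar


# Hypothetical rubric fields, chosen only for illustration.
print(decide({"accuracy": 100, "completeness": 90, "formatting": 98}))
print(decide({"accuracy": 97, "completeness": 80, "formatting": 99}))
```

The model still supplies the individual numbers, but it no longer gets to announce a total or a verdict that contradicts them.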

The output failures, sign-offs and leftover review verdicts, are addressed by the postscript sweep, and the degenerate feedback by the sanitize guard. None of this turns the loop into a guarantee: a prompt instruction is a request, not a proof. If the scores still swing after all this, the most effective knob is not the prompt at all but the model: the critic is the hardest job in the registry, and swapping the CRITIC entry in config.json for a stronger general-purpose model costs one line and no code.
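For the degenerate-feedback case, one way such a guard could work is to look for a short chunk repeated many times back to back and swap the whole feedback for a neutral instruction. This is a hypothetical sketch, not the repo's sanitize guard; the name sanitize_feedback, the threshold of 20 repeats, and the fallback text are all assumptions.

```python
import re

FALLBACK = "Re-check the draft against the source code."

# Any chunk of 1 to 8 characters repeated 20 or more times in a row.
_REPEATED_CHUNK = re.compile(r"(.{1,8}?)\1{19,}", re.DOTALL)


def sanitize_feedback(feedback: str, fallback: str = FALLBACK) -> str:
    """Replace feedback that has collapsed into a repeated token."""
    if _REPEATED_CHUNK.search(feedback):
        return fallback
    return feedback.strip()


print(sanitize_feedback("../" * 300))
print(sanitize_feedback("  Fix the exception type.  "))
```

A pattern this blunt would also fire on a long run of dashes or spaces, so the threshold would need tuning against real critic output.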

What the loop does do, reliably, is produce a much better document than a single-shot prompt to the same generator model, and the score history gives you an honest trace of how it got there. If your runs keep hitting the attempt cap, the knobs to reach for are quality_score_bar, which you can lower from 95, and max_attempts, which you can raise from 4.

Putting it all together

We end up with a single-file pipeline of around 360 lines that:

Defines a shared state and a structured evaluation schema for the critic, with the score averaging and the pass decision done in code, not by the model.
Drafts documentation with one local model and critiques it with a different one, so nobody grades their own homework.
Routes drafts back for targeted revision until a configurable quality bar is met or an attempt cap is hit.
Runs a third model as a final adjudicator to resolve contradictions from the revision rounds.
Cleans conversational preambles and postscripts out of the final Markdown before writing it to disk.
Is configured entirely through a small config.json, so swapping models or loosening the loop needs no code changes.

There is room to extend this. You could give the critic multiple rubric dimensions and have the router pick the weakest one to fix, instead of relying on a single critical fix. You could add a fourth node that runs the documented code's examples and feeds real errors back into the loop, which would catch the hallucinated claims the critic misses. Or you could point the same Generator → Critic → Adjudicator shape at a different task entirely, translation or test generation for example, since nothing in the graph is specific to documentation.

In my previous from-scratch agents, I argued that frameworks earn their weight once your workflow has real cycles, branching, and shared state. This pipeline is exactly that case, and LangGraph handled the parts I did not want to hand-roll: the state merging, the conditional routing, the loop itself. Having built the thin versions first, I could tell which parts of the framework were doing real work for me. I would still recommend that order.

References

The full code for this article: local-reflexion-agent
LangGraph and its documentation
langchain-ollama, the chat integration used for all three roles
Ollama, for serving the local models
Reflexion: Language Agents with Verbal Reinforcement Learning, the paper behind the pattern
