Why Verification Loops? The Quality Control Problem

Before we dive into the patterns, let's talk about why verification loops matter at all.

LLMs are nondeterministic by design. They're probability engines, not databases. Ask the same question twice and you might get different answers. This isn't a bug; it's the architecture.

Think about human labor for a moment. Human workers have bad days. They make typos. They skip steps. They hallucinate details when they're not sure. Do we solve this by "finding perfect humans"? No. We build quality control processes:

  • Fact-checkers for journalism
  • Editors for publishing
  • QA testers for software
  • Code reviewers for engineering
  • Compliance audits for finance

QC is what makes human output reliable at scale. It's not about perfection; it's about catching errors systematically before they reach production.

Verification loops are the AI equivalent. They're not a nice-to-have feature. They're the fundamental quality control mechanism that separates demo-grade AI from production-grade AI.

The Key Insight

Some domains already have QC infrastructure built in. Software development, for example, has compilers, linters, test suites, and CI/CD pipelines. These domains are where LLMs shine brightest.

Other domains (creative writing, strategy, open-ended research) don't have this infrastructure. You need to build the QC from scratch.

The presence of verification infrastructure predicts LLM success. This is why coding was always the perfect use case for LLMs: we've been doing verification loops for decades.

Why Software Developers Were Ready for LLMs All Along

Here's something that clicked for me: software developers have been doing verification loops for 50+ years.

Think about it:

  • Compilers catch syntax errors instantly
  • Linters enforce code style and catch common bugs
  • Static analyzers find potential null pointer exceptions, memory leaks, security vulnerabilities
  • Test suites validate behavior against expected outputs
  • CI/CD pipelines run automated checks before deployment
  • Code reviews add human verification for complex logic

We didn't need to invent verification infrastructure for LLMs. We already had it.

This is why LLMs excel at coding tasks. The compiler provides objective, deterministic feedback. The test suite tells you immediately if the generated code works. You can feed error messages back to the LLM and get corrected code in the next iteration.

The Actionable Question

When you're applying LLMs to a new domain, ask yourself: "What's the compiler here?"

  • What automated verification can I add?
  • What structured output can I enforce?
  • What test suite can validate the results?

Domains with existing verification (coding, data extraction, structured analysis) are low-hanging fruit. The tools are already there.

Domains without built-in verification (creative writing, strategy, open-ended research) are harder precisely because there's no "compile error" to catch hallucinations. You have to build that infrastructure from scratch.
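If there's no compiler, you can still start with a few deterministic checks before any LLM-based review. As a minimal sketch (the rules, names, and interface here are illustrative, not from any particular tool), a hand-rolled "prose linter" might look like:

```typescript
interface ProseFinding {
    rule: string;
    detail: string;
}

// A hand-rolled "compiler" for prose: deterministic checks that catch
// a narrow class of problems before any LLM review runs.
// Illustrative rules only; real systems would add domain-specific checks.
function lintProse(text: string, requiredTerms: string[]): ProseFinding[] {
    const findings: ProseFinding[] = [];

    // Rule 1: every required term must appear (catches dropped topics)
    for (const term of requiredTerms) {
        if (!text.toLowerCase().includes(term.toLowerCase())) {
            findings.push({ rule: 'missing-term', detail: term });
        }
    }

    // Rule 2: no unfinished placeholders left in the draft
    if (/TODO|TBD/.test(text)) {
        findings.push({ rule: 'placeholder', detail: 'draft contains TODO/TBD markers' });
    }

    return findings;
}
```

Deterministic checks like these won't catch hallucinations on their own, but they give the loop an objective signal to anchor on, the same role the compiler plays in code.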

What Problem Do Verification Loops Solve?

If you've built anything with LLMs in production, you've faced it: the first output is a mix of genuine insight and confident fabrication. Hallucinations, logical gaps, and factual errors slip through even with the best prompts.

Verification loops address this by treating generation not as a one-shot process, but as the first step in an iterative refinement cycle. The core insight? LLMs are measurably better at verifying content than generating it.

The Solver-Verifier Gap

Research consistently shows that LLMs can spot errors in existing content more reliably than they can avoid making those errors during generation. This asymmetry is the foundation of verification loops.


This gap exists because:

  1. Verification is discriminative (is this correct?) while generation is creative (what should this be?)
  2. Errors are easier to spot than to avoid: you only need to recognize one problem, not construct a perfect solution
  3. Context helps: reviewing existing output gives the model concrete material to analyze

The Math Behind It

Recent research (Yang et al., EMNLP 2025) provides a convergence analysis showing that verification loops have diminishing returns.

The key takeaway: Rounds 1–2 capture 75% of reachable improvement. After that, you're fighting diminishing returns. This tells us to design loops that are "wide" (multiple verification strategies) rather than "deep" (endless iteration).

Source: Yang et al., "Self-Correction with LLMs Converges"
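The quantitative form lives in the paper; for intuition, a simple geometric model (an illustrative simplification, not Yang et al.'s actual formula) already reproduces the shape: if each round fixes a fraction r of the remaining catchable errors, the improvement captured after n rounds is 1 - (1 - r)^n.

```typescript
// Illustrative geometric model of diminishing returns (a simplification,
// not the paper's formula): each round fixes a fraction `fixRate` of the
// errors that remain catchable.
function improvementCaptured(fixRate: number, rounds: number): number {
    return 1 - Math.pow(1 - fixRate, rounds);
}

// With a 50% per-round fix rate:
// round 1 captures 0.50, round 2 captures 0.75, round 5 still only ~0.97
```

With r = 0.5, rounds 1–2 capture 75% of reachable improvement and rounds 3+ fight over the remaining quarter, which is exactly the "wide, not deep" argument.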

The Core Pattern: Generate → Review → Refine

The basic verification loop is straightforward:

const maxRounds = 6; // hard cap to avoid oscillation

let output = await generate(task);
let round = 0;

while (round < maxRounds) {
    round += 1;
    const findings = await review(output); // after round 1, review only the latest fixes

    // Always run at least 2 review passes before accepting the output
    if (findings.length === 0 && round >= 2) {
        break;
    }

    output = await fix(output, findings);
}

Key details:

  1. Always run at least 2 review passes, even if the first review finds nothing. Verification is probabilistic, and a second pass gives the model another independent chance to catch errors.
  2. Review only what changed. After the first round, focus on the fixes that were just applied, not the entire output. This reduces token costs and keeps the model's attention focused.
  3. Hard cap at 5–6 rounds to avoid oscillation (where the model keeps "fixing" things that weren't broken).
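Detail 2 ("review only what changed") can be as cheap as a paragraph-level diff. A minimal sketch, with an assumed helper name and a crude blank-line paragraph split (not from the source):

```typescript
// Return only the paragraphs of `current` that were not present verbatim
// in `previous`, so the reviewer sees just what the last fix round touched.
// Paragraph-level set difference is a crude diff, but cheap and often enough.
function changedParagraphs(previous: string, current: string): string[] {
    const before = new Set(
        previous.split(/\n{2,}/).map(p => p.trim()).filter(p => p.length > 0)
    );
    return current
        .split(/\n{2,}/)
        .map(p => p.trim())
        .filter(p => p.length > 0 && !before.has(p));
}
```

From round 2 onward, the review prompt can take `changedParagraphs(prevDraft, draft).join('\n\n')` instead of the full draft, cutting token costs and keeping attention on the fixes.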

Six Architectural Patterns

1. Self-Refinement Loop

The same model generates content, then reviews its own output, then revises.

async function selfRefine(task: string, maxRounds: number = 3): Promise<string> {
    let output = await llmGenerate(task);
    
    for (let round = 1; round <= maxRounds; round++) {
        console.log(`Round ${round}: Reviewing...`);
        
        const feedback = await llmGenerate(
            `Review this output for errors, inconsistencies, or improvements:\n\n${output}\n\nProvide specific feedback:`
        );
        
        // Naive stopping heuristic: keyword-matching on free-text feedback
        // is brittle; prefer a structured verdict (see Pattern 2) in production
        if (!feedback.includes('issue') && !feedback.includes('error') && round >= 2) {
            console.log('No issues found, stopping.');
            break;
        }
        
        output = await llmGenerate(
            `Revise the following based on this feedback:\n\nOriginal: ${output}\n\nFeedback: ${feedback}\n\nRevised version:`
        );
    }
    
    return output;
}

Why it works: Simple, no additional infrastructure needed.

Watch out: Same model biases may persist. For critical applications, use a different model for review.

Source: Madaan et al., "Self-Refine"

2. Evaluator-Refiner Architecture

Separate the generation and review into distinct agents with specialized prompts.

interface EvaluationResult {
    isValid: boolean;
    findings: string[];
    severity: 'critical' | 'major' | 'minor';
}

async function evaluatorRefiner(task: string): Promise<string> {
    // Generator: focused on creation
    let output = await llmGenerate(task, {
        systemPrompt: 'You are a creative generator. Focus on producing high-quality initial output.'
    });
    
    // Evaluator: focused on critique
    const evaluation = await llmGenerate<EvaluationResult>(
        `Evaluate this output for accuracy, consistency, and completeness:\n\n${output}`,
        {
            systemPrompt: 'You are a critical evaluator. Your job is to find errors and gaps. Be thorough.',
            outputSchema: evaluationSchema // JSON Schema for structured output
        }
    );
    
    if (!evaluation.isValid) {
        // Refiner: focused on fixing specific issues
        output = await llmGenerate(
            `Fix these specific issues in the following output:\n\nOriginal: ${output}\n\nIssues to fix: ${evaluation.findings.join(', ')}`,
            {
                systemPrompt: 'You are a precise refiner. Fix only the issues listed, preserve everything else.'
            }
        );
    }
    
    return output;
}

Why it works: Separation of concerns. Each agent has a clear role. ~20% improvement over self-refinement.

Production pattern: This is the AWS Bedrock Agents approach (Evaluator-Refiner pattern).

Source: AWS Prescriptive Guidance

3. Chain of Verification (CoVe)

Generate independent fact-checking questions, answer them without seeing the original draft, then revise based on verified facts.

interface VerificationQuestion {
    question: string;
    answer: string;
    contradictsOriginal: boolean;
}

async function chainOfVerification(topic: string): Promise<string> {
    // Step 1: Generate initial draft
    const draft = await llmGenerate(`Write about ${topic}`);
    
    // Step 2: Generate verification questions
    const questions = await llmGenerate<string[]>(
        `Generate 5 factual verification questions to check this content:\n\n${draft}`,
        { outputSchema: { type: 'array', items: { type: 'string' } } }
    );
    
    // Step 3: Answer each question independently (without showing draft)
    const verifiedFacts: VerificationQuestion[] = [];
    
    for (const question of questions) {
        const answer = await llmGenerate(question, {
            systemPrompt: 'Answer this factual question based on your knowledge. Do not reference any previous text.'
        });
        
        // Step 4: Check if answer contradicts draft
        const contradiction = await llmGenerate<boolean>(
            `Does this answer contradict the following statement?\n\nStatement: ${draft}\n\nAnswer: ${answer}`,
            { outputSchema: { type: 'boolean' } }
        );
        
        verifiedFacts.push({ question, answer, contradictsOriginal: contradiction });
    }
    
    // Step 5: Revise draft based on verified facts
    const contradictions = verifiedFacts.filter(f => f.contradictsOriginal);
    
    if (contradictions.length > 0) {
        return await llmGenerate(
            `Revise this draft to correct these factual errors:\n\nDraft: ${draft}\n\nVerified facts: ${contradictions.map(f => f.answer).join(', ')}`
        );
    }
    
    return draft;
}

Why it works: Prevents the model from being biased by its own draft when fact-checking. Independent verification is more reliable.

Source: Dhuliawala et al., "Chain of Verification"

4. Structured Output + Validation

Enforce output schemas (JSON Schema, Pydantic, TypeScript interfaces) and auto-retry on validation failures.

import { z } from 'zod';

const ProductSchema = z.object({
    name: z.string().min(1),
    price: z.number().positive(),
    category: z.enum(['electronics', 'clothing', 'books', 'home']),
    description: z.string().max(500)
});

type Product = z.infer<typeof ProductSchema>;

async function extractProductWithValidation(text: string): Promise<Product> {
    let attempts = 0;
    const maxAttempts = 3;
    let rawOutput: any = null;
    
    while (attempts < maxAttempts) {
        attempts++;
        console.log(`Attempt ${attempts}...`);
        
        // On a retry, validate the corrected JSON from the previous round
        // instead of generating from scratch
        if (rawOutput === null) {
            rawOutput = await llmGenerate(
                `Extract product information from this text:\n\n${text}`,
                {
                    systemPrompt: 'Extract product data as valid JSON matching the schema.',
                    outputSchema: ProductSchema // Instructor-style schema enforcement
                }
            );
        }
        
        try {
            const product = ProductSchema.parse(rawOutput);
            console.log('✓ Validation passed');
            return product;
        } catch (error) {
            if (error instanceof z.ZodError) {
                const errorMessage = error.errors.map(e => `${e.path.join('.')}: ${e.message}`).join(', ');
                console.log(`✗ Validation error: ${errorMessage}`);
                
                // Feed error back to LLM for correction
                rawOutput = await llmGenerate(
                    `Fix these validation errors in the JSON:\n\nCurrent: ${JSON.stringify(rawOutput)}\n\nErrors: ${errorMessage}\n\nCorrected JSON:`
                );
            }
        }
    }
    
    throw new Error('Failed to extract valid product after 3 attempts');
}

Why it works: Objective validation (schema enforcement) replaces subjective review. Clear error messages guide correction.

Tools: Instructor, Fructose, Zod (TypeScript)

5. Real-Time Streaming Validation

Validate output chunk-by-chunk as it streams, catching errors early before the full response is generated.

interface StreamChunk {
    content: string;
    isValid: boolean;
    error?: string;
}

async function* streamWithValidation(prompt: string): AsyncGenerator<StreamChunk> {
    const stream = await llmStream(prompt);
    let accumulatedContent = '';
    
    for await (const chunk of stream) {
        accumulatedContent += chunk.content;
        
        // Validate roughly every 100 characters of accumulated output
        if (accumulatedContent.length % 100 < chunk.content.length) {
            const validation = await validateChunk(accumulatedContent);
            
            if (!validation.isValid) {
                yield {
                    content: chunk.content,
                    isValid: false,
                    error: validation.error
                };
                
                // Optionally abort or request regeneration
                console.log(`⚠ Validation failed at ${accumulatedContent.length} chars: ${validation.error}`);
            } else {
                yield { content: chunk.content, isValid: true };
            }
        } else {
            yield { content: chunk.content, isValid: true };
        }
    }
}

async function validateChunk(content: string): Promise<{ isValid: boolean; error?: string }> {
    // Example: Check for hallucinated citations
    const citationPattern = /\[\d+\]/g;
    const citations = content.match(citationPattern) || [];
    
    if (citations.length > 0) {
        // Verify each citation exists in source material
        // (verifyCitation is assumed to be provided by your retrieval/source layer)
        for (const citation of citations) {
            const isValid = await verifyCitation(citation, content);
            if (!isValid) {
                return { isValid: false, error: `Invalid citation: ${citation}` };
            }
        }
    }
    
    return { isValid: true };
}

Why it works: Catches errors mid-generation, saving tokens and time. Enables real-time course correction.

Source: Guardrails AI — Real-time Validation

6. Human-in-the-Loop Escalation

Route high-risk outputs to human review based on confidence scores or policy violations.

interface RiskAssessment {
    riskLevel: 'low' | 'medium' | 'high';
    confidence: number;
    requiresHumanReview: boolean;
    reasons: string[];
}

async function generateWithHumanFallback(task: string): Promise<string> {
    const output = await llmGenerate(task);
    
    // Assess risk
    const risk = await llmGenerate<RiskAssessment>(
        `Assess the risk level of this output:\n\n${output}`,
        {
            systemPrompt: 'Evaluate risk based on: factual claims, potential harm, policy violations, confidence level.',
            outputSchema: {
                type: 'object',
                properties: {
                    riskLevel: { type: 'string', enum: ['low', 'medium', 'high'] },
                    confidence: { type: 'number', minimum: 0, maximum: 1 },
                    requiresHumanReview: { type: 'boolean' },
                    reasons: { type: 'array', items: { type: 'string' } }
                }
            }
        }
    );
    
    if (risk.requiresHumanReview || risk.riskLevel === 'high') {
        console.log('⚠ Escalating to human review:', risk.reasons);
        
        // Send to human review queue (Slack, email, dashboard, etc.)
        const humanApproved = await sendForHumanReview(output, risk);
        
        if (!humanApproved) {
            throw new Error('Output rejected by human reviewer');
        }
    }
    
    return output;
}

Why it works: Balances automation with human oversight for high-stakes decisions.

Production pattern: Use for medical, legal, financial, or policy-sensitive content.

Practical Implementation Examples

Example 1: Code Generation with Compiler-in-the-Loop

import { execSync } from 'child_process';
import * as fs from 'fs';
import * as path from 'path';
import * as os from 'os';

interface CodeGenerationResult {
    code: string;
    success: boolean;
    errors?: string[];
    round: number;
}

async function generateAndVerifyCode(
    task: string,
    maxRounds: number = 5
): Promise<CodeGenerationResult> {
    /**
     * Code generation with compiler-in-the-loop verification.
     * The compiler provides objective, deterministic feedback.
     */
    let output: string | null = null;
    let changes = task;
    let round = 0;
    
    while (round < maxRounds) {
        round += 1;
        console.log(`\n=== Round ${round} ===`);
        
        // Generate code
        output = await llmGenerate(changes);
        console.log(`Generated:\n${output}`);
        
        // Write to temp file
        const tempDir = fs.mkdtempSync(path.join(os.tmpdir(), 'code-gen-'));
        const tempPath = path.join(tempDir, 'generated.ts');
        fs.writeFileSync(tempPath, output);
        
        try {
            // Compile check using tsc
            try {
                execSync(`npx tsc --noEmit ${tempPath}`, { 
                    encoding: 'utf-8',
                    stdio: 'pipe'
                });
                console.log('✓ Compilation successful');
                
                // Run tests if they exist
                if (task.toLowerCase().includes('test')) {
                    try {
                        execSync(`npx jest ${tempPath}`, { 
                            encoding: 'utf-8',
                            stdio: 'inherit'
                        });
                        console.log('✓ All tests passed');
                        return { code: output!, success: true, round };
                    } catch (testError: any) {
                        const testOutput = testError.stdout || testError.stderr;
                        console.log(`✗ Test failures:\n${testOutput}`);
                        changes = `Code:\n${output}\n\nTest failures:\n${testOutput}`;
                        continue;
                    }
                }
                
                return { code: output!, success: true, round };
                
            } catch (compileError: any) {
                const errorMsg = compileError.stderr || compileError.message;
                console.log(`✗ Compilation error: ${errorMsg}`);
                changes = `Code:\n${output}\n\nCompiler error:\n${errorMsg}`;
            }
            
        } finally {
            // Cleanup temp files
            fs.rmSync(tempDir, { recursive: true, force: true });
        }
    }
    
    console.log(`⚠ Max rounds reached. Returning last output.`);
    return { code: output || '', success: false, round, errors: ['Max rounds reached'] };
}

// Usage
const result = await generateAndVerifyCode(
    "Write a function that calculates fibonacci numbers with memoization"
);

Why this works: The compiler doesn't lie. Error messages are precise and actionable. This is the "compiler as verifier" pattern that developers have used for decades.

Example 2: Data Extraction with Schema Validation

import { z } from 'zod';

const ProductSchema = z.object({
    name: z.string().min(1),
    price: z.number().positive(),
    category: z.enum(['electronics', 'clothing', 'books', 'home']),
    inStock: z.boolean(),
    tags: z.array(z.string()).max(10)
}).refine(data => data.price < 10000, {
    message: "Price seems unrealistic (over $10,000)",
    path: ["price"]
});

type Product = z.infer<typeof ProductSchema>;

async function extractProductFromText(text: string): Promise<Product> {
    let attempts = 0;
    let rawOutput: any = null;
    
    while (attempts < 3) {
        attempts++;
        console.log(`Extraction attempt ${attempts}...`);
        
        // On a retry, validate the corrected JSON from the previous round
        if (rawOutput === null) {
            rawOutput = await llmGenerate(
                `Extract product information from this text:\n\n${text}`,
                {
                    systemPrompt: 'Extract product data as valid JSON. Follow the schema exactly.',
                    outputSchema: ProductSchema
                }
            );
        }
        
        try {
            const product = ProductSchema.parse(rawOutput);
            console.log('✓ Extraction successful');
            return product;
        } catch (error) {
            if (error instanceof z.ZodError) {
                const errors = error.errors.map(e => 
                    `${e.path.join('.')}: ${e.message}`
                ).join('; ');
                console.log(`✗ Validation failed: ${errors}`);
                
                // Feed specific errors back for correction
                rawOutput = await llmGenerate(
                    `Fix these validation errors:\n\nCurrent: ${JSON.stringify(rawOutput)}\n\nErrors: ${errors}\n\nCorrected JSON:`
                );
            }
        }
    }
    
    throw new Error(`Failed to extract valid product after ${attempts} attempts`);
}

// Usage
const product = await extractProductFromText(
    "The new XPhone Pro costs $999 and comes in black, white, and blue. It's available now in the electronics section."
);
// { name: "XPhone Pro", price: 999, category: "electronics", inStock: true, tags: [] }

Example 3: Factual Content with Chain of Verification

interface VerificationResult {
    question: string;
    answer: string;
    contradictsDraft: boolean;
}

async function generateVerifiedArticle(topic: string, maxQuestions: number = 5): Promise<string> {
    // Step 1: Generate initial draft
    console.log('Generating initial draft...');
    const draft = await llmGenerate(`Write a detailed article about ${topic}. Include facts, dates, and key figures.`);
    
    // Step 2: Generate verification questions
    console.log('Generating verification questions...');
    const questions = await llmGenerate<string[]>(
        `Generate ${maxQuestions} factual verification questions to check this content for accuracy:\n\n${draft}`,
        { outputSchema: { type: 'array', items: { type: 'string' } } }
    );
    
    // Step 3: Answer each question independently
    console.log('Answering verification questions...');
    const verifications: VerificationResult[] = [];
    
    for (const question of questions) {
        // Answer without showing draft (prevents bias)
        const answer = await llmGenerate(question, {
            systemPrompt: 'Answer this factual question based on your knowledge. Be precise and cite sources if possible.'
        });
        
        // Check for contradictions
        const contradicts = await llmGenerate<boolean>(
            `Does this answer contradict the following statement?\n\nStatement: ${draft}\n\nAnswer: ${answer}\n\nReply true or false:`,
            { outputSchema: { type: 'boolean' } }
        );
        
        verifications.push({ question, answer, contradictsDraft: contradicts });
        
        if (contradicts) {
            console.log(`⚠ Contradiction found: ${question}`);
        }
    }
    
    // Step 4: Revise based on verified facts
    const contradictions = verifications.filter(v => v.contradictsDraft);
    
    if (contradictions.length > 0) {
        console.log(`Revising draft based on ${contradictions.length} contradictions...`);
        const revisedDraft = await llmGenerate(
            `Revise this article to correct these factual errors:\n\nOriginal Draft: ${draft}\n\nVerified Facts: ${contradictions.map(v => v.answer).join('\n')}\n\nRevised Article:`
        );
        return revisedDraft;
    }
    
    console.log('✓ No contradictions found');
    return draft;
}

// Usage
const article = await generateVerifiedArticle("The history of the Internet");

Example 4: API Response Validation with Policy Checks

import { z } from 'zod';

const APICallSchema = z.object({
    method: z.enum(['GET', 'POST', 'PUT', 'DELETE']),
    endpoint: z.string().url(),
    headers: z.record(z.string()),
    body: z.record(z.any()).optional(),
    timeout: z.number().positive().max(30000)
});

type APICall = z.infer<typeof APICallSchema>;

class PolicyChecker {
    private allowedEndpoints = new Set([
        'https://api.example.com/users',
        'https://api.example.com/products',
        'https://api.example.com/orders'
    ]);
    
    private blockedMethods = new Set(['DELETE']);
    
    check(call: APICall): { valid: boolean; reason?: string } {
        // Policy 1: Endpoint allowlist
        if (!this.allowedEndpoints.has(call.endpoint)) {
            return { valid: false, reason: `Endpoint not allowlisted: ${call.endpoint}` };
        }
        
        // Policy 2: Blocked methods
        if (this.blockedMethods.has(call.method)) {
            return { valid: false, reason: `Method ${call.method} is blocked` };
        }
        
        // Policy 3: Body size limit
        if (call.body && JSON.stringify(call.body).length > 10000) {
            return { valid: false, reason: 'Request body exceeds 10KB limit' };
        }
        
        return { valid: true };
    }
}

async function generateAndValidateAPICall(userRequest: string): Promise<APICall> {
    const policyChecker = new PolicyChecker();
    let attempts = 0;
    let rawCall: any = null;
    
    while (attempts < 3) {
        attempts++;
        console.log(`Generation attempt ${attempts}...`);
        
        // On a retry, validate the corrected call from the previous round
        if (rawCall === null) {
            rawCall = await llmGenerate(
                `Generate an API call for this request: ${userRequest}`,
                { outputSchema: APICallSchema }
            );
        }
        
        // Schema validation
        try {
            const apiCall = APICallSchema.parse(rawCall);
            
            // Policy checks
            const policyResult = policyChecker.check(apiCall);
            if (!policyResult.valid) {
                console.log(`✗ Policy violation: ${policyResult.reason}`);
                throw new Error(policyResult.reason);
            }
            
            // Dry run in sandbox
            console.log('✓ Schema and policy validation passed');
            console.log('Dry run:', apiCall);
            
            return apiCall;
            
        } catch (error: any) {
            console.log(`✗ Validation failed: ${error.message}`);
            
            // Feed error back for correction
            rawCall = await llmGenerate(
                `Fix this API call:\n\nCurrent: ${JSON.stringify(rawCall)}\n\nError: ${error.message}\n\nCorrected:`
            );
        }
    }
    
    throw new Error('Failed to generate valid API call after 3 attempts');
}

Example 5: Content Review with Structured Findings

import { z } from 'zod';

const FindingSchema = z.object({
    type: z.enum(['factual_error', 'logical_gap', 'clarity_issue', 'tone_mismatch']),
    severity: z.enum(['critical', 'major', 'minor']),
    location: z.string().describe('Line number or section description'),
    description: z.string(),
    suggestedFix: z.string()
});

type Finding = z.infer<typeof FindingSchema>;

async function reviewAndReviseContent(topic: string, maxRounds: number = 3): Promise<string> {
    let content = await llmGenerate(`Write about ${topic}`);
    
    for (let round = 1; round <= maxRounds; round++) {
        console.log(`\n=== Review Round ${round} ===`);
        
        // Generate structured findings
        const findings = await llmGenerate<Finding[]>(
            `Review this content for errors and issues:\n\n${content}`,
            {
                systemPrompt: 'Identify specific issues. Be critical but constructive.',
                outputSchema: { type: 'array', items: FindingSchema }
            }
        );
        
        console.log(`Found ${findings.length} issues`);
        
        // Filter to only critical/major issues
        const criticalFindings = findings.filter(f => 
            f.severity === 'critical' || f.severity === 'major'
        );
        
        if (criticalFindings.length === 0 && round >= 2) {
            console.log('✓ No critical issues, content ready');
            break;
        }
        
        // Apply fixes
        const fixPrompt = criticalFindings.map(f => 
            `Issue: ${f.description}\nFix: ${f.suggestedFix}`
        ).join('\n\n');
        
        content = await llmGenerate(
            `Revise this content to address these issues:\n\nCurrent Content: ${content}\n\nIssues to Fix:\n${fixPrompt}`
        );
    }
    
    return content;
}

// Usage
const article = await reviewAndReviseContent("The impact of AI on software development");

When Verification Loops Don't Work

Verification loops aren't a cure-all. Without an external signal to ground the review step (a compiler, a test suite, a retrieval source), intrinsic self-correction often fails to improve output and can even degrade it (Huang et al., ICLR 2024). Retrieval-augmented verification, as in CoV-RAG, is one way to supply that missing external signal.

Source: He et al., "CoV-RAG"

Key Takeaways

  1. Verification loops are quality control for AI, just like QC processes are for human labor. They're not optional for production systems.
  2. Developers have been doing this for decades. Compilers, linters, test suites, CI/CD: these are all verification loops. This is why LLMs work so well for coding.
  3. The solver-verifier gap is real. LLMs are better at reviewing than generating. Use this asymmetry.
  4. Rounds 1–2 capture 75% of improvement. Don't over-engineer deep loops. Make them wide instead (multiple verification strategies).
  5. Separate generation and review prompts. ~20% improvement over self-refinement.
  6. Always run at least 2 review passes. Even if the first finds nothing.
  7. Hard cap at 5–6 rounds. Avoid oscillation.
  8. Domains with existing verification infrastructure are low-hanging fruit. Ask "What's the compiler here?"
  9. For domains without verification, you must build it. Use structured output, policy checks, or human-in-the-loop.
  10. Use typed actions with JSON Schema. Add policy checks on top. Dry run in sandbox before real execution.

References

  1. Yang et al., "Self-Correction with LLMs Converges" (EMNLP 2025) — ACL Anthology
  2. Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback" (NeurIPS 2023) — arXiv:2303.17651
  3. Dhuliawala et al., "Chain of Verification Reduces Hallucination in LLMs" — arXiv:2309.11495
  4. Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet" (ICLR 2024) — arXiv:2310.01798
  5. Liu et al., "Self-Correction in LLMs is Real but Limited" — arXiv:2406.02378
  6. He et al., "CoV-RAG: Enhancing Retrieval-Augmented Generation with Chain of Verification" — Papers With Code
  7. AWS Prescriptive Guidance, "Evaluator-Refiner Pattern" — AWS Docs
  8. Guardrails AI, "Real-time Validation for Streaming LLM Outputs" — Guardrails Blog
  9. Loth, Y., "LLM Verification Loops: The Math Behind Self-Correction" — DEV Community