May 21, 2026
Code Is Not Cheap: How to Multiply Your AI’s Output With Software Fundamentals
Karpathy, Pocock, and the Software Fundamentals the AI Coding Hype Forgot
Yanli Liu
In February 2025, Andrej Karpathy coined "vibe coding": describe what you want, let AI write the code, forget the code exists. It caught fire. Everyone wanted to believe coding had become as easy as talking.
One year later, Karpathy renamed it. The new term: "agentic engineering." His explanation was pointed. "'Engineering' to emphasize that there is an art and science and expertise to it." He'd gone from 80% manual coding to 80% agent coding in weeks, and discovered the hard way that models are "jagged" — brilliant at hard problems, then tripping over the obvious.
The data backs him up. GitClear's 2025 code quality study found that AI-coauthored pull requests have 1.7x more issues than human-only PRs. Copy-pasted code lines rose from 8.3% to 12.3% between 2021 and 2024. Meanwhile, AI now writes 41% of all code on GitHub, with 4.7 million paid Copilot subscribers.
We're producing more code than ever. It's also buggier than ever. That's not a productivity boom. That's technical debt at compound interest rates.
Matt Pocock put it bluntly in his AI Hero conference talk: "Code is not cheap. Bad code is the most expensive it's ever been."
His reasoning: if your codebase is hard to change, you can't absorb AI's output. Every suggestion, every generated function, every automated refactor runs into the friction of bad architecture. AI doesn't fix structural problems. It amplifies them.
Here's the thesis: AI is a multiplier, not a magic wand. It compounds whatever architecture you put in front of it. Good structure returns more value with every interaction. Bad structure accelerates rot with every prompt.
Five software fundamentals separate the teams shipping confidently from those drowning in AI-generated spaghetti. None of them are new. All of them are more important than they've ever been.
1. Align Before You Build
Karpathy identified four structural failure modes of AI coding agents. The first: "The models make wrong assumptions on your behalf and just run along with them without checking." You think you asked for a simple API endpoint. The AI built a microservice with authentication, rate limiting, and a database migration you never mentioned.
The problem isn't the AI's capability. It's that you and the AI don't share a mental model of what you're building.
Frederick Brooks wrote about this in The Design of Design. He called it the "design concept" — the invisible, shared theory of the thing you're creating. It's not a document. It's not a spec file. It's the understanding that exists between collaborators about what they're building and why. When two humans pair-program, they build this naturally through conversation. When you prompt an AI, that shared understanding doesn't exist by default.
Pocock's solution was a Claude Code skill he called "grill me." The entire instruction is two lines:
Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one by one.
It went viral. 97,000+ GitHub stars. The AI asks 40, 60, sometimes 100 questions before it's satisfied. It turns the agent into an adversary who won't let you skip the thinking.
The principle: Don't let AI start coding until you share a mental model. Write the brief, define the constraints, answer the hard questions first. The 20 minutes you spend aligning saves the 3 hours you'd spend undoing wrong assumptions.
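One way to picture "walking down each branch of the design tree" is as a dependency graph of open decisions that all have to be answered, in order, before any code is written. The sketch below is illustrative only: the decision names and the helpers `interview_order`, `next_question`, and `ready_to_build` are hypothetical and are not part of Pocock's skill. It relies on Python's standard `graphlib` module.

```python
from __future__ import annotations

from graphlib import TopologicalSorter

# Hypothetical open decisions for a small feature. Each decision maps to
# the decisions that must be settled before it can be answered.
DECISIONS = {
    "auth_required": set(),
    "endpoint_shape": {"auth_required"},
    "rate_limiting": {"auth_required"},
    "storage": {"endpoint_shape"},
}


def interview_order(decisions: dict[str, set[str]]) -> list[str]:
    """Return the decisions in an order that respects their dependencies."""
    return list(TopologicalSorter(decisions).static_order())


def next_question(decisions: dict[str, set[str]], answers: dict[str, str]) -> str | None:
    """Return the first unanswered decision; its prerequisites are already answered."""
    for name in interview_order(decisions):
        if not answers.get(name, "").strip():
            return name
    return None


def ready_to_build(decisions: dict[str, set[str]], answers: dict[str, str]) -> bool:
    """Coding may start only once every decision has a non-empty answer."""
    return next_question(decisions, answers) is None
```

With no answers recorded, `next_question(DECISIONS, {})` returns `"auth_required"`, because every other decision depends on it. The agent (or you) keeps asking until `ready_to_build` returns `True`.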
2. Speak the Same Language
Alignment on the plan isn't enough if you and the AI are using the same words to mean different things.
Karpathy's second failure mode: "Overcomplicated solutions. Ask for a 10-line function, get a 200-line enterprise framework."
The AI isn't being malicious. It just doesn't know your vocabulary. When you say "handler," do you mean an HTTP handler, an event handler, or a log handler? When you say "service," is that a microservice, a background worker, or a class? The AI guesses. It guesses wrong. And then it builds a 200-line monument to that wrong guess.
This is a problem that software engineering solved more than 20 years ago. Eric Evans published Domain-Driven Design in 2003, and the core concept was the "ubiquitous language": a shared vocabulary that developers, domain experts, and the codebase itself all use consistently. Every conversation, every variable name, every API endpoint uses the same terms to mean the same things.
Evans told InfoQ recently: "Training a language model on a ubiquitous language of a bounded context makes it far more useful for specific needs compared to using generic LLMs." Your domain glossary is now a prompt-engineering asset.
Pocock built a skill for this too. His "ubiquitous language" skill scans your codebase, extracts terminology, and generates a markdown file full of definition tables. Reading the AI's thinking traces, he found that the ubiquitous language didn't just improve planning. It made the AI think less verbosely. Fewer wasted tokens. More accurate implementation. The generated code actually matched what was planned.
DDD's bounded contexts map here too. In a multi-agent setup, unclear term ownership causes "context bleed" — agents stepping on each other's domain, duplicating logic, or contradicting each other's output. The same failure mode Evans described in human teams 20 years ago, now showing up in agent orchestration.
The principle: Build a glossary. Define your domain terms precisely. Reference it in your project docs and your prompts. If the AI knows that "order" means "a customer commitment in the Order Management context" and not "a sort directive," it stops guessing and starts generating code that fits.
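A glossary can live as plain data that is rendered into your project docs or prepended to a prompt. The following is a minimal sketch under stated assumptions: the `Term` class, the sample entries, and both helper functions are hypothetical, and this is not how Pocock's skill is implemented. The "order" entry mirrors the example above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str
    context: str      # the bounded context that owns the term
    definition: str


# Hypothetical glossary entries for illustration.
GLOSSARY = [
    Term("order", "Order Management", "A customer commitment to purchase."),
    Term("handler", "HTTP API", "A function that serves one HTTP route."),
]


def render_glossary(terms: list[Term]) -> str:
    """Render the glossary as a markdown table for a project doc or prompt."""
    lines = ["| Term | Context | Definition |", "| --- | --- | --- |"]
    for term in sorted(terms, key=lambda t: t.name):
        lines.append(f"| {term.name} | {term.context} | {term.definition} |")
    return "\n".join(lines)


def undefined_terms(prompt: str, known_jargon: set[str], terms: list[Term]) -> set[str]:
    """Flag jargon used in a prompt that the glossary does not define yet."""
    defined = {t.name for t in terms}
    words = {w.strip(".,:;!?()\"'").lower() for w in prompt.split()}
    return (words & known_jargon) - defined
```

Running `undefined_terms` over a prompt before sending it is a cheap way to catch an ambiguous word such as "service" while it is still a one-line fix in the glossary.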
3. Test First, Ship Small
You've aligned on the plan. You speak the same language. The AI builds exactly what you asked for. And it doesn't work.
This is the failure mode that burns the most time, because the code looks right. It reads well. The variable names make sense. But it falls over the moment you run it, because the AI produced 400 lines in one shot without checking any of them.
The Pragmatic Programmer calls this "outrunning your headlights." The rate of feedback is your speed limit. Drive faster than your headlights can illuminate, and you'll hit something you didn't see coming. AI agents, by default, drive with their lights off. They produce huge batches of code and then think, maybe, about verifying it afterward.
Anthropic's best practices recommend a writer/reviewer pattern: "One Claude writes tests, a second writes code to pass them." OpenAI's Codex documentation says it plainly: "Without tests, Codex verifies its work using its own judgment. Tests create an external source of truth."
Both platforms landed on the same answer: test-driven development. Not as a philosophy. As a mechanical constraint that forces the AI to take small steps.
Write the test first. Let the AI make it pass. Refactor. Red-green-refactor isn't a relic from the Agile era. It's the feedback loop that prevents entropy. Each cycle costs a fraction of what a big-batch rewrite costs in tokens, context window space, and your own review time.
A May 2025 arXiv study on AI-generated code found that code smell density increases as agents move from isolated scripts to multi-module systems. Without test boundaries, the rot scales with the codebase. With them, each module stays honest.
The principle: TDD isn't optional when coding with AI. It's the speed governor that keeps the output trustworthy. Write the test, let the agent implement, verify, repeat. Small cycles, high confidence.
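As a concrete picture of one red-green cycle, here is a minimal sketch using Python's standard `unittest` module. The `order_total` function and its rules are hypothetical. The assumption is that the test class is written first and fails (red), and the implementation is then kept as small as the tests allow (green).

```python
import unittest


def order_total(prices: list[float], discount_rate: float = 0.0) -> float:
    """Minimal implementation, written only after the tests below existed."""
    if not 0.0 <= discount_rate <= 1.0:
        raise ValueError("discount_rate must be between 0 and 1")
    return round(sum(prices) * (1.0 - discount_rate), 2)


class OrderTotalTest(unittest.TestCase):
    # Red: these tests come first and fail until order_total exists
    # and behaves as specified.
    def test_sums_prices(self):
        self.assertEqual(order_total([10.0, 5.5]), 15.5)

    def test_applies_discount(self):
        self.assertEqual(order_total([100.0], discount_rate=0.2), 80.0)

    def test_rejects_invalid_discount(self):
        with self.assertRaises(ValueError):
            order_total([10.0], discount_rate=1.5)
```

Saved as a module, this runs with `python -m unittest`. The next cycle starts by adding one more failing test, not by asking the agent for another 400 lines.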
4. Build Deep, Not Wide
Even with tests passing and the right thing being built, there's a structural problem that shows up as your codebase grows. The AI starts getting lost. It can't find the right file. It misunderstands dependencies. It makes changes that break things three modules away.
John Ousterhout described this in A Philosophy of Software Design. He drew a distinction between deep modules and shallow modules:
- Deep modules hide a lot of functionality behind a simple interface. You don't need to understand the internals to use them.
- Shallow modules are the opposite: not much functionality, but a complex interface that forces you to understand everything underneath.
AI agents are exceptionally good at generating shallow modules. Lots of tiny files, each doing one small thing, each exposing multiple functions, each depending on three other tiny files. It looks clean. It reads well in a code review. But it creates a navigation nightmare for the next AI session, because the agent has to walk through dozens of interconnected blobs to understand what your code actually does.
Pocock demonstrated this visually in his talk. A codebase full of shallow modules looks like scattered dots with tangled arrows between them. The same code reorganized into deep modules looks like a few large blocks with simple connections on top. The AI navigates the second structure and generates better code inside it, because it can reason about the interface without loading every implementation detail into its context window.
The arXiv study on AI-generated code smells points the same way: code smell density rises as agents move from isolated scripts to multi-module systems. The implication is that models do not keep track of a system's overall architecture while they generate code, so the more fragmented your module structure, the more the AI's output degrades.
There's a market signal here too. TypeScript became GitHub's #1 language in August 2025. Part of the reason: typed, well-structured code makes AI-assisted development more reliable. Developers are self-selecting toward architectures that compound AI returns, the way a well-diversified portfolio compounds market returns. The ones clinging to untyped, fragmented codebases are paying for it in rework.
The principle: Audit your codebase for shallow modules. Wrap related code into deep modules with simple interfaces. Test at the boundary. The AI doesn't need to see everything. It needs to see the right interfaces.
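To make the deep-module idea concrete, here is a small sketch: a hypothetical `DocumentStore` whose two public methods hide encoding, checksumming, and storage. The class and its behavior are assumptions for illustration, not code from Ousterhout or Pocock, and storage is kept in memory to stay self-contained.

```python
import hashlib
import json


class DocumentStore:
    """A deep module: two public methods hide encoding, checksums, and storage."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}
        self._checksums: dict[str, str] = {}

    def save(self, key: str, document: dict) -> None:
        blob = self._encode(document)
        self._blobs[key] = blob
        self._checksums[key] = self._checksum(blob)

    def load(self, key: str) -> dict:
        blob = self._blobs[key]
        if self._checksum(blob) != self._checksums[key]:
            raise ValueError(f"stored document {key!r} is corrupted")
        return json.loads(blob.decode("utf-8"))

    # Internal details: callers, human or agent, never need to read these.
    @staticmethod
    def _encode(document: dict) -> bytes:
        return json.dumps(document, sort_keys=True).encode("utf-8")

    @staticmethod
    def _checksum(blob: bytes) -> str:
        return hashlib.sha256(blob).hexdigest()
```

A shallow version of the same code would expose `encode`, `checksum`, and a storage dict as separate pieces that every caller has to wire together. Here an agent only needs `save` and `load` in its context window, and tests target those two methods.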
5. Design the Interface, Delegate the Implementation
If the first four principles protect the quality of AI output, this one protects you.
Raise your hand if you've felt more mentally exhausted than ever since AI coding tools became part of your workflow. You're not alone. Pocock asked his conference audience the same question. Almost every hand went up.
The exhaustion comes from trying to review everything. Every generated function, every refactored class, every new file the agent created. You're shipping more code than ever before, but your brain is still the bottleneck for understanding all of it.
Kent Beck's advice: "Invest in the design of the system every day."
The specs-to-code movement does the opposite. It divests from design. It treats the codebase as disposable output you regenerate from a prompt. And that's how you end up reviewing 400 lines of code you didn't write and can barely follow.
The alternative is the gray box model. You own the interface. You own the tests at the boundary. You let the AI handle what's inside the module. For non-critical modules, you don't need to review every line of implementation. You need to verify the contract works.
This is what Karpathy means when he says the new core skill is "judgment — what to delegate, how to specify it, how to review it fast." You're not writing less code because you're lazy. You're writing less code because you're spending that time on architecture, interfaces, and verification. The strategic layer.
Pocock frames it as the difference between a tactical programmer and a strategic one. The AI is the tactical programmer, the sergeant on the ground making code changes. You're the one thinking about system design, module boundaries, and how the pieces fit together. That's not a demotion. It's a promotion.
Anthropic's internal data shows 2–3x productivity gains on large-scale refactoring tasks. But those gains come from teams that trust the architecture enough to delegate confidently, not from teams reviewing every generated line.
The principle: Write the interface. Specify the contract. Delegate the implementation to AI. Verify through tests, not line-by-line code review. Your job is system design. Let the agent handle the rest.
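The gray box model can be sketched as an interface plus a contract check that any implementation must pass. Everything named here is hypothetical: the `RateLimiter` protocol, the `check_rate_limiter_contract` function, and `CountingLimiter`, which stands in for an implementation an agent might write.

```python
from typing import Callable, Protocol


class RateLimiter(Protocol):
    """The interface you own. Implementations can be delegated."""

    def allow(self, client_id: str) -> bool:
        """Return True if the client may make another request."""
        ...


def check_rate_limiter_contract(make_limiter: Callable[[int], RateLimiter]) -> None:
    """Boundary test: any implementation must pass, whatever its internals."""
    limiter = make_limiter(2)  # at most 2 requests per client
    assert limiter.allow("alice") is True
    assert limiter.allow("alice") is True
    assert limiter.allow("alice") is False  # third request is rejected
    assert limiter.allow("bob") is True     # limits are tracked per client


class CountingLimiter:
    """Stand-in for an agent-written implementation, verified via the contract."""

    def __init__(self, max_requests: int) -> None:
        self._max = max_requests
        self._seen: dict[str, int] = {}

    def allow(self, client_id: str) -> bool:
        count = self._seen.get(client_id, 0)
        if count >= self._max:
            return False
        self._seen[client_id] = count + 1
        return True
```

Calling `check_rate_limiter_contract(CountingLimiter)` passes silently. If the agent later rewrites the internals, the same call is the review: you read the contract, not every generated line.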
The Toolkit: From Principles to Practice
Principles without tools are just advice. Here's what you can install today.
Pocock published his skills as an open-source repo. Each one maps directly to the principles above:
- grill-me forces shared understanding before the AI writes anything (Principle 1)
- ubiquitous-language scans your codebase and builds a domain glossary (Principle 2)
- tdd enforces the red-green-refactor cycle per module (Principle 3)
- improve-codebase-architecture identifies shallow modules and wraps them into deep ones (Principle 4)
- write-a-prd specifies module changes and interface contracts inside the PRD (Principle 5)
But these principles aren't just one developer's opinion. The major platforms independently built infrastructure to enforce the same ideas.
- Anthropic's Claude Code uses CLAUDE.md as the alignment contract, supports subagents scoped to isolated tasks with clean context windows, and recommends a writer/reviewer pattern where one agent writes tests and another writes implementation.
- OpenAI's Codex uses AGENTS.md for the same purpose, with explicit "done when" criteria that force test verification before the agent considers a task complete.
- GitHub Copilot's agent mode builds a semantic index of your repo (37.6% better retrieval accuracy than early 2025) and supports custom agents and prompt files as reusable blueprints.
Three platforms. Same structural answer. The alignment doc, the test loop, the modular boundaries. They all converged because they all hit the same wall: model capability isn't the bottleneck. Code architecture is.
The Through-Line
Brooks wrote about conceptual integrity in 1975 and returned to the shared design concept in The Design of Design in 2010. Evans published ubiquitous language in 2003. Ousterhout drew the deep module diagram in 2018. Beck has been saying "invest in design every day" for decades.
Karpathy discovered the same lessons in a matter of weeks, under pressure, building with the most capable AI models on the planet. Pocock distilled them into skills that went viral because thousands of developers recognized their own pain.
None of this is new knowledge. That's the point.
A clever prompt gets you one good output. A well-structured codebase gets you a thousand. Every clean interface, every enforced test boundary, every deep module compounds the value of every AI interaction that follows. Skip the architecture, and you're compounding the debt instead.
Code is not cheap. Your architecture is your AI strategy.