The Bitter Lesson of Agentic Coding
Audience: Software engineers building toward autonomous coding loops.
Reading time: ~15 minutes.
In 2019 Rich Sutton published “The Bitter Lesson,” a short essay that became one of the most cited arguments in AI research. His claim: across 70 years of AI, general methods that leverage computation consistently beat approaches built on hand-engineered human knowledge. The pattern repeated across chess, Go, vision, and speech. The lesson is bitter because the domain knowledge feels like it should matter. It does matter, briefly, and then computation catches up and passes it.
There is a new bitter lesson, and this one is for software engineers.
The original bitter lesson told ML researchers: your hand-crafted features and domain heuristics will be crushed by learned representations. The new bitter lesson tells us: your hand-crafted implementations, your carefully engineered code, your hard-earned ability to write precise solutions, will be less valuable than carefully defining what you want and letting the model figure out how to build it. The bitterness comes from the same place: the skill you spent years developing is the exact skill you need to let go of. For ML researchers, it was feature engineering. For software engineers, it is writing code.
The instinct to over-specify, to write the implementation ourselves, to control every step, is not just an architectural mistake, it is a professional reflex that actively prevents us from getting the best results from the tools we now have. Structure still matters, but not the kind you’d expect. That is the bitter part: the structure you need is not the structure you spent a career learning to build.
The bitter lesson for agentic coding: let go of implementation control. Invest in verification and goal-setting instead.
Define what done looks like. Let the model figure out how to get there. Build the simplest harness that lets you check its work.
In short:
- Today’s bottleneck is not the model. It is your instinct to control what it writes.
- Past a certain complexity, prompting harder is an architectural error. The fix is structural.
- Specs are not documentation, they are the control plane for autonomous agents.
- The right harness is thin on orchestration and thick on verification and memory.
I reached this view while building a seven-stage handwriting style-transfer pipeline where numerical metrics and human judgment regularly disagreed. The full account is below. The short version: spec-driven turns with adversarial review held up.
The One-Shot Ceiling
If models keep getting better, why bother with methodology at all? Why not just describe what you want and let the model build it?
The honest answer is that you can, up to a point. That point moves outward with every model generation. What frontier models can one-shot today would have been impossible a year ago. Models are genuinely excellent at generating complete, working software from a description.
But there is always a ceiling, and it is structural, not a temporary limitation waiting to be patched. Five properties of complex projects guarantee it:
1. Human intent is discovered, not transmitted. Users refine what they want by reacting to what they see. A one-shot denies the user this feedback opportunity. The result may be technically correct and still miss the point, because the point only became clear through iteration.
2. Requirements are underspecified. A prompt to “add authentication” implies dozens of decisions (session lifetime, token storage, error UX, rate limiting) that the requester hasn’t articulated and may not have opinions on until they see a working version. An agent making all these decisions at once will get some wrong, and the cost of correcting compound errors is higher than making them incrementally.
3. Errors compound non-linearly. A wrong abstraction in step 3 of 20 doesn’t just make step 3 wrong. It warps steps 4 through 20 to fit the bad abstraction. In an iterative loop, the wrong abstraction is caught at step 4 and corrected. In a one-shot, the agent builds confidently on a flawed foundation because it has no external signal that anything is off.
4. Context degrades with scale. As generation grows, early decisions (architecture, naming, module boundaries) get diluted by later code. The agent loses coherence with its own earlier choices. Sustained coherence requires periodic re-grounding from persistent artifacts.
5. Verification requires a different mode than generation. When the same agent generates and evaluates in a single pass, it is biased toward confirming its own choices. Better models do not fix this because the bias comes from the structure of doing both jobs in one pass, not from insufficient intelligence.
These are not model limitations, they are properties of complex systems interacting with sequential decision-making. Better models push the ceiling higher but do not remove it.
I call this the state-of-the-one-shot: the upper bound on project complexity that a model can reliably handle in a single pass at any given moment. It is real, it is impressive, and it rises with every model generation, but it is always a bound.
If you work within the state-of-the-one-shot, you do not need methodology, specs, or verification loops. This is vibe coding, and it works because the model is genuinely capable enough for the complexity you are targeting. Vibe coding is not the problem. Mistaking the ceiling for the sky is.
When you hit a project that exceeds the current state-of-the-one-shot, you get the pattern everyone recognizes: 80% done fast, then “make it better” produces lateral movement instead of convergence. The agent generates plausible changes that do not get closer to the goal, because there is no verification contract defining what the goal is, and no structural separation between generation and evaluation. You are stuck at 80%, and more prompting does not unstick you, because the obstacle is architectural, not linguistic.
The rest of this post describes a methodology for the complexity band above the one-shot ceiling: the space between “the model can handle it alone” and “no current approach works.” That band is where the interesting engineering lives.
Spec-Driven Development: The Control Mechanism
The spec is the answer to the one-shot ceiling. Not a spec in the traditional sense of a document that defines requirements for humans, but a spec tuned for a specific purpose: giving an autonomous agent and its review loop a concrete target to verify against. For longer-running loops, the spec also anchors memory across sessions, so each new turn inherits not just what to build but what “done” means.
The problem the spec solves is drift. Agents without concrete acceptance criteria optimize for making tests pass rather than solving the problem. “Works but not good enough” stays vague indefinitely. An agent will happily generate code that compiles, passes type checks, and satisfies a test suite, while completely missing the actual goal. I have watched this happen repeatedly on complex projects.
Nicholas Carlini demonstrated why this matters at scale. His 16 parallel Claude instances produced a 100,000-line C compiler written in Rust, passing 99% of GCC’s torture test suite and compiling Linux on three architectures. Nearly 2,000 sessions. Two weeks. Approximately $20,000. The result is a marvel, but the detail that changed how I think was where Carlini invested his time: not in engineering prompts, but in building test harnesses. His core insight: the quality of your verification loop determines the ceiling of your agent’s output.
“Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.”
– Nicholas Carlini
A well-written acceptance criterion is worth more than a well-written prompt, because it tells both the agent and the review loop what to verify. When I write a spec, I define what done looks like in concrete, checkable terms. The spec sits upstream of everything else: code review checks spec alignment, the test strategy checks criteria coverage, architecture review evaluates whether the design serves the spec’s goals. Without the spec, the rest of the system has nothing to anchor to.
Turns, Convergence, and the Autonomous Loop
I use the term turn to describe one complete pass through the spec-implement-evaluate cycle. A turn starts with a spec (or an inherited proposal from the previous turn), proceeds through implementation and review, and ends with a retrospective and a proposal for the next turn. The proposal is written to disk, so a fresh agent session can pick it up without depending on conversation memory.
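A turn can be sketched as a small function; the `implement` and `evaluate` stubs below stand in for the agent and review passes, and the file name and field names are illustrative assumptions, not a fixed format:

```python
import json
from pathlib import Path

PROPOSAL = Path("next_turn.json")

def implement(spec):
    # Stand-in for the agent's implementation pass.
    return {"notes": ["chose streaming parser over batch"]}

def evaluate(result, criteria):
    # Stand-in for the adversarial review pass; returns the open issues.
    return [c["name"] for c in criteria if c.get("status") != "pass"]

def run_turn(spec):
    """One pass through spec -> implement -> evaluate -> propose."""
    result = implement(spec)
    issues = evaluate(result, spec["criteria"])
    proposal = {
        "goal": spec["goal"],
        "open_issues": issues,
        "learned": result["notes"],
    }
    # Written to disk so a fresh session can pick it up without chat memory.
    PROPOSAL.write_text(json.dumps(proposal, indent=2))
    return proposal
```

The only load-bearing line is the last write: everything the next turn needs must survive the session boundary on disk.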
This is the whole methodology in one loop: SPEC → IMPLEMENT → EVALUATE → PROPOSE, then back to SPEC for the next turn.
The evaluate phase is where most systems fail, and fixing it requires structural separation. Prithvi Rajasekaran formalized this in his work on harness design for long-running development. He showed that separating generation from evaluation, even when the same model does both, produces measurably better results. His generator-evaluator architecture is modeled on GANs: the generator proposes, the evaluator critiques, and the tension between them drives quality upward.
“When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work, even when, to a human observer, the quality is obviously mediocre.”
– Prithvi Rajasekaran
This is why self-review does not work. You need structural separation. A dedicated evaluator, tuned to be skeptical, with concrete criteria to check against. The objection writes itself: if you cannot trust the model to implement correctly, why trust it to evaluate? Because it is the structural separation that prevents the bias, not a smarter judge. An evaluator in a fresh context, checking against concrete spec criteria, does not exhibit the same self-congratulatory pattern. The human’s job is writing those criteria.
This loop works at every scale. At the small end, a single engineer runs a few dozen turns on a feature, with a human stepping into the EVALUATE or PROPOSE phase whenever judgment is needed. At the large end, the same structure scales to fleets of parallel agents. Carlini’s compiler is this loop: 16 agents, 2,000 sessions, each one a turn through spec-implement-evaluate, with the GCC torture suite as the verification contract. The loop did not change. What changed was the number of agents running it, and the fact that a comprehensive enough test suite could stand in for human judgment. That is what full autonomy looks like: not a different loop, but one where the verification contract is strong enough that humans do not need to be in it.
Geoffrey Huntley’s “Ralph Wiggum Loop” was an early articulation of this pattern: pick a task, implement, validate, commit if passing, reset context. Progress lives in files and git, not in the model’s context window. What I add is structured handoff: the proposal artifact that carries not just “what to do next” but “what we learned” and “what surprised us” from the current turn, and when needed, a human-in-the-loop inspection of output to inform convergence decisions. Context loss at turn boundaries is a real failure mode. Rajasekaran identified “context anxiety,” where models prematurely wrap up work as the context window fills. The fix is not bigger context windows. It is architecture that assumes context will be cleared: progress written to disk, structured handoff artifacts between sessions, each session starting fresh with full access to prior state.
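The handoff artifact itself can be as simple as a JSON file. A minimal sketch, with illustrative field names (this is not a fixed schema):

```python
import json
from pathlib import Path

HANDOFF = Path("handoff.json")

def save_handoff(next_task, learned, surprised):
    """End of turn: persist what the next session needs to know."""
    HANDOFF.write_text(json.dumps({
        "next_task": next_task,    # what to do next
        "learned": learned,        # decisions worth preserving
        "surprised": surprised,    # anomalies the next turn should see
    }, indent=2))

def load_handoff():
    """Start of turn: re-ground from disk, not from conversation memory."""
    if HANDOFF.exists():
        return json.loads(HANDOFF.read_text())
    return {"next_task": "pick the first task from the spec",
            "learned": [], "surprised": []}
```

The `learned` and `surprised` fields are the structured part: without them, each fresh session inherits a to-do item but none of the judgment that produced it.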
Convergence is what happens when the review-fix-review cycle measurably produces fewer issues each iteration. A circuit breaker tells you when to give up, convergence tells you when you are done. The distinction matters: a converging system is producing value with each iteration. A system that hits its circuit breaker has failed to converge, and you need to understand why.
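The two conditions can be expressed as one check over issue counts per iteration. A sketch, assuming the review pass returns a list of open issues:

```python
def review_fix_loop(run_review, apply_fixes, max_iters=5):
    """Iterate review -> fix. Converged when the issue list empties;
    circuit-broken when the count stops shrinking or iterations run out."""
    prev = None
    for i in range(max_iters):
        issues = run_review()
        if not issues:
            return "converged", i
        if prev is not None and len(issues) >= prev:
            return "circuit_breaker", i  # not converging: stop and diagnose
        prev = len(issues)
        apply_fixes(issues)
    return "circuit_breaker", max_iters
```

The asymmetry is deliberate: convergence is measured (fewer issues each pass), while the circuit breaker fires on the first iteration that fails to improve.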
The Harness Problem
“Every component in a harness encodes an assumption about what the model can’t do on its own, and those assumptions are worth stress-testing.”
– Prithvi Rajasekaran
This is the most important sentence in the harness design literature. Every piece of scaffolding you build is a bet against model improvement. Some bets are good (verification will always matter). Some are bad (the model needs to be told how to structure a function).
Daniel Miessler sharpened this into a warning with his concept of the “BLE-hobbled system” (Bitter Lesson Engineering-hobbled): a system where scaffolding has aged past its usefulness and is now actively making the overall system worse.
This is not hypothetical. Lance Martin at LangChain described exactly this experience, watching a carefully designed multi-agent research system become a bottleneck as models improved. The structural constraints he had built around earlier model limitations, from avoiding tool calling to hard-coded agent decomposition, prevented his system from benefiting from newer capabilities as they arrived. The scaffolding he built to help was now the thing holding him back.
Boris Cherny, the engineer who created Claude Code at Anthropic, arrived at the same conclusion from the practitioner side. He adopted Sutton’s bitter lesson as a core design principle for the Claude Code team: bet on the general model, not on scaffolding around it.
Scaffolding might improve performance 10-20%, but those gains get wiped out with the next model generation.
– Boris Cherny, paraphrased
The reason those gains get wiped out is that coding model capability does not improve linearly. It arrives in step changes. The release of Opus 4 marked the discontinuity where Claude Code’s growth went exponential. The scaffolding you built to compensate for the old model’s weaknesses becomes the thing preventing you from benefiting from the new model’s strengths. Not gradually, but all at once when the next model drops.
My approach: build the minimal harness that provides verification, context continuity, and safety gates. Everything else is the model’s job.
Concretely, my harness consists of:
- A spec skill that defines acceptance criteria and manages turn transitions. This is the verification contract.
- Adversarial review skills (code review, security review) that evaluate output against the spec. These are the evaluator in the generator-evaluator pattern.
- A pre-push hook that blocks code from leaving the machine until review passes. This is the quality gate.
- Persistent review files (checked into git) that carry context across sessions. This is the inter-session memory.
- Minimal coding conventions that target specific failure modes (revert on regression, stop after two failed fix attempts, write tests in the same increment as functionality).
That is it. No elaborate prompt chains. No multi-step reasoning frameworks. No rigid agent orchestration graphs. The skills are Markdown files. The hooks are bash scripts. The conventions are plain text. If Claude Code gains a serious competitor or a different model pulls ahead, the work to port is swapping invocation syntax, not rethinking my architecture.
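To show how thin the quality gate is, here is a sketch of the pre-push check in Python (my actual hook is a bash script; the review-file layout and field names here are assumptions, not a real Claude Code interface):

```python
# Pre-push gate sketch: block the push unless the persistent review file
# records a passing verdict for exactly the commit being pushed.
import json
import subprocess
from pathlib import Path

REVIEW_FILE = Path("reviews/latest.json")  # hypothetical layout

def head_sha() -> str:
    """The commit a push would publish."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def gate(head: str) -> int:
    """0 = allow the push, 1 = block it."""
    if not REVIEW_FILE.exists():
        return 1  # nothing reviewed: block
    review = json.loads(REVIEW_FILE.read_text())
    if review.get("verdict") == "pass" and review.get("commit") == head:
        return 0
    return 1  # failing or stale review: block
```

Installed as `.git/hooks/pre-push` with `sys.exit(gate(head_sha()))` at the bottom, a nonzero exit makes git refuse the push. The commit check matters: a review of yesterday’s HEAD must not green-light today’s.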
The harness is deliberately minimal because the bitter lesson says it should be. The “Building Effective Agents” guide makes the same argument from the framework side: start with the least complex agent pattern that works, add structure only when earned by real failure modes. The models will get better. The verification will still matter. The scaffolding in between should be as thin as possible so you benefit from improvements you did not anticipate.
Where This Actually Works
Philosophy is cheap. I wanted to know if these principles hold up when applied to a real project over multiple iterations.
I have been using this system to build a diffusion-based handwriting style-transfer pipeline: a computer vision project that takes a photograph of someone’s handwriting and generates new text in that style. The pipeline has seven sequential stages and uses multiple ML models. It is the kind of project where a threshold change in one stage cascades through downstream stages in non-obvious ways, where automated metrics and human perception diverge, and where you need human-in-the-loop data during the development cycle.
This is a single-engineer project, not a production system at scale. Carlini’s compiler is the evidence that verification-first principles hold at 100,000 lines and 16 parallel agents. What my project tests is the methodology: whether spec-driven turns, adversarial review, and structured handoff hold up across a real ML development cycle where the complexity is in cascading dependencies and metrics ambiguity.
The project has gone through multiple complete turns, each driven by a spec with concrete acceptance criteria like “height outlier ratio below 0.15” and “OCR accuracy above 0.7 on curated hard words.” The spec-driven approach forced me to define what “better” means numerically before implementing changes. That discipline is what prevented the kind of drift that kills complex ML projects: where you keep tweaking parameters and think things are improving because the output looks different, not because it is measurably better.
The adversarial review layer caught real problems, like a post-processing defense layer that was clipping ink from the right edge of generated words, visible only in diagnostic output that the review process forced me to create. Without structured review criteria anchored to a spec, that bug would have survived as “sometimes the output looks a little off.”
What I Think Is Durable
Some of what I have described will be obsolete in a year. The specific model choices, the exact cost curves, the particular failure modes that my coding conventions target.
Here is what I think survives:
Verification as the ceiling. This is Carlini’s insight and it is structural. No matter how good the model gets, you cannot trust output you cannot verify. The investment in test suites, review mechanisms, and quality metrics will compound indefinitely.
Spec as the control mechanism. Agents need concrete acceptance criteria. The form may change, but the need for a human-legible verification contract that defines “done” will not go away. The spec is where human judgment enters the loop. The human writes the spec (the 20% that requires judgment). The agent implements, reviews, and iterates (the 80% that benefits from compute). The verification loop is the interface between them.
Turn-based iteration with structured handoff. Context windows will grow, but the fundamental problem of context degradation over long sessions is architectural, not just a capacity limitation. Periodic resets with deliberate transfer of what was learned will remain necessary for complex projects.
Effort-aware compute. The spread between cheap and expensive models will widen. Carlini’s compiler cost $20,000, a number that will be dramatically lower within a year or two. Systems that match effort to task difficulty, escalating when the review loop detects stalls, will outperform systems that use a single tier.
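A minimal sketch of effort-aware escalation, assuming a fixed ladder of model tiers and a turn runner that reports how many issues remain open (tier names are illustrative):

```python
# Effort-aware compute sketch: start cheap, escalate on stall.
TIERS = ["cheap", "mid", "frontier"]

def run_with_escalation(run_turn_at, max_turns=10):
    """run_turn_at(tier) runs one turn and returns the open-issue count.
    Escalate one tier whenever the review loop detects a stall."""
    tier, prev = 0, None
    for _ in range(max_turns):
        issues = run_turn_at(TIERS[tier])
        if issues == 0:
            return "done", TIERS[tier]
        if prev is not None and issues >= prev:
            tier = min(tier + 1, len(TIERS) - 1)  # stall: buy more compute
        prev = issues
    return "stuck", TIERS[tier]
```

The same stall signal that trips the circuit breaker in a single-tier system becomes, here, the trigger for spending more per token instead of giving up.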
Minimal harness design. Do not over-engineer the scaffolding. Build for verification and context continuity. Let the model handle everything else. Stress-test your assumptions regularly, and remove scaffolding when the model no longer needs it.
I think about autonomy as a spectrum: supervised (human reviews everything), gated (automated review must pass before code leaves the machine), autonomous (review-fix loops run without human intervention), and multi-agent (parallel agents with shared verification state). My system is currently at gated, moving toward autonomous. The architecture for autonomous operation is also the architecture that makes supervised and gated operation better. Every investment in verification, spec quality, and structured iteration pays off today, even before the loop closes.
I think the endgame is systems that design, deploy, and self-optimize entire fleets of coding agents. I write more about that in Agent-Hypervisors.
Where This Fails
- Bad specs create false confidence. A verification loop is only as good as the criteria it checks against. If your acceptance criteria miss the point, the system converges confidently on the wrong target.
- Evaluation loops can overfit to measurable criteria. The things you can measure are not always the things that matter. The handwriting pipeline taught me this: CV metrics said height consistency was fine while human review said sizes varied 2.8x.
- This method is overkill below the one-shot ceiling. If the model can handle your project in a single pass, adding specs and turns and review loops is pure overhead. Know where the ceiling is before you build the scaffolding.
- Domains with weak verification remain bottlenecked by human judgment. The methodology assumes you can define “done” in checkable terms. For work where quality is subjective (UX, creative writing, design), the verification loop shrinks to “a human looks at it,” and the gains are smaller.
What It Actually Feels Like
I promised this post was about the bitter lesson, so let me tell you what the bitterness actually feels like.
My journey through AI-assisted coding was probably typical: copy-paste from ChatGPT. Tab-completion in Cursor. Having the agent write code, then driving all code creation through prompts. But I was still in an IDE, so at any given moment I could review and modify the code directly. I was still a programmer, just a faster one.
Then I switched to Claude Code, a terminal interface. No IDE. No syntax-highlighted editor pane. Just a prompt and an agent. Same model I had been using in Cursor, and basically the same capabilities. But it felt completely different. In the IDE, I had the illusion of control. In the terminal, I was typing instructions and trusting the agent. One part of me thought: I have been shipping software for thirty years and now I am typing wishes into a terminal. Another, deeper part thought: can this stuff really work?
For simpler projects, a frontier model can take you remarkably far without methodology. For complex projects, you need the methodology: specs, acceptance criteria, turns, convergence checking. Without it, the agent produces impressive-looking output that quietly drifts from the goal. With it, the agent builds something that actually works. And this cycle will repeat with every step-function in model capability: yesterday’s “needs methodology” becomes tomorrow’s “just ask the model,” and the frontier of what requires structured verification moves outward.
But the real punchline, the thing I gained that I did not expect, was a view of how all of this will actually go: more and more end-to-end complex solutions like Carlini’s compiler. Getting there requires internalizing the bitter lesson, letting go of writing the code and investing in defining the outcomes, trusting the verification loop instead of your eyes on the diff.
I am a real software engineer. Using these tools beyond the level of a vibe coder requires me to be one. For now. The spec does not write itself. The acceptance criteria do not emerge from thin air. The judgment about when the system is converging versus spinning, about what the next turn should target, about which architectural bet to make, which tech and platform to select, that is engineering. It is just a different kind of engineering than the one I spent thirty years learning.
That is the bitter lesson. It is bitter because the old skill mattered and was hard-won, and it is a lesson because the new skill matters more.
Start Here
If you want to try this approach, here is what I would do Monday morning:
- Write a spec before you start your next feature. Not a design doc. A list of concrete acceptance criteria that define done. Make them checkable: “OCR accuracy above 0.7,” not “improve text recognition.”
- Separate generation from evaluation. Even if you are using a single model, do not let the same session that wrote the code judge the code. Run a review pass with adversarial criteria. Check against the spec.
- Make progress survive across sessions. Write your findings to disk. When the next session starts, it should be able to pick up from a file, not from your memory of what happened.
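Putting the three steps together: a spec can be nothing more than a handful of checkable criteria plus a function that reports which ones the current metrics fail. A sketch, with thresholds echoing the handwriting example (names and structure are illustrative):

```python
# A spec as a verification contract (sketch).
SPEC = {
    "goal": "improve handwriting style-transfer quality",
    "criteria": [
        {"name": "ocr_accuracy",         "op": ">", "threshold": 0.70},
        {"name": "height_outlier_ratio", "op": "<", "threshold": 0.15},
    ],
}

def check(metrics, spec):
    """Return the names of criteria the current metrics fail."""
    failed = []
    for c in spec["criteria"]:
        value = metrics[c["name"]]
        ok = value > c["threshold"] if c["op"] == ">" else value < c["threshold"]
        if not ok:
            failed.append(c["name"])
    return failed
```

An empty return is “done”; a non-empty return is the agenda for the next turn. That is the whole interface between human judgment and the loop.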
zat.env is a minimal harness for reliable autonomous coding with Claude Code on Linux. Verification quality, not prompt engineering, determines the ceiling on what agents can build. The repo implements the thinnest layer that matters: specs as the control plane, adversarial review as the verification loop, persistent artifacts as inter-session memory, and a pre-push gate that blocks unreviewed code from leaving the machine. Skills are Markdown. Hooks are bash. Conventions are plain text. The whole thing is designed to be replaced by something better, which is kind of the point.
References:
- Sutton, R. (2019). The Bitter Lesson. The foundational argument that general methods leveraging computation beat hand-engineered domain knowledge.
- Martin, L. (2025). Learning the Bitter Lesson. A detailed case study from LangChain of watching rigid agent scaffolding become a bottleneck as models improved.
- Miessler, D. (2026). Bitter Lesson Engineering. Coined “Bitter Lesson Engineering” and the concept of a “BLE-hobbled system.”
- Carlini, N. (2026). Building a C Compiler with a Team of Parallel Claudes. Anthropic Engineering. 16 parallel Claude agents produce a 100K-line compiler.
- Rajasekaran, P. (2026). Harness Design for Long-Running Application Development. Anthropic Engineering. Generator-evaluator architecture for extended autonomous sessions.
- Cherny, B. (2026). Head of Claude Code: What happens after coding is solved. Lenny’s Podcast. The creator of Claude Code on adopting Sutton’s bitter lesson as a core design principle.
- Anthropic (2024). Building Effective Agents. Foundational taxonomy of agent design patterns.
- Osmani, A. (2026). The 80% Problem in Agentic Coding. The gap between what agents generate rapidly and the remaining work that requires human judgment.
- Huntley, G. (2025). The Ralph Wiggum Loop. Stateless iteration with file-based continuity.