First Fable

Audience: Software engineers building toward autonomous coding loops.
Reading time: ~16 minutes.

TL;DR. Every agentic developer I know has a story about Opus 4.6. It was the release that reset what agentic coding could be, a plateau you had to stand on to understand. For me, it’s most of the reason this blog exists. So when Fable 5 arrived under even bigger hype (the reporting had it briefly blocked from release as too dangerous to ship, with the Mythos family going to security researchers first so online defenses could keep up), the question that actually interested me wasn’t whether it was good: it was whether this was another 4.6.

I slept on it. The answer is yes but no. Fable is a clear step up from Opus 4.8, and if the economics worked I’d make it my daily driver tomorrow without a second thought. One stretch of it, watching it rebuild Zork from scratch on my own small engine, did feel kind of magical. But a magical afternoon isn’t the ground moving under the whole field, and that second thing, the 4.6 moment, wasn’t here. Expecting something of that magnitude was probably a mistake in my own thinking. A step up is not a step function. Unless telling those two apart is above my skill level, which is the one thing this whole post can’t quite rule out.

Worth spelling out why 4.6 hit like that, because it sets the bar for everything below. I’d loved the frontier models before it and used them every day, so this isn’t nostalgia. 4.6 was different in kind. It was the first release where it felt like a solo developer with a decent harness and good habits could sit down and build more or less anything, and have it actually hold together. The nearest word I have for how that felt is near-religious, and going by everyone I talk to, I’m not reaching for it alone.

In April I published The Bitter Lesson of Agentic Coding. Two of its claims were really predictions dressed up as claims. First, that coding capability doesn’t improve smoothly, it moves in steps. Second, that when it steps, part of your scaffolding gets wiped out (whatever you built to prop up the old model) and part of it keeps paying off (verification, specs, the turn loop). I believed both when I wrote them. The first one I’d already lived through: 4.6 was a step if there ever was one. What I hadn’t done was watch a jump land under controlled conditions, everything else held still and one variable moved.

Then on July 2, Fable 5 showed up in my CLI. It’s Anthropic’s first Mythos-class model, a notch above Opus, and it landed at an almost suspiciously convenient moment. I had a project sitting at a clean stopping point, waiting for exactly this kind of test. Daydream is a small multiplayer web game I’ve been building one reviewed increment at a time, and the platform underneath it was, on purpose, overbuilt: about twenty-two thousand lines of tests against fifteen thousand lines of actual engine, golden baselines for everything the local models generate, and one headline feature that the README described in detail and that I’d deliberately left unbuilt. An empty chair, basically, waiting for a bigger model to sit down in it.

One thing to pin down first, because it threads through everything below: daydream runs on two very different models. A frontier model does the design-time work, the planning and specs and engine-building, the actual agentic coding. That’s the one I’m swapping. The other is a small local model that runs at game time on the box’s own GPU, a Qwen 7B, and it does the live language work while people play: it reads what a player types, writes the dialogue and narration back, and composes the prompts that turn each room into a watercolor. The 7B never changes across this experiment. Same modest on-device model before Fable and after. So as a rule of thumb, when I say something stepped I mean the design-time model, and when I say something didn’t move I usually mean the 7B.

So instead of just upgrading and moving on, I treated the switch as an experiment. Nothing else changed: same repo, same me, same bare-bones harness. I dropped Opus 4.8, brought in Fable 5, left effort maxed on both, and wrote down what I expected to happen before each turn so I couldn’t fudge the grading afterward. The blow-by-blow lives in FIRST-FABLE.md in the daydream repo, one long day of append-only notes. What follows is what I take from it.

Here’s the part that stuck with me. Midway through the day, with every test green and a playtest behind me, I’d written down that I wasn’t convinced this was a real magical step-function. By that evening I was calling it “pretty amazing.” Same project, same afternoon, and my read on it moved a long way. How impressive Fable is in the abstract isn’t something I can settle for anyone else. What I can do is show what the swap did and didn’t prove about the way I work, and let the evidence decide the 4.6 question. Thinking Depth Regression was this harness getting tested by a model that quietly got worse. This was the same setup with a model that suddenly got better.

One Variable

Any experiment like this is only as good as its control, so let me be clear about what was held fixed.

Every daydream increment before this had run through the same loop with Opus 4.8 in the driver’s seat: plan mode, then /spec to write an acceptance contract, then small increments (each with its own tests) and an adversarial review pass before anything left my machine. Those Opus turns were genuinely good. v0.3.0 landed 18 of 19 acceptance criteria across ten increments; v0.4.0 went eight for eight. But I steered them. I answered the model’s questions, nudged it when it aimed too low or too high, and now and then caught an increment that was shaped wrong before it got worse. My own retrospective from that era puts it bluntly: target selection is the whole game. The harness was good at checking work. Picking the right big thing to build, at the right altitude, is where I was still doing the driving.

One thing I can say for certain is that the harness wasn’t quietly tuned for the new model, because I can go check. zat.env’s most recent commit is June 29, before Fable ever reached me. Grep the whole history and you won’t find the word “Fable” or the string “Opus 4.8” anywhere in it. The effort settings were last touched on April 17 and still say, in as many words, that they’re tuned for Opus 4.7. The harness had no idea what was coming.

The other thing I did was write the predictions down first. Before each turn I put the predictions and the measures into FIRST-FABLE.md, and I put the ways I could be proven wrong in there too. I defined up front what “no step function” would look like: the spec needing real rework once coding started, more than two fix cycles on any single criterion, a BLOCK in review, the riskiest feature shipping at the bottom of its fallback ladder, or me having to steer the implementation the way the old retrospective describes.
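Those falsifiers are concrete enough to express as a check. What follows is a minimal sketch, assuming a hypothetical `TurnRecord` shape; the field names are illustrative and are not the format FIRST-FABLE.md actually uses.

```python
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """Hypothetical summary of one turn; field names are illustrative."""
    spec_reworked: bool        # spec needed real rework once coding started
    max_fix_cycles: int        # worst fix-cycle count on any single criterion
    review_blocks: int         # number of BLOCK findings in review
    shipped_bottom_rung: bool  # riskiest feature shipped at its last fallback
    operator_steered: bool     # the operator had to steer the implementation


def tripped_falsifiers(turn: TurnRecord) -> list[str]:
    """Return the name of every pre-registered falsifier this turn tripped."""
    checks = {
        "spec rework": turn.spec_reworked,
        "more than two fix cycles": turn.max_fix_cycles > 2,
        "BLOCK in review": turn.review_blocks > 0,
        "bottom-rung ship": turn.shipped_bottom_rung,
        "steered turn": turn.operator_steered,
    }
    return [name for name, tripped in checks.items() if tripped]
```

An empty list means the turn survived every falsifier. The value of writing it this way is that the thresholds are fixed before the turn runs, so they can't drift during grading.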

And I picked the sharp question on purpose. Whether the model could implement a spec was already settled; Opus 4.8 could do that. What I actually wanted to know was whether judgment scales. Hand it one open-ended prompt and does it pick the right ambitious target, design it well against real constraints, and hand back work clean enough that the rest of the loop barely needs me?

Turn One: Dreamseeds

Here’s the whole prompt I gave it to start planning, word for word:

Examine the state of this project, noting the aspirational parts of README.  Design a
meaningful and sizable improvement that will advance our aims to be passed to /spec

Use your judgement on approach, alignment, and impact.  Be creative.

In doing this, examine all docs and backlogs and consolidate, modify, or delete as you
see fit to support future work.  We've just gotten to a major increment (basic game,
basic UI, key "semi-procedural" insight in README.md), and it's time to flex the
foundation a bit more.  Consider everything from stuff that will add to the
storytelling to infrastructure, to a mix of this and other stuff.  Ask questions as
needed.

No file paths, no hints about what to build. What came back was the thing I’d been curious about. It sent a couple of subagents through the code and docs, came back with six pieces of drift nobody had logged (two of them systems that were quietly dead in production: voice pools wired to NPCs that no longer existed, and an NPC memory table sitting at zero rows), and proposed a feature I’d been calling Dreamseeds: a quest-earned seed a player plants to grow one new, model-written room inside authored guardrails, permanent, for everyone. This wasn’t a safe little adjacent feature. It was the exact thing the README had been pointing at for months, sitting behind hooks I’d documented and never built. I told it to go ahead, unchanged.

Then the numbers came in, and they were better than the Opus baseline had led me to expect.

I cleared the context and started a fresh session to actually build it. That session never saw the planning conversation, rebuilt everything it needed out of SPEC.md, and shipped every design decision as written, no amendments. My total input to it was the word “implement,” plus one “continue” after an outage. No corrections, no course changes.

The outage is the strangest part. For about half the session the permission classifier that approves shell commands (Claude’s own infrastructure, not my box) was down, so nothing could run. Instead of stalling, the session wrote seven increments and around ninety tests completely blind, then ran the whole batch the moment commands came back. First try: 74 of 74 new tests passed, then all three test tiers, with nothing to fix afterward. Earlier turns had leaned on a tight write-run-fix rhythm. This one had that rhythm taken away and didn’t seem to miss it.

Review still earned its place. The first /codereview pass came back 0 BLOCK, 1 WARN, and the WARN was real: a cross-feature interaction that could drop a player’s WebSocket connection, which ninety feature tests had sailed right past and one skeptical read caught. And the riskiest thing I’d predicted, whether the little local 7B could compose a decent room inside scaffolding the frontier model wrote for it, worked on the first attempt. I’d budgeted two rounds of prompt tuning for that. It needed zero.

The line from my notes still fits: it didn’t feel like speed and it didn’t feel like magic. What it felt like was the friction going out of the judgment layer, the part where I usually have to lean in.

Green Is Not Good

Then I actually played the game. On paper everything was green: three test tiers, GPU render probes, two adversarial reviews, even a full end-to-end playthrough the session had driven itself over a WebSocket. Then, minutes into playing, I talked to an NPC and the game told me, “A soft smile plays on your lips as you wave back.” I hadn’t smiled at anything. The dialogue code was told in one place to narrate the player in the second person and in another that it was the NPC, so it did the logical thing and put the NPC’s smile on my face.

Six findings came out of that one sitting. The two deep ones (the smile, and NPC memory that had never once fired in production because two naming conventions didn’t line up) had been in the code since the Opus era, through every green run and both adversarial reviews. The only new thing was that someone finally played. The other four were mine, from this turn: nothing broken, just moments I hadn’t imagined hard enough, like a quest item that appeared with no text or a seed phrase that clearly said “down” opening an exit east.

So that afternoon I wrote down, in the same file as the predictions, that I was not convinced this was truly a magical step-function.

That skepticism earned the best line in the whole document: green is not good. Every verifier caught every structural problem and none of the experiential ones. Code that’s correct on the first pass isn’t automatically any good on it, and the only thing that noticed the gap was a person playing for fun. I’d written in April that weak-verification domains stay bottlenecked on human judgment. What I’d missed is that this isn’t only true of obviously subjective work. It lives inside a heavily-verified project too, right at the end, where correct has to turn into good.

The recovery is the other half, in fairness. The fix round ran as smoothly as the build: six complaints in, and what came back was real forensics against the live database (it found the smile bug verbatim in the event log and solved a mystery I’d pre-registered days earlier, why the memory table held exactly one row), fixes placed at the right depth rather than six patches, a regression test for each, and no follow-up corrections. Criticism went in as cleanly as a spec. The bottleneck had just moved, from whether the thing gets built right to whether it feels right, and that judgment still sits with a person.

Turn Two: Zork as the Oracle

Dreamseeds might have been well within what Fable could do. I suspected as much at the time, and I ended the playtest round by promising myself a harder target. For the next turn I picked it deliberately.

The idea was to run Zork I, the 1980 Infocom classic, on daydream’s engine, and to do it without hardcoding any Zork into the engine at all. The engine would grow generic capabilities (seven new modules: world state, a rule engine, world verbs, a clock with fuses and daemons, lighting, combat, and an LLM retell layer, the one spot where the local 7B gets to touch the Zork text), and Zork itself would show up as data, a 110-room, 120-object world written entirely as JSON. Sixteen acceptance criteria. And then, as a check, the actual original game wired in alongside: the real z-machine running under a plain interpreter, walking the same solution, required to agree with my engine on room, score, and inventory at every step. That’s the top tier of verification the April essay talks about, an oracle, ground truth you can diff against. Daydream had been getting by on proxies and critics until now. This added the real thing.
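The lockstep comparison against the original game can be sketched as a small differential test. This is a sketch under assumptions: `Snapshot`, `Stepper`, and `first_divergence` are hypothetical names, and each stepper stands in for whatever actually drives the engine or the interpreter.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    """The three facts both sides must agree on after each command."""
    room: str
    score: int
    inventory: frozenset[str]


# A stepper takes one command and returns the state afterwards.
Stepper = Callable[[str], Snapshot]


def first_divergence(
    commands: Iterable[str], engine: Stepper, oracle: Stepper
) -> Optional[tuple[int, str, Snapshot, Snapshot]]:
    """Feed both sides the same commands; report the first disagreement."""
    for index, command in enumerate(commands, start=1):
        ours, theirs = engine(command), oracle(command)
        if ours != theirs:
            return index, command, ours, theirs
    return None  # walked the whole solution in agreement
```

Reporting the first divergence, not just pass or fail, is what makes an oracle useful for debugging: it names the exact command where the two implementations part ways.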

I’ll keep my pre-registered claim with its hedge attached, because the hedge matters: I wrote that this “probably wouldn’t be possible in Opus 4.8 without significant manual work, multiple turns, and a very hands-on multi-day approach.” I never re-ran the turn on Opus and I never will, so that isn’t a controlled comparison, just a claim measured against the steered, hands-on Opus turns already in this repo’s history. And the honest prediction I put next to it was two to four sessions, not one. I wasn’t going to pretend I expected another single-session miracle.

It took two sessions. My entire steering input across both was “implement,” “implement,” and one “continue” after another infrastructure hiccup. Twenty-eight commits, each green, closed 14 of the 16 criteria; the last two are gated on things only I can do (placing two files for the oracle run, and finishing the playthrough myself), and the record leaves them open rather than rounding up. The committed walkthrough reaches exactly 350 points and wins the game with a spy in the tests that fails the whole suite if a single LLM call sneaks in, so I know the state machine gets there on its own. Then the same run, replayed live against the running server with the models up, finished at 350 in 106 seconds, rooms painting themselves as I passed through. The big review pass was the largest this repo has ever had, 79 files and about 26,000 lines added, and it came back with no BLOCKs and no WARNs.
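One way to build a spy like that is a tripwire patched over the model client for the duration of a test. This is a minimal sketch, not the repo's actual test: `LLMClient` and `forbid_llm_calls` are hypothetical names, and only `unittest.mock.patch.object` is a real library call.

```python
from contextlib import contextmanager
from unittest import mock


class LLMClient:
    """Stand-in for whatever wraps the local model; hypothetical."""

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("would call the local 7B here")


@contextmanager
def forbid_llm_calls():
    """Fail loudly if anything inside the block reaches the model."""
    def tripwire(self, prompt):
        raise AssertionError(f"unexpected LLM call: {prompt[:40]!r}")

    # Patch at class level so every instance is covered, then restore on exit.
    with mock.patch.object(LLMClient, "complete", tripwire):
        yield
```

Wrapping the deterministic walkthrough in `with forbid_llm_calls():` turns "the state machine gets there on its own" from a belief into something the suite enforces.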

My favorite piece of it is a test that enforces something odd: there are zero Zork-specific strings anywhere in the engine code. It’s a real test, not a grep I run when I remember to, and it has teeth. It caught the session naming its own internal docstrings after the turn, and it caught the one genuinely dumb mistake of the day, a careless find-and-replace where the word “troll” hiding inside “controlled” briefly turned 328 tests red. The engine that runs Zork has no idea Zork exists, and that’s the entire platform argument, written as a passing test.
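A test like that, and the whole-word matching the "troll" inside "controlled" incident calls for, might look roughly like this. It is a sketch under assumptions: the banned list and both function names are hypothetical, and the real test may work differently.

```python
import re

# Hypothetical banned list; the real test presumably has its own.
BANNED = ("zork", "troll", "grue", "thief")

# Lookarounds reject matches with a letter on either side, so "controlled"
# is safe while an identifier like "troll_room" is still flagged.
_PATTERN = re.compile(
    r"(?<![A-Za-z])(" + "|".join(BANNED) + r")(?![A-Za-z])", re.IGNORECASE
)


def game_specific_hits(source: str) -> list[tuple[int, str]]:
    """Return (line number, matched word) for every banned word in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group(1).lower()))
    return hits


def rename_word(source: str, old: str, new: str) -> str:
    """Whole-word rename; a plain str.replace would also rewrite 'controlled'."""
    pattern = rf"(?<![A-Za-z]){re.escape(old)}(?![A-Za-z])"
    return re.sub(pattern, new, source)
```

The trade-off of letter-only lookarounds is that plurals such as "trolls" slip through, so a stricter list would need those spelled out.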

The oracle paid off before I ever ran it. The session pulled the original ZIL source (Infocom open-sourced it) and used it as ground truth while authoring, which surfaced exactly 110 rooms, five of which I’d have forgotten from memory, and confirmed the scoring arithmetic independently: 143 points for taking treasures, 129 for depositing them, 78 in room bonuses, adding up to the 350 I was expecting. A fiddly detail about a candle timer, the thing that fixed a late puzzle, came straight out of reading the 46-year-old code rather than testing my own.

It wasn’t clean the whole way. A descent without a lamp fed me to a grue. The thief’s dice killed the walkthrough twice before the turn worked out his exact death-roll timing. A collision between two state keys quietly ate one command. The live rehearsal desynced at command 151. But every one of those got caught, diagnosed, and pinned down with a regression test by tooling the same turn had built a few hours earlier, and I mostly learned about them by reading commit messages after the fact. The phrase I used in the notes is the honest one: this was a real jump in autonomy and holding-power, not in getting everything right the first time.

Then I played it. What came out of my mouth, unprompted, was “Okay, my initial impression is that this is pretty amazing (how you applied Zork to this project).” Same person who’d refused to be impressed a few hours earlier. And even then, ten minutes of playing turned up five more experience-level problems no test had caught, so the skeptic in me wasn’t out of a job. I’ll come back to those.

What the Harness Did

The April essay’s bet was that a thin harness with heavy verification is the part of your setup that survives a model jump. Here’s what actually carried the weight.

The tiered test gate is why twenty-eight commits landed green with no one watching (a pre-commit hook runs the fast tests every time). The drift-golden habit, baselines committed for everything the local models generate, is why this turn’s new quality checks (a retell probe, image anchors, a parser corpus) were an afternoon’s work, not new infrastructure. Spec-as-contract is why a /clear between sessions cost nothing: session two rebuilt from SPEC.md, a memory file, and the files on disk, then wrote seventy-five more rooms without amending a decision. The review-and-marker machinery produced a recorded zero-BLOCK close with no human in the loop. And writing predictions down first, append-only, is the only reason I can grade any of this instead of vibing about it.
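The drift-golden habit reduces to one small function once the baselines live in the repo. A minimal sketch, assuming JSON-serializable output and a hypothetical `check_golden` helper; exact matching is the simplest possible comparison, and a real check on model output would likely need a tolerance.

```python
import json
from pathlib import Path


def check_golden(name: str, actual: dict, golden_dir: Path, update: bool = False) -> bool:
    """Compare generated output to its committed baseline.

    Returns True when they match. With update=True, or when no baseline
    exists yet, writes the baseline so the change shows up as a diff.
    """
    path = golden_dir / f"{name}.json"
    # Sorted keys and fixed indentation keep the file stable across runs.
    rendered = json.dumps(actual, indent=2, sort_keys=True) + "\n"
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(rendered, encoding="utf-8")
        return True
    return path.read_text(encoding="utf-8") == rendered
```

Because a new probe is just another call with a new `name`, adding a check costs a baseline file and a review of its first diff, which fits the claim that new quality checks were an afternoon's work.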

There’s almost nothing disposable in zat.env. I never built prompt chains or hardcoded decompositions to cover for the old model, so when the bigger one showed up there was nothing in its way to knock over. The line I keep from the notes: the rails I built to babysit a weaker model turned out to be the instruments a stronger one plays.

Two outside facts back this up. One is the git history: the harness didn’t change for Fable, before or during. The other still makes me laugh. On May 7 I adopted a v2.0 spec for zat.env’s own autonomy, the /loop work, my roadmap for running the review-fix loop without me. On June 11 I shelved it, zero of six criteria, no code. Three weeks later the Fable turns ran with almost no steering anyway, on unchanged rails. zat.env’s README has said for years that what changes over time isn’t the architecture, it’s how much autonomy you get. I planned to build the autonomy, gave up, and the model handed it to me through the pieces I’d already shipped. That’s the bitter lesson of agentic coding landing on my own to-do list.

One thing cut the other way, and it’s worth being exact about, because it was the only real wall all day. Not the model, the environment. The permission classifier that clears shell commands (itself a model, on someone else’s servers) went down for a long stretch of turn one, and later the sandbox refused to compile the frotz interpreter I’d already approved, deferring the oracle’s final run to a one-command manual step. When the model stops being your bottleneck, everything around it lines up to take a turn.

What Did Not Move

A few things pointedly did not move, and they’re as much the story as the parts that did.

The design-time step doesn’t reach the runtime, and that boundary is the most interesting thing the turn drew. Daydream’s premise is that a strong model reaches players through pre-baked scaffolding, not by running live: Fable authors better exemplars and prompts; the 7B performs inside them. Fable did write tighter scaffolding, and the 7B stays visibly more on-rails for it, but that buys reliability, not a better writer. Its prose ceiling is fixed and the swap never comes near it. The 7B’s runtime role even shifts between the two turns. In Dreamseeds it generates whole new rooms from a player’s phrase. In Zork it has a narrower role: a retell layer that takes an authored line and lightly rephrases it, so a room you pass through twice doesn’t read word-for-word identical. The Zork prose is fixed by design, it has to be Zork, so here the 7B is paraphrasing, not inventing, and that’s exactly where its ceiling shows. I shipped retell scoped down rather than fully on, because the little model gilds Zork’s flat, dry voice into mush:

authored: “In the corner of the room on the ceiling is a large vampire bat who is obviously deranged and holding his nose.”

retold: “A large vampire bat is noted to be perched upon the ceiling in a corner of the chamber, its demeanor manifestly agitated as it clutches at its nasal region.”

The joke dies on contact. Two guardrails made it shippable (the authored line always goes first; the prompt bans fancier synonyms), but the point holds: a stronger authoring model makes the pre-baking premise more valuable and does nothing for the model performing live.
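Those two guardrails can be sketched in a few lines. This rests on assumptions: the prompt wording is invented, `describe` and `paraphrase` are hypothetical names, and "goes first" is read here as "the first visit shows the authored text verbatim", which is one interpretation of the rule.

```python
from typing import Callable

# Hypothetical prompt wording; the real prompt is not quoted in this post.
RETELL_PROMPT = (
    "Lightly rephrase the line below. Keep it plain and dry. "
    "Do not swap in fancier synonyms.\n\nLine: {line}"
)


def describe(line: str, visits: int, paraphrase: Callable[[str], str]) -> str:
    """First visit gets the authored line verbatim; later visits may retell.

    `paraphrase` is whatever sends a prompt to the local model and returns
    its text. Falls back to the authored line if the model returns nothing.
    """
    if visits == 0:
        return line  # guardrail 1: the authored line always goes first
    # Guardrail 2 lives in the prompt itself: no fancier synonyms.
    retold = paraphrase(RETELL_PROMPT.format(line=line)).strip()
    return retold or line
```

Neither guardrail makes the small model write better. They only bound how much of its output a player sees and how far it can stray from the authored voice.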

The other thing that didn’t move is me. I stayed necessary at two spots: writing the intent (the spec-level ask, plus one call during Zork planning that the game had to actually be Zork, or the oracle meant nothing) and judging whether it felt right. Both are judgment, not process. I’d predicted my playthrough would surface at least three experience problems no automated check caught; it found five in ten minutes. The best: every character’s movement narrated to me as “you go west,” including other players leaving the room, a purely multiplayer bug no solo test, no oracle, and no 380-command rehearsal could ever have seen, because in all of them I was alone in the world.
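The movement bug is easy to state in code: second person is only correct when the viewer is the one who moved. A minimal sketch with hypothetical function names, not the engine's actual narration path:

```python
def narrate_move(actor: str, direction: str, viewer: str) -> str:
    """Use second person only when the viewer is the one who moved."""
    if actor == viewer:
        return f"You go {direction}."
    return f"{actor} goes {direction}."


def broadcast_move(actor: str, direction: str, occupants: list[str]) -> dict[str, str]:
    """Build one message per player in the room, each from their own viewpoint."""
    return {viewer: narrate_move(actor, direction, viewer) for viewer in occupants}
```

A test with a single occupant can never tell these two branches apart, which is why every solo check stayed green. It takes at least two names in `occupants` to exercise the third-person path.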

The Scorecard

Tallied against the two things I started with, my own essay and my own harness:

Most of it held. The mechanical step-change showed plainly: same rails all day, an 8-criterion feature in the morning and a 16-criterion, seven-module, 110-room one by evening, the difference being how much I could push through the same machine rather than anything I rebuilt. (None of it was one-shot, so this says nothing about the state of one-shot generation; what stepped is how much work the band above it absorbs per prompt.) The spec held as the control surface at 8 criteria and again at 16, unamended both times. Two /clears cost nothing. Nothing got wiped out, because I’d never built the disposable kind of scaffolding, so the durable kind just compounded. The oracle beat the proxies and critics and paid off at authoring time, before it ever ran. And zat.env’s own claim, that verification quality and not prompt engineering sets the ceiling, reads like a caption for the day: I added nothing to the harness and had nothing compensatory to delete.

Two things came out more complicated. First, the word “step” itself. My essay treats capability as either smooth or stepping, and after 4.6 and now Fable I don’t think that’s granular enough: a step up, a real increase in what fits through the harness, is a different animal from a 4.6-class step function that lifts the whole floor. The binary flattens a distinction that turns out to matter. Second, “verification is the ceiling” needs a qualifier I didn’t give it in April. It’s the ceiling on correctness. Felt experience is a second ceiling above that one, and no amount of thicker verification reaches it, which is what a fully green game narrating a smile onto my face was telling me. The bottleneck doesn’t vanish, it relocates. (A smaller one: my autonomy ladder reads like something you climb by building the next rung, but this rung arrived as capability while my config still said “gated.”)

Nothing flatly contradicted the essay, which is a suspicious thing to report about your own theory, so: this is n=1, one operator, one project, my own harness, some grading done by the model family under test. What keeps it honest is the method, the predictions and the falsifiers written down first (a BLOCK, a spec rework, a bottom-rung ship, a steered turn, none of which happened), all in append-only git, plus an operator who spent half the day unimpressed. The closest thing to a contradiction points back at me: I wrote that the bottleneck is never the model but how you hold it, and after the jump the two things that stopped me cold were an outage and a safety gate, neither the model nor my grip on it. The essay has nothing to say about the layer around the model. It should.

Was It Another 4.6?

Start with the smaller question, the one the essay actually asked. It predicted that the scaffolding you build for an old model’s weak spots becomes the thing blocking the new model’s strengths, all at once, when the next model lands. Nothing blocked here, because I’d never built that kind of scaffolding. The load stepped up and the rails held, which is about the best result a single day of evidence could give the verification-first bet.

The bigger question is the one I lost sleep over. After the Zork turn I’m sure Fable is a real step up from Opus 4.8, and if the economics worked I’d switch to it as my main model tomorrow. But is it another 4.6? No, and my disappointment there says more about my expectations than about the model. A conversation about open models sharpened why. A friend and I were talking about GLM-5.2 and where local coding models are headed, and we landed in the same place: if you could run something with 4.6’s ability on your own hardware, you might stop caring much what the frontier ships next. That is the real weight of the plateau I described up top. Once the floor has risen to where one person and a harness can build almost anything, the next model has to lift the whole floor again to count, not just hand you a sharper tool to use on it.

Fable is a sharper tool on the same floor. That’s genuinely useful and I don’t want to undersell it, but it isn’t the floor rising. Whether the floor ever jumps like that again, I honestly can’t tell you. Maybe 4.6 was a one-time thing, the zero-to-one where agentic coding crossed from not-really to yes-really, and nothing later lands the same way because that particular leap only happens once. Or maybe a model two or three generations out just one-shots the whole project and makes this entire methodology look quaint. I can’t call it. What I can do is be ready to measure it: write down what you expect, change one thing, and let notes you can’t edit keep score. The harness didn’t change through any of this, and that, more than anything Fable did, is what I’d hang onto.

A Step Up, Not a Step Function

I know how that reads. Complain on Reddit that a new coding model left you underwhelmed and someone will be along to set you straight: “it’s above your skill level, bro.” And they have a point. Holding it right is the entire thesis of this blog and of zat.env, so I’m not really in a position to roll my eyes at the jab. Maybe the step function is right there in Fable and I just haven’t learned to hold this one yet. That’s the one possibility this whole experiment was never built to rule out.