Thinking Depth Regression
Audience: Software engineers building toward autonomous coding loops.
Reading time: ~15 minutes.
What Happened
In early April 2026, a detailed analysis of Claude Code session logs surfaced on GitHub (issue #42796, filed by Stella Laurenzo, with data analysis by Ben Vanik). The dataset covered 17,871 thinking blocks and 234,760 tool calls across 6,852 session files from January through March 2026, all from IREE compiler work, a serious production codebase.
The analysis showed measurable behavioral regression starting in mid-February: a 70% drop in the ratio of file reads to edits (from 6.6 to 2.0), a jump from zero to 173 stop-hook violations in 17 days, an 80x increase in API requests for roughly the same human effort, and a 642% increase in the model's use of the word “simplest” in its responses. The estimated cost for equivalent human effort went from $345/month to $42,121/month, because degraded output required endless correction cycles.
The original report attributed this to Anthropic secretly reducing thinking depth and then hiding the evidence by redacting thinking content from local transcripts. The reality turned out to be more nuanced, but the measured behavioral regressions are real regardless of root cause.
Three Changes, One Experience
Three things happened in February and March 2026 that together produced the experience users reported:
Adaptive thinking (February 9). Opus 4.6 replaced fixed thinking budgets with per-turn adaptive allocation, where the model decides how long to think on each turn. This is a reasonable optimization in theory: not every turn needs deep reasoning, and allocating thinking tokens dynamically should improve the cost-quality tradeoff.
Default effort dropped to medium (March 3). The effort level, which controls the ceiling on thinking depth, was quietly changed from high to medium (internally “85”) for all users. Anthropic’s rationale: medium effort is “a sweet spot on the intelligence-latency/cost curve for most users.” This was not communicated to users. Existing workflows that depended on high-effort reasoning silently degraded.
Thinking redaction (rolled out March 5-12). A header (redact-thinking-2026-02-12) began hiding thinking content from the UI and local session transcripts. Anthropic says this is purely a display change that does not affect actual thinking budgets. But it made the other two changes invisible: users could no longer see when the model was thinking shallowly, and transcript-based analysis tools could no longer measure thinking depth directly.
The original report’s causal story, “Anthropic secretly reduced thinking depth by 67% and hid it with redaction,” overstates what can be proven. The redaction header appears to be what Anthropic claims: a UI-only change. But the combination of adaptive thinking sometimes allocating poorly and the effort default dropping without notice produced a real, measurable quality regression that the redaction made harder to detect. The experience was the same whether the root cause was one change or three.
Boris’s Evolving Conclusion
Boris Cherny, the creator of Claude Code, responded in both the GitHub issue and on Hacker News. His thinking evolved visibly over the course of a single day.
First response: He pushed back on the original analysis. The thinking redaction header is UI-only, not a thinking budget change. Two configuration changes (adaptive thinking and the effort default) explain what users are seeing. The fix: use /effort high or /effort max, and optionally disable adaptive thinking with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. He closed the GitHub issue as “completed.”
After reviewing specific sessions: He examined five transcript IDs from a user whose sessions were already sending effort=high. His conclusion shifted:
“The data points at adaptive thinking under-allocating reasoning on certain turns. The specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. We’re investigating with the model team.”
This is the important finding. The sessions were already using high effort. The effort default was not the problem. Adaptive thinking was allocating zero reasoning on turns that needed it, and the model fabricated on exactly those turns. Turns with deep reasoning were correct. The correlation between thinking depth and output quality was direct and measurable.
Several commenters pointed out the contradiction: the issue was closed as “just change your settings,” but Boris’s own analysis confirmed a bug in adaptive thinking that settings alone do not fix. The issue remains closed as of this writing.
The cleanest summary: (1) the “67% hidden by redaction” theory is overstated; (2) the medium effort default probably explains part of the perceived drop; (3) but there also appears to be a real bug or failure mode in which adaptive thinking sometimes allocates far too little reasoning to turns that need it.
The Behavioral Evidence Matters More Than the Causal Story
The GitHub issue’s causal narrative may be partly wrong, but the behavioral measurements are exactly the kinds of things that matter for engineering workflows:
Read:Edit ratio collapsed. The model went from reading 6.6 files per edit to 2.0. One in three edits in the degraded period was made to files the model had not read. This is the single most damning metric. An agent that edits before reading is an agent that guesses.
Stop-hook violations appeared from nowhere. A hook catching ownership-dodging (“shall I…?”), permission-seeking, and premature stopping fired zero times before March 8 and 173 times in the 17 days after. The model was not just thinking less. It was behaving differently: hedging, asking permission instead of acting, declaring victory early.
Convention adherence degraded. The model stopped following CLAUDE.md instructions as reliably. Multiple users reported the model ignoring project conventions, custom skills, and explicit instructions in the system prompt.
Time-of-day variation appeared. Pre-change, thinking depth was roughly flat across hours. Post-change, there was an 8.8x ratio between the best and worst hours, with 5pm PST (peak US usage) being the worst. This suggests thinking allocation became load-sensitive.
The “simplest fix” pathology. The model’s tendency to propose the minimum viable hack instead of the correct solution increased measurably. One HN commenter captured it: “Whenever Claude says ‘the simplest fix is…’ it’s usually suggesting some horrible hack.”
These are not vibes. They are measurable workflow regressions that happen regardless of whether the root cause is redaction, adaptive routing, effort defaults, long context degradation, or some mix of all four.
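All of these metrics are reproducible from local transcripts, which is part of why they matter. A minimal sketch of the read:edit computation, assuming session logs are JSONL files in which each tool call carries a tool_name field (the actual Claude Code transcript schema may differ):

```python
import json
from pathlib import Path

def read_edit_ratio(session_dir):
    """Reads-per-edit across a directory of JSONL session transcripts.
    The tool_name field and the tool names are assumptions about the schema."""
    reads = edits = 0
    for path in Path(session_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines rather than fail the scan
            if not isinstance(event, dict):
                continue
            tool = event.get("tool_name", "")
            if tool == "Read":
                reads += 1
            elif tool in ("Edit", "Write", "MultiEdit"):
                edits += 1
    return reads / edits if edits else float("inf")
```

A ratio that trends from ~6.6 toward ~2.0 over successive weeks is the signature described above, visible without any access to thinking tokens.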
What the Community Discussion Revealed
The Hacker News thread (1,209 points, 111 top-level comments) and the GitHub issue (99 comments) split into three camps:
Camp 1: “You’re holding it wrong.” Users who adapted their workflows (plan mode, small tasks, explicit review at each step) reported few issues. Their argument: Claude Code works fine if you treat it as a tool that needs structure, not an autonomous agent you can trust to run unsupervised.
Camp 2: “They nerfed it.” Users who had invested in multi-agent autonomous workflows that previously worked were furious. The report author addressed this directly: “the community is gaslighting itself with ‘write a better prompt.’” These users had working systems that broke. No amount of prompt engineering fixes a model that allocates zero reasoning tokens to a turn.
Camp 3: “The AI-generated analysis cannot be trusted.” Multiple commenters questioned whether a report written by the model that is allegedly broken could be trusted. The counter: the data pipeline (stop hooks, log analysis) was built by the human; Claude formatted the output. And the behavioral metrics (stop-hook violations, read:edit ratios) are mechanical measurements, not model opinions.
The most useful insight came from users who had built observability into their workflows. Ben Vanik’s stop-phrase-guard hook, shared as a public gist, caught specific behavioral regressions in real time. Users with monitoring could distinguish “the model is having a bad turn” from “my prompt is wrong.” Users without monitoring could only guess.
Several concrete workarounds emerged:
- CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget (the key one; one user reported zero fabrications in 20 hours after setting it, with a 30-40% token cost increase)
- /effort max or CLAUDE_CODE_EFFORT_LEVEL=max overrides the default medium effort
- showThinkingSummaries: true in settings.json makes thinking visible again
- CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 forces shorter context to avoid late-session degradation
The Implication for Harness Design
Here is the conclusion I want to draw from all of this, and it goes beyond the specific February 2026 incident.
Agent harnesses should assume reasoning quality is variable and sometimes misallocated, even when the model is generally strong. This is not a Claude-specific problem. Any model that uses adaptive reasoning, which is the direction the entire industry is moving, will sometimes get the allocation wrong. Any provider that optimizes for cost efficiency across millions of users will sometimes serve a turn with less reasoning than that turn needed. Any system that makes internal allocation decisions opaque to the user will sometimes degrade without visible warning.
The safe design response is not to chase hidden internal state. You cannot control or even observe thinking token allocation from outside the model. The response is to make important work legible at the artifact boundary and verifiable at the tool boundary.
“Artifact boundary” means: the outputs the model produces (files, commits, review findings) should carry enough information that a downstream process or a human can evaluate them without needing to know how much thinking went into them. A commit that changes code the model never read is detectable from the outside. A review finding with no evidence citation is detectable from the outside. A “simplest fix” hack that does not match the spec criteria is detectable from the outside.
“Tool boundary” means: the harness can observe which tools the model calls, in what order, with what arguments. A model that calls Edit without a preceding Read on the same file is observable. A model that emits stop-phrases (“shall I continue?”, “the simplest approach would be”) is observable. These signals do not require access to thinking tokens. They are behavioral signatures of shallow reasoning that can be caught by hooks.
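Stop-phrase detection, in particular, needs nothing more than pattern matching on the model's final message. A minimal sketch in Python (the actual zat.env hooks are bash scripts, and this phrase list is illustrative, not Vanik's actual list):

```python
import re

# Illustrative stop-phrases; a production list would be tuned against
# real transcripts to keep false positives near zero.
STOP_PHRASES = [
    r"\bshall i\b",
    r"\bwould you like me to\b",
    r"\bthe simplest (fix|approach|solution)\b",
    r"\blet me know if\b",
]
_PATTERN = re.compile("|".join(STOP_PHRASES), re.IGNORECASE)

def find_stop_phrases(message: str) -> list[str]:
    """Return every stop-phrase match in an assistant message.
    A non-empty result is a behavioral signal of a shallow turn."""
    return [m.group(0) for m in _PATTERN.finditer(message)]
```

Wired into a stop hook, a non-empty result can block the turn from ending and feed the violation back to the model, which is exactly the mechanism that produced the 173-violation count above.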
This is where the design philosophy shifts from “make the model think harder” (which you cannot control) to “detect and contain the damage when it does not” (which you can).
zat.env was already pointed in this direction before the incident. Effort: max on all review and spec skills. A content-addressed pre-push gate that blocks code from leaving the machine without passing review. Spec-driven acceptance criteria that constrain “simplest fix” drift. Small increments with test verification and a two-failure circuit breaker (see The Bitter Lesson of Agentic Coding). These are defenses against variable output quality at the artifact boundary. The opportunity is to finish that thought and move from “good gated workflow” to “workflow with shallow-turn detectors.”
What We Changed
After analyzing the incident, the community workarounds, and what zat.env’s architecture already provides, we made three small changes. The decision process is more interesting than the changes themselves, because the most important decisions were about what not to do.
Bumped effortLevel to high. The install script now sets effortLevel: "high" in settings.json, overriding the silent medium default. This protects implementation turns between skill checkpoints. Skills already override to effort: max via frontmatter at critical decision points (spec definition, code review, security review, architecture assessment). The effort level setting is a first-class, versioned, documented configuration key. It is not a prompt. It is infrastructure.
Enabled showThinkingSummaries. The install script now sets showThinkingSummaries: true in settings.json, reversing the default redaction. This is pure observability: it makes thinking visible in the UI so a human can notice when reasoning is shallow. It costs nothing and provides an ambient signal.
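Concretely, these first two changes reduce to a two-key settings.json fragment (key names as used above; whatever other keys a real settings.json carries are omitted here):

```json
{
  "effortLevel": "high",
  "showThinkingSummaries": true
}
```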
Added claude-fixed-reasoning convenience script. A three-line shell script in ~/bin/ that launches Claude Code with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. This forces a fixed reasoning budget instead of per-turn adaptive allocation. It exists as an opt-in tool choice, not a global default.
That last point is the important design decision. We deliberately did not apply the adaptive thinking disable globally. Here is why.
What We Rejected and Why
Global CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. This was the most-discussed workaround in the community, and the one Boris specifically recommended after confirming the zero-reasoning bug. But disabling adaptive thinking on Opus 4.6 loses interleaved thinking entirely and reverts to a deprecated code path. The tradeoff is not just “30-40% more tokens.” It is a qualitative regression on the model’s architecture. Meanwhile, zat.env’s pre-push gate already contains bad implementation turns at the output boundary: nothing ships without review at effort: max. The cost of a shallow turn in a gated workflow is wasted tokens and time, not shipped bad code. A convenience script lets the human opt into fixed reasoning for specific sessions where thrash is unacceptable. A global default would impose the cost everywhere, including turns where adaptive allocation works fine.
Instruction-based workarounds. The Reddit and dev.to communities generated dozens of CLAUDE.md instructions aimed at counteracting shallow reasoning: “prefer correct, complete implementations over minimal ones,” “fix the root cause, not the symptom,” “use appropriate data structures.” We evaluated these carefully and rejected all of them.
The reasoning: these are prompt-level requests competing with Claude Code’s own system prompt, which includes language like “Return the simplest working solution.” CLAUDE.md content is wrapped in a system reminder that explicitly tells the model it “may or may not be relevant.” Community research indicates CLAUDE.md compliance is approximately 70-80%. On shallow turns, where the model is reasoning minimally, prompt instructions are least likely to be followed. On deep turns, where the model is reasoning fully, these instructions are superfluous because the model already does them.
More fundamentally, zat.env handles these failure modes with stronger mechanisms. “Be complete” is weaker than a spec with concrete acceptance criteria. “Fix root cause” is weaker than a code review gate that catches band-aid fixes. Instructions are suggestions. Hooks are enforcement. The difference matters exactly when the model is reasoning shallowly, which is exactly when instructions are most needed and least effective.
There is also a budget constraint. Community research converges on an instruction ceiling of approximately 150-200 distinct rules that frontier LLMs reliably follow. Claude Code’s system prompt consumes roughly 50 of those. Each additional CLAUDE.md instruction competes for the remaining budget. Adding instructions that duplicate what the review gate already enforces is not free: it dilutes the instructions that are not enforced elsewhere.
Behavioral hooks (edit-without-read detection, stop-phrase guard). These are the most architecturally interesting mitigations. Ben Vanik’s stop-phrase-guard hook caught 173 violations with zero false positives. An edit-without-read hook would enforce the discipline that collapsed from a 6.6 to 2.0 read:edit ratio. Both fit naturally into zat.env’s existing hook architecture (bash scripts, registered in settings.json).
We deferred these, not rejected them. They are the right direction but require careful design: state tracking across tool calls for read-before-edit, false-positive tuning for stop-phrase detection, and interaction testing with the existing hook pipeline. They are candidates for the next iteration, not responses to deploy alongside a settings change.
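To make the deferred design concrete: read-before-edit enforcement needs per-session state that survives across hook invocations, because each tool call runs the hook as a fresh process. A sketch of that state tracking, in Python for clarity even though the existing hooks are bash; the state-file location and the idea that a False result maps to a blocking exit code are both assumptions:

```python
import json
from pathlib import Path

def check_tool_call(tool_name, file_path, state_file):
    """Track which files the session has Read; flag Edits to files
    never read. Returns True if the call is allowed, False on a
    violation. In a real hook, False would become a blocking exit."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    if tool_name == "Read":
        seen.add(file_path)
        state_file.write_text(json.dumps(sorted(seen)))
        return True
    if tool_name in ("Edit", "MultiEdit"):
        # Write is deliberately exempt: creating a new file needs no prior Read.
        return file_path in seen
    return True  # other tools are out of scope for this check
```

A real hook would parse the tool name and file path out of whatever payload the harness passes on stdin and translate a False result into the pipeline's blocking convention; both details are assumptions here, which is part of why this mitigation was deferred rather than shipped.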
The Principle
The incident is specific to Claude Code, adaptive thinking, and a particular set of configuration changes. But the lesson generalizes.
The parallel to traditional infrastructure is exact. We do not build web services assuming the network is always fast. We build them assuming the network is variable: sometimes fast, sometimes slow, sometimes down. We add timeouts, retries, circuit breakers, and health checks. We design for degraded operation because degraded operation is a certainty, not an edge case.
Agentic coding harnesses need the same mindset. Design for variable reasoning quality. Assume some turns will be shallow. Gate important transitions on high-effort verification. Keep the blast radius of any single bad turn small through incremental work and frequent testing. And when a specific, confirmed bug makes the variability worse than it should be, give the human a tool to opt into a different allocation strategy for that session, rather than imposing a global workaround that trades one set of capabilities for another.
The response is not “stuff the harness with more prompt text.” Prompts are suggestions. Hooks are enforcement. The difference matters exactly when the model is reasoning shallowly, which is exactly when prompts are least likely to be followed and hooks are most needed.
The model will get better. The adaptive thinking bug will probably be fixed. The effort default may be reconsidered. But the structural property, that reasoning quality is variable and sometimes misallocated, will remain true for any adaptive system. Building defenses against that property is a durable investment.
April 13, 2026: What We Know Now
Six days after the original post, the story has expanded. The adaptive thinking regression was not the only silent change in March 2026. A parallel investigation uncovered a prompt cache TTL regression, and Boris Cherny’s public comments on both issues have filled in architectural details that were invisible at the time of writing.
A Fourth Change: The Cache TTL Regression
The original post identified three changes in February and March that together produced the quality regression. There was a fourth.
On March 6, Anthropic silently changed how Claude Code selects prompt cache TTL (time-to-live) per request. A forensic analysis of 119,866 API calls across January through April, filed as GitHub issue #46829 by Sean Swanson, showed the transition precisely:
- February 1 through March 5: 100% of cache tokens used 1-hour TTL, zero exceptions across 33 consecutive days.
- March 6-7: Transition period, 5-minute tokens reappear.
- March 8 onward: 5-minute tokens dominate (83-93% of cache writes).
The 1-hour cache costs more to write (2x base input price vs 1.25x) but the same to read (0.1x). The tradeoff depends on whether cached content is reused within the hour. Anthropic’s Jarred Sumner called it “ongoing optimization work” and argued the change is net cheaper because many requests are one-shot calls where the 1-hour write premium is wasted. He is probably right for API users who pay per token. He is probably wrong for subscription users who are quota-limited: more cache misses mean more cache_creation tokens, which burn quota faster regardless of per-token price.
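The break-even falls out of those multipliers directly. A sketch under two assumptions: every request in the session arrives more than five minutes after the previous one (so 5-minute entries always expire), and the cached prefix is the same size each turn:

```python
def cached_prefix_cost(n_requests, ttl_1h):
    """Relative cost of the cached prompt prefix over a session of
    n_requests turns, each separated by a 5-to-60-minute gap.
    Units: multiples of (prefix tokens x base input price)."""
    if ttl_1h:
        # One write at 2x base price, then every later turn is a 0.1x read.
        return 2.0 + 0.1 * (n_requests - 1)
    # Each gap outlives the 5-minute TTL, so every turn re-writes at 1.25x.
    return 1.25 * n_requests
```

For a one-shot call the 5-minute TTL is cheaper (1.25 vs 2.0), which is the case Sumner describes; by the second turn within the hour the 1-hour TTL already wins (2.1 vs 2.5), and the gap widens every turn after that. The load-bearing variable is the gap pattern: sessions whose gaps stay under five minutes keep the 5-minute cache warm and change the math entirely.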
Boris confirmed the architecture in a follow-up comment: the main agent uses 1-hour cache, subagents use 5-minute cache, and the client selects TTL per request based on expected reuse pattern. He also revealed a coupling that nobody had documented: when users disable telemetry, experiment gates are also disabled, and the default TTL falls to 5 minutes. Privacy-conscious users silently get worse cache behavior. A fix (changing the client-side default) was described as imminent.
The Source Code Leak
On March 31, an npm build error accidentally exposed Claude Code’s 510,000-line TypeScript codebase. Community analysis found two mechanisms that break prompt cache matching:
- Attestation data (anti-abuse proof) attached to every request contains values that differ per request, invalidating cache.
- Anti-distillation tools: a flag injects varying fake tool definitions into the system prompt, triggering cache invalidation.
The analysis also noted zero tests for 64,464 lines of code. This is relevant not as gossip but as structural context: the cache optimization work Anthropic describes as “ongoing” is happening in an untested codebase where silent behavioral changes are easy to introduce and hard to catch.
Boris’s Evolving Position, Continued
In the original post, I traced how Boris’s position shifted over a single day from “just change your settings” to “adaptive thinking is allocating zero reasoning on certain turns.” The arc continued.
On April 12-13, responding to the quota exhaustion issue (#45756, 708 points on HN), Boris identified prompt cache misses on 1M context windows as the primary cost driver. He proposed reducing the default context window from 1M to 400k tokens, and described plans for “better UX to make this visible and more intelligent truncation/pruning.”
He also stated, without elaboration: “We ruled out adaptive thinking, other kinds of harness regressions, model and inference regressions” as causes of the quota issue. This is consistent: the quota problem is a cache/cost issue, not a reasoning quality issue. But the two compound. Users who burn quota faster get fewer total turns, and if some of those turns also have shallow reasoning due to the adaptive thinking bug, the experience is doubly degraded.
The pattern across both incidents is the same: transparency arrives after community forensics, not before. Users with logging infrastructure (Ben Vanik’s stop-phrase hooks, Sean Swanson’s JSONL analysis, cnighswonger’s cache interceptor) diagnosed the problems. Anthropic’s public explanations came in response to evidence that was already public. This is not unusual for infrastructure providers, but it reinforces the design principle from the original post: observability at the boundary is not optional, because the provider’s internal state is opaque and changes without notice.
What the Original Post Got Right
The central thesis holds up better than I expected.
“Agent harnesses should assume reasoning quality is variable and sometimes misallocated.” The cache TTL regression adds a second axis of variability. It is not just thinking depth that fluctuates; the cost and latency characteristics of the underlying infrastructure also shift silently. The design response is the same: gate important transitions on verification, not on assumptions about internal state.
“The response is not ‘stuff the harness with more prompt text.’” The community continued generating CLAUDE.md instructions through April. None of the users who reported recovery attributed it to prompt changes. The users who recovered attributed it to CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1, /effort max, or shorter context windows. Infrastructure changes fixed infrastructure problems. Prompts did not.
“Prompts are suggestions. Hooks are enforcement.” Still the most load-bearing sentence in the piece. The cache TTL coupling with telemetry is a perfect illustration: users who thought they were making a privacy choice were unknowingly making a performance choice. No prompt instruction can compensate for 5-minute cache expiry on a 1M context window. Only infrastructure-level awareness of the tradeoff helps.
The three specific changes we made (effortLevel: high, showThinkingSummaries: true, claude-fixed-reasoning convenience script) were correctly scoped. None needed to be revised. The decision to not globally disable adaptive thinking was vindicated: the convenience script gives the human a session-level choice, while the pre-push gate contains bad turns regardless.
What the Original Post Missed
The cost dimension. The original post framed the problem entirely as reasoning quality. The cache TTL regression shows there is a parallel cost/quota story: silent infrastructure changes can make the same workflow burn resources faster without changing output quality at all. A complete harness design needs to account for both. You can have good reasoning quality and still burn your quota in 90 minutes if cache behavior changes underneath you.
The telemetry trap. The coupling between telemetry opt-out and degraded cache behavior was not known at the time. It is a specific instance of a general pattern: provider-side feature flags that bundle unrelated behaviors. Disabling telemetry should not change cache TTL. That it does is an implementation detail that leaks into user-facing quality.
Subagent cost amplification. The original post did not discuss subagent architecture. Community analysis showed subagents were sometimes receiving entire session contexts, and their 5-minute cache TTL means any pause longer than five minutes between subagent turns triggers full cache recreation. This is relevant to zat.env because the skill architecture spawns subagents for review, spec, and security passes. The cost of those subagent turns is not just the tokens they consume, but the cache writes they trigger.
What This Means for zat.env
The original post described what zat.env already had in place. Here is the updated picture, separating what was already mitigated from what remains exposed.
Already mitigated, no changes needed:
- effortLevel: high in settings.json overrides the silent medium default. Skills override to effort: max at critical checkpoints (spec, review, security, architecture). This layer is working as designed.
- showThinkingSummaries: true restores thinking visibility. Still useful as an ambient signal.
- The pre-push review gate catches bad code at the artifact boundary regardless of thinking depth, cache behavior, or any other invisible infrastructure change. This is the most durable defense and the one least affected by anything discovered since April 7.
- Small increments with test verification and the two-failure circuit breaker keep the blast radius of any single bad turn small. A cache miss that causes a shallow turn wastes one increment, not an entire session.
- The claude-fixed-reasoning convenience script remains correctly scoped as an opt-in tool, not a global default.
Still exposed, candidates for the next iteration:
The behavioral hooks deferred in the original post are now more strongly justified. The cache TTL data shows that turns after cache misses (any pause longer than 5 minutes for subagents, or longer than 1 hour for the main agent) are structurally different from cache-hit turns: the model reconstructs context from scratch, and the first turn after a miss is where shallow reasoning is most likely to appear. An edit-without-read hook would catch the specific failure mode (read:edit ratio collapse) that the original data showed.
Context window management is the new frontier. CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 reduces the cost of cache misses by 60% (400k vs 1M context). This is a blunt instrument, trading context depth for cache resilience, but it is the kind of tradeoff a harness should make explicitly rather than leaving to the provider’s default. Whether zat.env should set this globally or offer it as a configurable default is an open question.
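The 60% figure is just the linear scaling of re-write cost with context size; the per-token price below is an illustrative assumption, not Anthropic's actual rate, and it cancels out of the ratio:

```python
def miss_rewrite_cost(context_tokens, write_multiplier=1.25,
                      base_price_per_mtok=3.0):
    """Cost of re-writing the whole context after a cache miss.
    base_price_per_mtok is illustrative and cancels out of the
    savings ratio computed below."""
    return context_tokens / 1_000_000 * write_multiplier * base_price_per_mtok

full = miss_rewrite_cost(1_000_000)   # default 1M window
capped = miss_rewrite_cost(400_000)   # CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000
savings = 1 - capped / full           # 0.6: each miss costs 60% less
```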
The telemetry coupling should be documented. If a zat.env user disables telemetry (a reasonable privacy choice), they should know they are also getting 5-minute cache TTL. The install script could detect the telemetry setting and warn, or set an explicit cache TTL override when one becomes available.
The Broader Pattern
The original post’s concluding principle was: “The model will get better. The adaptive thinking bug will probably be fixed. But the structural property, that reasoning quality is variable and sometimes misallocated, will remain true for any adaptive system.”
Six days later, the principle extends: the structural property is not just about reasoning quality. It is about every dimension of the provider’s infrastructure that is invisible to the user: cache TTL, context window behavior, experiment gates, subagent routing, quota accounting. All of these change without notice. All of them affect the user’s experience. None of them are observable from outside unless you build the instrumentation.
The users who navigated both incidents successfully were the ones with logging, hooks, and session analysis tooling. The users who struggled were the ones who trusted the provider’s defaults and had no way to detect when those defaults changed. This is the same lesson the original post drew about thinking depth, applied one level deeper to the infrastructure underneath.
Build the observability. Gate the outputs. Assume the floor will move.
References (original):
- Vanik, B. (2026). Extended Thinking Is Load-Bearing for Senior Engineering Workflows. GitHub issue with quantitative session log analysis of Claude Code behavioral regression.
- Vanik, B. (2026). stop-phrase-guard.sh. Hook script catching ownership-dodging and premature-stopping patterns.
- Cherny, B. (2026). HN comment on adaptive thinking under-allocation. Confirms adaptive thinking allocating zero reasoning on turns that needed it.
- Hacker News (2026). Claude Code is unusable for complex engineering tasks with Feb updates. 1,209-point discussion thread.
- Carlini, N. (2026). Building a C Compiler with a Team of Parallel Claudes. The demonstration that verification loop quality determines the ceiling of autonomous output.
- Rajasekaran, P. (2026). Harness Design for Long-Running Application Development. Generator-evaluator architecture for extended autonomous sessions.
References (April 13 update):
- Swanson, S. (2026). Cache TTL silently regressed from 1h to 5m around early March 2026. Forensic analysis of 119,866 API calls showing the March 6 cache TTL transition.
- Sumner, J. (2026). Comment on cache TTL issue. Anthropic staff confirming the March 6 change was intentional and explaining per-request TTL selection.
- Cherny, B. (2026). HN comment on cache architecture. Describes 1h/5m TTL tiers, experiment gate coupling with telemetry, and planned env var overrides.
- Hacker News (2026). Pro Max 5x Quota Exhausted in 1.5 Hours Despite Moderate Usage. 708-point discussion thread on cache miss costs and quota exhaustion.
- Reddit r/ClaudeCode (2026). Boris is claiming that Claude Code has a one hour cache. Community discussion of Boris’s cache TTL claims and independent verification.