Thinking Depth Regression
Audience: Software engineers building toward autonomous coding loops.
Reading time: ~15 minutes.
What Happened
In early April 2026, a detailed analysis of Claude Code session logs surfaced on GitHub (issue #42796, filed by Stella Laurenzo, with data analysis by Ben Vanik). The dataset covered 17,871 thinking blocks and 234,760 tool calls across 6,852 session files from January through March 2026, all from IREE compiler work, a serious production codebase.
The analysis showed measurable behavioral regression starting in mid-February: a 70% drop in the ratio of file reads to edits (from 6.6 to 2.0), a jump from zero to 173 stop-hook violations in 17 days, an 80x increase in API requests for roughly the same human effort, and a 642% increase in the model's use of the word “simplest” in its responses. The estimated cost for equivalent human effort went from $345/month to $42,121/month, because degraded output required endless correction cycles.
The original report attributed this to Anthropic secretly reducing thinking depth and then hiding the evidence by redacting thinking content from local transcripts. The reality turned out to be more nuanced, but the measured behavioral regressions are real regardless of root cause.
Three Changes, One Experience
Three things happened in February and March 2026 that together produced the experience users reported:
Adaptive thinking (February 9). Opus 4.6 replaced fixed thinking budgets with per-turn adaptive allocation, where the model decides how long to think on each turn. This is a reasonable optimization in theory: not every turn needs deep reasoning, and allocating thinking tokens dynamically should improve the cost-quality tradeoff.
Default effort dropped to medium (March 3). The effort level, which controls the ceiling on thinking depth, was quietly changed from high to medium (internally “85”) for all users. Anthropic’s rationale: medium effort is “a sweet spot on the intelligence-latency/cost curve for most users.” This was not communicated to users. Existing workflows that depended on high-effort reasoning silently degraded.
Thinking redaction (rolled out March 5-12). A header (redact-thinking-2026-02-12) began hiding thinking content from the UI and local session transcripts. Anthropic says this is purely a display change that does not affect actual thinking budgets. But it made the other two changes invisible: users could no longer see when the model was thinking shallowly, and transcript-based analysis tools could no longer measure thinking depth directly.
The original report’s causal story, “Anthropic secretly reduced thinking depth by 67% and hid it with redaction,” overstates what can be proven. The redaction header appears to be what Anthropic claims: a UI-only change. But the combination of adaptive thinking sometimes allocating poorly and the effort default dropping without notice produced a real, measurable quality regression that the redaction made harder to detect. The experience was the same whether the root cause was one change or three.
Boris’s Evolving Conclusion
Boris Cherny, the creator of Claude Code, responded in both the GitHub issue and on Hacker News. His thinking evolved visibly over the course of a single day.
First response: He pushed back on the original analysis. The thinking redaction header is UI-only, not a thinking budget change. Two configuration changes (adaptive thinking and the effort default) explain what users are seeing. The fix: use /effort high or /effort max, and optionally disable adaptive thinking with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. He closed the GitHub issue as “completed.”
After reviewing specific sessions: He examined five transcript IDs from a user whose sessions were already sending effort=high. His conclusion shifted:
“The data points at adaptive thinking under-allocating reasoning on certain turns. The specific turns where it fabricated (stripe API version, git SHA suffix, apt package list) had zero reasoning emitted, while the turns with deep reasoning were correct. We’re investigating with the model team.”
This is the important finding. The sessions were already using high effort. The effort default was not the problem. Adaptive thinking was allocating zero reasoning on turns that needed it, and the model fabricated on exactly those turns. Turns with deep reasoning were correct. The correlation between thinking depth and output quality was direct and measurable.
Several commenters pointed out the contradiction: the issue was closed as “just change your settings,” but Boris’s own analysis confirmed a bug in adaptive thinking that settings alone do not fix. The issue remains closed as of this writing.
The cleanest summary: (1) the “67% hidden by redaction” theory is overstated; (2) the medium effort default probably explains part of the perceived drop; (3) but there also appears to be a real bug or failure mode in which adaptive thinking sometimes allocates far too little reasoning to turns that need it.
The Behavioral Evidence Matters More Than the Causal Story
The GitHub issue’s causal narrative may be partly wrong, but the behavioral measurements are exactly the kinds of things that matter for engineering workflows:
Read:Edit ratio collapsed. The model went from reading 6.6 files per edit to 2.0. One in three edits in the degraded period was made to files the model had not read. This is the single most damning metric. An agent that edits before reading is an agent that guesses.
Stop-hook violations appeared from nowhere. A hook catching ownership-dodging (“shall I…?”), permission-seeking, and premature stopping fired zero times before March 8 and 173 times in the 17 days after. The model was not just thinking less. It was behaving differently: hedging, asking permission instead of acting, declaring victory early.
Convention adherence degraded. The model stopped following CLAUDE.md instructions as reliably. Multiple users reported the model ignoring project conventions, custom skills, and explicit instructions in the system prompt.
Time-of-day variation appeared. Pre-change, thinking depth was roughly flat across hours. Post-change, there was an 8.8x ratio between the best and worst hours, with 5pm PST (peak US usage) being the worst. This suggests thinking allocation became load-sensitive.
The “simplest fix” pathology. The model’s tendency to propose the minimum viable hack instead of the correct solution increased measurably. One HN commenter captured it: “Whenever Claude says ‘the simplest fix is…’ it’s usually suggesting some horrible hack.”
These are not vibes. They are measurable workflow regressions that happen regardless of whether the root cause is redaction, adaptive routing, effort defaults, long context degradation, or some mix of all four.
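All of these metrics are reproducible from local transcripts, which is part of why they matter. A minimal sketch of the read:edit computation, assuming session logs are JSONL files in which each tool call carries a tool_name field (the actual Claude Code transcript schema may differ):

```python
import json
from pathlib import Path

def read_edit_ratio(session_dir):
    """Reads-per-edit across a directory of JSONL session transcripts.
    The tool_name field and the tool names are assumptions about the schema."""
    reads = edits = 0
    for path in Path(session_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines rather than fail the scan
            if not isinstance(event, dict):
                continue
            tool = event.get("tool_name", "")
            if tool == "Read":
                reads += 1
            elif tool in ("Edit", "Write", "MultiEdit"):
                edits += 1
    return reads / edits if edits else float("inf")
```

A ratio that trends from ~6.6 toward ~2.0 over successive weeks is the signature described above, visible without any access to thinking tokens.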
What the Community Discussion Revealed
The Hacker News thread (1,209 points, 111 top-level comments) and the GitHub issue (99 comments) split into three camps:
Camp 1: “You’re holding it wrong.” Users who adapted their workflows (plan mode, small tasks, explicit review at each step) reported few issues. Their argument: Claude Code works fine if you treat it as a tool that needs structure, not an autonomous agent you can trust to run unsupervised.
Camp 2: “They nerfed it.” Users who had invested in multi-agent autonomous workflows that previously worked were furious. The report author addressed this directly: “the community is gaslighting itself with ‘write a better prompt.’” These users had working systems that broke. No amount of prompt engineering fixes a model that allocates zero reasoning tokens to a turn.
Camp 3: “The AI-generated analysis cannot be trusted.” Multiple commenters questioned whether a report written by the model that is allegedly broken could be trusted. The counter: the data pipeline (stop hooks, log analysis) was built by the human; Claude formatted the output. And the behavioral metrics (stop-hook violations, read:edit ratios) are mechanical measurements, not model opinions.
The most useful insight came from users who had built observability into their workflows. Ben Vanik’s stop-phrase-guard hook, shared as a public gist, caught specific behavioral regressions in real time. Users with monitoring could distinguish “the model is having a bad turn” from “my prompt is wrong.” Users without monitoring could only guess.
Several concrete workarounds emerged:
- CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 forces a fixed reasoning budget (the key one; one user reported zero fabrications in 20 hours after setting it, with a 30-40% token cost increase)
- /effort max or CLAUDE_CODE_EFFORT_LEVEL=max overrides the default medium effort
- showThinkingSummaries: true in settings.json makes thinking visible again
- CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 forces shorter context to avoid late-session degradation
The Implication for Harness Design
Here is the conclusion I want to draw from all of this, and it goes beyond the specific February 2026 incident.
Agent harnesses should assume reasoning quality is variable and sometimes misallocated, even when the model is generally strong. This is not a Claude-specific problem. Any model that uses adaptive reasoning, which is the direction the entire industry is moving, will sometimes get the allocation wrong. Any provider that optimizes for cost efficiency across millions of users will sometimes serve a turn with less reasoning than that turn needed. Any system that makes internal allocation decisions opaque to the user will sometimes degrade without visible warning.
The safe design response is not to chase hidden internal state. You cannot control or even observe thinking token allocation from outside the model. The response is to make important work legible at the artifact boundary and verifiable at the tool boundary.
“Artifact boundary” means: the outputs the model produces (files, commits, review findings) should carry enough information that a downstream process or a human can evaluate them without needing to know how much thinking went into them. A commit that changes code the model never read is detectable from the outside. A review finding with no evidence citation is detectable from the outside. A “simplest fix” hack that does not match the spec criteria is detectable from the outside.
“Tool boundary” means: the harness can observe which tools the model calls, in what order, with what arguments. A model that calls Edit without a preceding Read on the same file is observable. A model that emits stop-phrases (“shall I continue?”, “the simplest approach would be”) is observable. These signals do not require access to thinking tokens. They are behavioral signatures of shallow reasoning that can be caught by hooks.
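Stop-phrase detection, in particular, needs nothing more than pattern matching on the model's final message. A minimal sketch in Python (the actual zat.env hooks are bash scripts, and this phrase list is illustrative, not Vanik's actual list):

```python
import re

# Illustrative stop-phrases; a production list would be tuned against
# real transcripts to keep false positives near zero.
STOP_PHRASES = [
    r"\bshall i\b",
    r"\bwould you like me to\b",
    r"\bthe simplest (fix|approach|solution)\b",
    r"\blet me know if\b",
]
_PATTERN = re.compile("|".join(STOP_PHRASES), re.IGNORECASE)

def find_stop_phrases(message: str) -> list[str]:
    """Return every stop-phrase match in an assistant message.
    A non-empty result is a behavioral signal of a shallow turn."""
    return [m.group(0) for m in _PATTERN.finditer(message)]
```

Wired into a stop hook, a non-empty result can block the turn from ending and feed the violation back to the model, which is exactly the mechanism that produced the 173-violation count above.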
This is where the design philosophy shifts from “make the model think harder” (which you cannot control) to “detect and contain the damage when it does not” (which you can).
zat.env was already pointed in this direction before the incident. Effort: max on all review and spec skills. A content-addressed pre-push gate that blocks code from leaving the machine without passing review. Spec-driven acceptance criteria that constrain “simplest fix” drift. Small increments with test verification and a two-failure circuit breaker (see The Bitter Lesson of Agentic Coding). These are defenses against variable output quality at the artifact boundary. The opportunity is to finish that thought and move from “good gated workflow” to “workflow with shallow-turn detectors.”
What We Changed
After analyzing the incident, the community workarounds, and what zat.env’s architecture already provides, we made three small changes. The decision process is more interesting than the changes themselves, because the most important decisions were about what not to do.
Bumped effortLevel to high. The install script now sets effortLevel: "high" in settings.json, overriding the silent medium default. This protects implementation turns between skill checkpoints. Skills already override to effort: max via frontmatter at critical decision points (spec definition, code review, security review, architecture assessment). The effort level setting is a first-class, versioned, documented configuration key. It is not a prompt. It is infrastructure.
Enabled showThinkingSummaries. The install script now sets showThinkingSummaries: true in settings.json, reversing the default redaction. This is pure observability: it makes thinking visible in the UI so a human can notice when reasoning is shallow. It costs nothing and provides an ambient signal.
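Concretely, these first two changes reduce to a two-key settings.json fragment (key names as used above; whatever other keys a real settings.json carries are omitted here):

```json
{
  "effortLevel": "high",
  "showThinkingSummaries": true
}
```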
Added claude-fixed-reasoning convenience script. A three-line shell script in ~/bin/ that launches Claude Code with CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. This forces a fixed reasoning budget instead of per-turn adaptive allocation. It exists as an opt-in tool choice, not a global default.
That last point is the important design decision. We deliberately did not apply the adaptive thinking disable globally. Here is why.
What We Rejected and Why
Global CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1. This was the most-discussed workaround in the community, and the one Boris specifically recommended after confirming the zero-reasoning bug. But disabling adaptive thinking on Opus 4.6 loses interleaved thinking entirely and reverts to a deprecated code path. The tradeoff is not just “30-40% more tokens.” It is a qualitative regression on the model’s architecture. Meanwhile, zat.env’s pre-push gate already contains bad implementation turns at the output boundary: nothing ships without review at effort: max. The cost of a shallow turn in a gated workflow is wasted tokens and time, not shipped bad code. A convenience script lets the human opt into fixed reasoning for specific sessions where thrash is unacceptable. A global default would impose the cost everywhere, including turns where adaptive allocation works fine.
Instruction-based workarounds. The Reddit and dev.to communities generated dozens of CLAUDE.md instructions aimed at counteracting shallow reasoning: “prefer correct, complete implementations over minimal ones,” “fix the root cause, not the symptom,” “use appropriate data structures.” We evaluated these carefully and rejected all of them.
The reasoning: these are prompt-level requests competing with Claude Code’s own system prompt, which includes language like “Return the simplest working solution.” CLAUDE.md content is wrapped in a system reminder that explicitly tells the model it “may or may not be relevant.” Community research indicates CLAUDE.md compliance is approximately 70-80%. On shallow turns, where the model is reasoning minimally, prompt instructions are least likely to be followed. On deep turns, where the model is reasoning fully, these instructions are superfluous because the model already does them.
More fundamentally, zat.env handles these failure modes with stronger mechanisms. “Be complete” is weaker than a spec with concrete acceptance criteria. “Fix root cause” is weaker than a code review gate that catches band-aid fixes. Instructions are suggestions. Hooks are enforcement. The difference matters exactly when the model is reasoning shallowly, which is exactly when instructions are most needed and least effective.
There is also a budget constraint. Community research converges on an instruction ceiling of approximately 150-200 distinct rules that frontier LLMs reliably follow. Claude Code’s system prompt consumes roughly 50 of those. Each additional CLAUDE.md instruction competes for the remaining budget. Adding instructions that duplicate what the review gate already enforces is not free: it dilutes the instructions that are not enforced elsewhere.
Behavioral hooks (edit-without-read detection, stop-phrase guard). These are the most architecturally interesting mitigations. Ben Vanik’s stop-phrase-guard hook caught 173 violations with zero false positives. An edit-without-read hook would enforce the discipline that collapsed from a 6.6 to 2.0 read:edit ratio. Both fit naturally into zat.env’s existing hook architecture (bash scripts, registered in settings.json).
We deferred these, not rejected them. They are the right direction but require careful design: state tracking across tool calls for read-before-edit, false-positive tuning for stop-phrase detection, and interaction testing with the existing hook pipeline. They are candidates for the next iteration, not responses to deploy alongside a settings change.
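To make the deferred design concrete: read-before-edit enforcement needs per-session state that survives across hook invocations, because each tool call runs the hook as a fresh process. A sketch of that state tracking, in Python for clarity even though the existing hooks are bash; the state-file location and the idea that a False result maps to a blocking exit code are both assumptions:

```python
import json
from pathlib import Path

def check_tool_call(tool_name, file_path, state_file):
    """Track which files the session has Read; flag Edits to files
    never read. Returns True if the call is allowed, False on a
    violation. In a real hook, False would become a blocking exit."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    if tool_name == "Read":
        seen.add(file_path)
        state_file.write_text(json.dumps(sorted(seen)))
        return True
    if tool_name in ("Edit", "MultiEdit"):
        # Write is deliberately exempt: creating a new file needs no prior Read.
        return file_path in seen
    return True  # other tools are out of scope for this check
```

A real hook would parse the tool name and file path out of whatever payload the harness passes on stdin and translate a False result into the pipeline's blocking convention; both details are assumptions here, which is part of why this mitigation was deferred rather than shipped.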
The Principle
The incident is specific to Claude Code, adaptive thinking, and a particular set of configuration changes. But the lesson generalizes.
The parallel to traditional infrastructure is exact. We do not build web services assuming the network is always fast. We build them assuming the network is variable: sometimes fast, sometimes slow, sometimes down. We add timeouts, retries, circuit breakers, and health checks. We design for degraded operation because degraded operation is a certainty, not an edge case.
Agentic coding harnesses need the same mindset. Design for variable reasoning quality. Assume some turns will be shallow. Gate important transitions on high-effort verification. Keep the blast radius of any single bad turn small through incremental work and frequent testing. And when a specific, confirmed bug makes the variability worse than it should be, give the human a tool to opt into a different allocation strategy for that session, rather than imposing a global workaround that trades one set of capabilities for another.
The response is not “stuff the harness with more prompt text.” Prompts are suggestions. Hooks are enforcement. The difference matters exactly when the model is reasoning shallowly, which is exactly when prompts are least likely to be followed and hooks are most needed.
The model will get better. The adaptive thinking bug will probably be fixed. The effort default may be reconsidered. But the structural property, that reasoning quality is variable and sometimes misallocated, will remain true for any adaptive system. Building defenses against that property is a durable investment.
April 13, 2026: What We Know Now
Six days after the original post, the story has expanded. The adaptive thinking regression was not the only silent change in March 2026. A parallel investigation uncovered a prompt cache TTL regression, and Boris Cherny’s public comments on both issues have filled in architectural details that were invisible at the time of writing.
A Fourth Change: The Cache TTL Regression
The original post identified three changes in February and March that together produced the quality regression. There was a fourth.
On March 6, Anthropic silently changed how Claude Code selects prompt cache TTL (time-to-live) per request. A forensic analysis of 119,866 API calls across January through April, filed as GitHub issue #46829 by Sean Swanson, showed the transition precisely:
- February 1 through March 5: 100% of cache tokens used 1-hour TTL, zero exceptions across 33 consecutive days.
- March 6-7: Transition period, 5-minute tokens reappear.
- March 8 onward: 5-minute tokens dominate (83-93% of cache writes).
The 1-hour cache costs more to write (2x base input price vs 1.25x) but the same to read (0.1x). The tradeoff depends on whether cached content is reused within the hour. Anthropic’s Jarred Sumner called it “ongoing optimization work” and argued the change is net cheaper because many requests are one-shot calls where the 1-hour write premium is wasted. He is probably right for API users who pay per token. He is probably wrong for subscription users who are quota-limited: more cache misses mean more cache_creation tokens, which burn quota faster regardless of per-token price.
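The break-even falls out of those multipliers directly. A sketch under two assumptions: every request in the session arrives more than five minutes after the previous one (so 5-minute entries always expire), and the cached prefix is the same size each turn:

```python
def cached_prefix_cost(n_requests, ttl_1h):
    """Relative cost of the cached prompt prefix over a session of
    n_requests turns, each separated by a 5-to-60-minute gap.
    Units: multiples of (prefix tokens x base input price)."""
    if ttl_1h:
        # One write at 2x base price, then every later turn is a 0.1x read.
        return 2.0 + 0.1 * (n_requests - 1)
    # Each gap outlives the 5-minute TTL, so every turn re-writes at 1.25x.
    return 1.25 * n_requests
```

For a one-shot call the 5-minute TTL is cheaper (1.25 vs 2.0), which is the case Sumner describes; by the second turn within the hour the 1-hour TTL already wins (2.1 vs 2.5), and the gap widens every turn after that. The load-bearing variable is the gap pattern: sessions whose gaps stay under five minutes keep the 5-minute cache warm and change the math entirely.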
Boris confirmed the architecture in a follow-up comment: the main agent uses 1-hour cache, subagents use 5-minute cache, and the client selects TTL per request based on expected reuse pattern. He also revealed a coupling that nobody had documented: when users disable telemetry, experiment gates are also disabled, and the default TTL falls to 5 minutes. Privacy-conscious users silently get worse cache behavior. A fix (changing the client-side default) was described as imminent.
The Source Code Leak
On March 31, an npm build error accidentally exposed Claude Code’s 510,000-line TypeScript codebase. Community analysis found two mechanisms that break prompt cache matching:
- Attestation data (anti-abuse proof) attached to every request contains values that differ per request, invalidating cache.
- Anti-distillation tools: a flag injects varying fake tool definitions into the system prompt, triggering cache invalidation.
The analysis also noted zero tests for 64,464 lines of code. This is relevant not as gossip but as structural context: the cache optimization work Anthropic describes as “ongoing” is happening in an untested codebase where silent behavioral changes are easy to introduce and hard to catch.
Boris’s Evolving Position, Continued
In the original post, I traced how Boris’s position shifted over a single day from “just change your settings” to “adaptive thinking is allocating zero reasoning on certain turns.” The arc continued.
On April 12-13, responding to the quota exhaustion issue (#45756, 708 points on HN), Boris identified prompt cache misses on 1M context windows as the primary cost driver. He proposed reducing the default context window from 1M to 400k tokens, and described plans for “better UX to make this visible and more intelligent truncation/pruning.”
He also stated, without elaboration: “We ruled out adaptive thinking, other kinds of harness regressions, model and inference regressions” as causes of the quota issue. This is consistent: the quota problem is a cache/cost issue, not a reasoning quality issue. But the two compound. Users who burn quota faster get fewer total turns, and if some of those turns also have shallow reasoning due to the adaptive thinking bug, the experience is doubly degraded.
The pattern across both incidents is the same: transparency arrives after community forensics, not before. Users with logging infrastructure (Ben Vanik’s stop-phrase hooks, Sean Swanson’s JSONL analysis, cnighswonger’s cache interceptor) diagnosed the problems. Anthropic’s public explanations came in response to evidence that was already public. This is not unusual for infrastructure providers, but it reinforces the design principle from the original post: observability at the boundary is not optional, because the provider’s internal state is opaque and changes without notice.
What the Original Post Got Right
The central thesis holds up better than I expected.
“Agent harnesses should assume reasoning quality is variable and sometimes misallocated.” The cache TTL regression adds a second axis of variability. It is not just thinking depth that fluctuates; the cost and latency characteristics of the underlying infrastructure also shift silently. The design response is the same: gate important transitions on verification, not on assumptions about internal state.
“The response is not ‘stuff the harness with more prompt text.’” The community continued generating CLAUDE.md instructions through April. None of the users who reported recovery attributed it to prompt changes. The users who recovered attributed it to CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1, /effort max, or shorter context windows. Infrastructure changes fixed infrastructure problems. Prompts did not.
“Prompts are suggestions. Hooks are enforcement.” Still the most load-bearing sentence in the piece. The cache TTL coupling with telemetry is a perfect illustration: users who thought they were making a privacy choice were unknowingly making a performance choice. No prompt instruction can compensate for 5-minute cache expiry on a 1M context window. Only infrastructure-level awareness of the tradeoff helps.
The three specific changes we made (effortLevel: high, showThinkingSummaries: true, claude-fixed-reasoning convenience script) were correctly scoped. None needed to be revised. The decision to not globally disable adaptive thinking was vindicated: the convenience script gives the human a session-level choice, while the pre-push gate contains bad turns regardless.
What the Original Post Missed
The cost dimension. The original post framed the problem entirely as reasoning quality. The cache TTL regression shows there is a parallel cost/quota story: silent infrastructure changes can make the same workflow burn resources faster without changing output quality at all. A complete harness design needs to account for both. You can have good reasoning quality and still burn your quota in 90 minutes if cache behavior changes underneath you.
The telemetry trap. The coupling between telemetry opt-out and degraded cache behavior was not known at the time. It is a specific instance of a general pattern: provider-side feature flags that bundle unrelated behaviors. Disabling telemetry should not change cache TTL. That it does is an implementation detail that leaks into user-facing quality.
Subagent cost amplification. The original post did not discuss subagent architecture. Community analysis showed subagents were sometimes receiving entire session contexts, and their 5-minute cache TTL means any pause longer than five minutes between subagent turns triggers full cache recreation. This is relevant to zat.env because the skill architecture spawns subagents for review, spec, and security passes. The cost of those subagent turns is not just the tokens they consume, but the cache writes they trigger.
What This Means for zat.env
The original post described what zat.env already had in place. Here is the updated picture, separating what was already mitigated from what remains exposed.
Already mitigated, no changes needed:
- effortLevel: high in settings.json overrides the silent medium default. Skills override to effort: max at critical checkpoints (spec, review, security, architecture). This layer is working as designed.
- showThinkingSummaries: true restores thinking visibility. Still useful as an ambient signal.
- The pre-push review gate catches bad code at the artifact boundary regardless of thinking depth, cache behavior, or any other invisible infrastructure change. This is the most durable defense and the one least affected by anything discovered since April 7.
- Small increments with test verification and the two-failure circuit breaker keep the blast radius of any single bad turn small. A cache miss that causes a shallow turn wastes one increment, not an entire session.
- The claude-fixed-reasoning convenience script remains correctly scoped as an opt-in tool, not a global default.
Still exposed, candidates for the next iteration:
The behavioral hooks deferred in the original post are now more strongly justified. The cache TTL data shows that turns after cache misses (any pause longer than 5 minutes for subagents, or longer than 1 hour for the main agent) are structurally different from cache-hit turns: the model reconstructs context from scratch, and the first turn after a miss is where shallow reasoning is most likely to appear. An edit-without-read hook would catch the specific failure mode (read:edit ratio collapse) that the original data showed.
Context window management is the new frontier. CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 reduces the cost of cache misses by 60% (400k vs 1M context). This is a blunt instrument, trading context depth for cache resilience, but it is the kind of tradeoff a harness should make explicitly rather than leaving to the provider’s default. Whether zat.env should set this globally or offer it as a configurable default is an open question.
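The 60% figure is just the linear scaling of re-write cost with context size; the per-token price below is an illustrative assumption, not Anthropic's actual rate, and it cancels out of the ratio:

```python
def miss_rewrite_cost(context_tokens, write_multiplier=1.25,
                      base_price_per_mtok=3.0):
    """Cost of re-writing the whole context after a cache miss.
    base_price_per_mtok is illustrative and cancels out of the
    savings ratio computed below."""
    return context_tokens / 1_000_000 * write_multiplier * base_price_per_mtok

full = miss_rewrite_cost(1_000_000)   # default 1M window
capped = miss_rewrite_cost(400_000)   # CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000
savings = 1 - capped / full           # 0.6: each miss costs 60% less
```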
The telemetry coupling should be documented. If a zat.env user disables telemetry (a reasonable privacy choice), they should know they are also getting 5-minute cache TTL. The install script could detect the telemetry setting and warn, or set an explicit cache TTL override when one becomes available.
The Broader Pattern
The original post’s concluding principle was: “The model will get better. The adaptive thinking bug will probably be fixed. But the structural property, that reasoning quality is variable and sometimes misallocated, will remain true for any adaptive system.”
Six days later, the principle extends: the structural property is not just about reasoning quality. It is about every dimension of the provider’s infrastructure that is invisible to the user: cache TTL, context window behavior, experiment gates, subagent routing, quota accounting. All of these change without notice. All of them affect the user’s experience. None of them are observable from outside unless you build the instrumentation.
The users who navigated both incidents successfully were the ones with logging, hooks, and session analysis tooling. The users who struggled were the ones who trusted the provider’s defaults and had no way to detect when those defaults changed. This is the same lesson the original post drew about thinking depth, applied one level deeper to the infrastructure underneath.
Build the observability. Gate the outputs. Assume the floor will move.
References (original):
- Vanik, B. (2026). Extended Thinking Is Load-Bearing for Senior Engineering Workflows. GitHub issue with quantitative session log analysis of Claude Code behavioral regression.
- Vanik, B. (2026). stop-phrase-guard.sh. Hook script catching ownership-dodging and premature-stopping patterns.
- Cherny, B. (2026). HN comment on adaptive thinking under-allocation. Confirms adaptive thinking allocating zero reasoning on turns that needed it.
- Hacker News (2026). Claude Code is unusable for complex engineering tasks with Feb updates. 1,209-point discussion thread.
- Carlini, N. (2026). Building a C Compiler with a Team of Parallel Claudes. The demonstration that verification loop quality determines the ceiling of autonomous output.
- Rajasekaran, P. (2026). Harness Design for Long-Running Application Development. Generator-evaluator architecture for extended autonomous sessions.
References (April 13 update):
- Swanson, S. (2026). Cache TTL silently regressed from 1h to 5m around early March 2026. Forensic analysis of 119,866 API calls showing the March 6 cache TTL transition.
- Sumner, J. (2026). Comment on cache TTL issue. Anthropic staff confirming the March 6 change was intentional and explaining per-request TTL selection.
- Cherny, B. (2026). HN comment on cache architecture. Describes 1h/5m TTL tiers, experiment gate coupling with telemetry, and planned env var overrides.
- Hacker News (2026). Pro Max 5x Quota Exhausted in 1.5 Hours Despite Moderate Usage. 708-point discussion thread on cache miss costs and quota exhaustion.
- Reddit r/ClaudeCode (2026). Boris is claiming that Claude Code has a one hour cache. Community discussion of Boris’s cache TTL claims and independent verification.