
Claude Always Makes Coding Mistakes? These 12 Rules Reduced the Error Rate to 3%

From 41% to 3%, Karpathy's 4 Rules Are Not Enough
Original Title: Karpathy's 4 CLAUDE.md rules cut Claude mistakes from 41% to 11%. After 30 codebases, I added 8 more
Original Author: @Mnilax
Translator: Peggy, BlockBeats


Editor's Note: In January 2026, Andrej Karpathy's critique of Claude's coding practices highlighted a seemingly minor, yet crucial file in the AI programming workflow: CLAUDE.md. Forrest Chang subsequently distilled these issues into 4 behavioral rules, attempting to constrain common mistakes made by Claude during coding: silent assumptions, over-engineering, collateral damage to unrelated code, and a lack of clear success criteria.


However, several months later, the use cases for Claude's code had expanded beyond just "make the model write some code." With multi-step Agents, hook chaining, skill loading, and collaboration across multiple codebases becoming the norm, new failure modes began to emerge: models spiraling out of control in long tasks, passing tests without validating real logic, silent errors skipped during migration, and incorrect mixing of different code styles.


The author of this article tested 30 codebases within 6 weeks and added 8 additional rules on top of Karpathy's original 4 rules, attempting to address new challenges in AI programming transitioning from single-shot completions to Agent-based collaboration.


The following is the original text:


In late January 2026, Andrej Karpathy posted a thread criticizing Claude's coding style. He pointed out three typical issues: making silent, erroneous assumptions instead of asking for clarification; over-complicating solutions; and touching code that should not have been altered.


Forrest Chang saw this tweet thread, distilled the complaints into 4 behavioral rules, wrote them into a separate CLAUDE.md file, and published it on GitHub. The project received 5,828 stars on its first day online, was forked 60,000 times in two weeks, and now has 120,000 stars, becoming the fastest-growing single-file code repository of 2026.



Subsequently, I spent 6 weeks testing it with 30 codebases.


These 4 rules are indeed effective. The error rate that used to occur in about 40% of cases has now dropped to less than 3% in tasks suitable for these rules. However, the issue is that this template was originally designed to address errors Claude encountered while coding in January.


By May 2026, the issues faced by Claude Code's ecosystem were different: conflicts between Agents, hook chain triggering, skill loading conflicts, and interruptions in cross-session multi-step workflows, among others.


Therefore, I added 8 more rules. Below is the full 12-rule version of CLAUDE.md: why each one is worth adding and where the original Karpathy template quietly fails in those 4 places.


If you want to skip the explanation and use it directly, the full file is at the end of the document.


Why This Matters


Claude Code's CLAUDE.md is the most underestimated file in the entire AI programming stack. Most developers make one of three mistakes with it:


First, treating it as a junk drawer, cramming all their habits into it, eventually inflating it to over 4000 tokens, causing the rule compliance rate to drop to 30%.


Second, simply not using it at all and starting fresh every time. This leads to a waste of 5 times the tokens and a lack of consistency across different sessions.


Third, copying a template and then ignoring it. It may be effective for two weeks, but as the codebase changes, it silently becomes ineffective without you realizing it.


The Anthropic official documentation is very clear: CLAUDE.md is essentially just advisory. Claude follows it about 80% of the time. Once it exceeds 200 lines, the compliance rate noticeably decreases because critical rules get drowned out in noise.


Karpathy's template solved this issue: one file, 65 lines, 4 rules. That is the minimum benchmark.


But the upper limit can be higher. By adding the following 8 rules, it covers not only the code writing problem Karpathy complained about in January 2026 but also Agent orchestration issues that only arose in May 2026 — issues that did not exist when the original template was written.


The Original 4 Rules


If you haven't seen Forrest Chang's repository yet, start with this baseline version:


Rule 1: Think before coding.


Do not make silent assumptions. Clearly state your assumptions, expose trade-offs, ask questions before guessing, and proactively dissent when there is a simpler solution.


Rule 2: Prioritize simplicity.
Use the least amount of code that solves the problem. Avoid adding imaginary features and refrain from designing abstraction layers for one-time code. If a senior engineer would find it overly complex, simplify it.


Rule 3: Surgical code modifications.
Only change what is necessary. Avoid impulsively "optimizing" neighboring code, comments, or formatting. Do not refactor things that are not broken. Maintain consistency with the existing style.


Rule 4: Goal-oriented execution.
Define success criteria first, then iterate until validation is complete. Rather than telling Claude how to do each step, describe what a successful outcome should look like and let it iterate autonomously.


These four rules address approximately 40% of the failure patterns I observed in unsupervised Claude Code sessions. The remaining 60% hide in the gaps below:



The 8 additional rules I introduced, and why


Each rule stems from a real moment when Karpathy's original 4 rules were no longer sufficient. For each one, I describe that scenario first and then give the corresponding rule.


Rule 5: Do not let the model perform non-linguistic tasks


You can use Claude for: classification, drafting, summarization, and extracting information from unstructured text. Do not use Claude for: routing, retrying, status code handling, and deterministic transformations. If a status code has already answered the question, let regular code answer it.


Karpathy's rules did not cover this aspect. Consequently, the model started making decisions that deterministic code should handle: whether to retry an API call, how to route a message, when to escalate. The result was judgment that shifted from week to week: an unstable if-else billed at $0.003 per token.


It was a moment like this: a piece of code called Claude to "decide whether to retry on a 503 error." It worked well for two weeks, then became unstable because the model started treating the request body as part of the decision context. The retry strategy became random because the prompt itself was random.
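As a minimal sketch of what Rule 5 means in practice (all names and thresholds here are illustrative, not from the original post): retry decisions keyed on status codes belong in plain, deterministic code, with the model reserved for genuine language tasks.

```python
# Hypothetical sketch: a deterministic retry policy in plain code.
# A status code has already answered the question; no LLM call needed.
RETRYABLE = {429, 500, 502, 503, 504}  # transient server/network errors

def should_retry(status_code: int, attempt: int, max_attempts: int = 3) -> bool:
    """Decide deterministically whether to retry; never ask a model this."""
    return status_code in RETRYABLE and attempt < max_attempts

def backoff_seconds(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ..."""
    return base * (2 ** attempt)
```

The same behavior every week, auditable in a code review, and free.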


Rule 6: Establish a Hard Token Budget, No Exceptions


Individual Task Budget: 4,000 tokens. Single Session Budget: 30,000 tokens. If a task nears the budget limit, summarize the current state and restart. Do not push through. It is better to explicitly acknowledge a budget overrun issue than to silently overspend.


A CLAUDE.md without a budget constraint is equivalent to a blank check. Each iteration has the potential to spiral out of control, becoming a 50,000-token context dump. The model will not stop itself.


It was a moment like this: a debugging session had run for 90 minutes. The model kept iterating on the same 8KB snippet of error output, gradually forgetting which fixes it had already tried. Forty messages in, it was suggesting solutions I had already ruled out. With a token budget, the session would have been cut off around minute 12.
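One way to make Rule 6 mechanical rather than aspirational is a small budget tracker that forces a summarize-and-restart instead of letting a session drift. This is a hypothetical sketch using the article's 4,000/30,000 limits; the class and action names are my own.

```python
# Hypothetical sketch of Rule 6: a hard token budget that returns an
# explicit action instead of silently overspending.
class TokenBudget:
    def __init__(self, task_limit: int = 4_000, session_limit: int = 30_000):
        self.task_limit = task_limit
        self.session_limit = session_limit
        self.task_used = 0
        self.session_used = 0

    def spend(self, tokens: int) -> str:
        """Record spend and say what to do next: continue, checkpoint, or restart."""
        self.task_used += tokens
        self.session_used += tokens
        if self.session_used >= self.session_limit:
            return "restart_session"   # summarize state, open a fresh session
        if self.task_used >= self.task_limit:
            return "checkpoint_task"   # summarize progress, reset the task counter
        return "continue"

    def reset_task(self) -> None:
        self.task_used = 0
```

The point is that exceeding the budget produces a loud, named outcome, not a 50,000-token context dump.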


Rule 7: Expose Conflicts, Do Not Compromise


If two existing patterns in the codebase conflict, do not mix them. Choose one, preferably the newer or better-tested one, explain the rationale, and mark the other for future cleanup. "Average code" that tries to satisfy both sets of rules is the worst kind of code.


When two parts of the codebase are at odds, Claude tries to please both sides, resulting in a mess of inconsistent code.


It was a moment like this: a codebase had two error-handling patterns, one using async/await with explicit try/catch, the other a global error boundary. Claude's new code used both. Error handling was duplicated, and it took me 30 minutes to figure out why errors were being silenced twice.
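The "silenced twice" failure is easy to reproduce. Here is a hypothetical Python sketch of the same shape: a global catch-all wrapper (pattern A) mixed with a local suppressing try/except (pattern B), so the error never reaches the boundary and the caller gets None either way.

```python
# Hypothetical sketch of the Rule 7 failure: mixing two error-handling
# patterns means a failure is swallowed twice and never surfaces.
handled = []

def global_error_boundary(fn):
    """Pattern A: a catch-all wrapper that logs and suppresses errors."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            handled.append(("boundary", str(e)))
            return None
    return wrapper

@global_error_boundary
def fetch_user(user_id):
    # Pattern B: a local try/except that ALSO suppresses -- the mix in question.
    try:
        raise ValueError(f"no user {user_id}")
    except ValueError as e:
        handled.append(("local", str(e)))
        return None  # the boundary never sees the error

result = fetch_user(42)
# Pick ONE pattern: either the boundary catches, or the local handler
# re-raises. Handling in both places hides failures from both.
```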


Rule 8: Read Before You Write


Before adding code to a file, read the file's exports, direct callers, and any obviously relevant shared utility functions. If you do not understand why the existing code is organized the way it is, ask questions first; do not add to it directly. "This seems unrelated to me" is the most dangerous phrase in this codebase.


Karpathy's "surgical code modification" told Claude not to touch adjacent code. But it never told Claude to first understand the adjacent code. Without this rule, Claude writes new code that conflicts with code 30 lines away.


It was a moment like this: Claude added a new function right next to an existing one that did exactly the same job, unaware the original existed. Because of import order, the new function shadowed the old one, which had been the de facto standard for 6 months.
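Shadowing like this is silent in most dynamic languages. A hypothetical Python sketch (the helper name and contract are invented for illustration): the later definition wins, and every existing caller now gets the new, incompatible behavior.

```python
# Hypothetical sketch of the Rule 8 failure: a duplicate helper silently
# shadows the original because of definition/import order.

def format_price(cents):            # existing helper, the de facto standard
    return f"${cents / 100:.2f}"    # format_price(1999) -> "$19.99"

# ... much later in the file, added without reading what was already there ...

def format_price(amount):           # same name, different contract
    return f"{amount:.2f} USD"      # all callers now silently get this one

price = format_price(1999)          # old callers passed cents; contract broken
```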


Rule 9: Testing is not optional, but the test itself is not the goal


Every test must encode "why this behavior is important," not just "what it does." A test like `expect(getUserName()).toBe('John')` is worthless if the function simply returns a hard-coded value. If you can't write a test that would fail when the business logic changes, the test itself is wrong.


Karpathy's "Execute towards a goal" implies that testing can serve as a success criterion. However, in practice, Claude would consider "passing the test" as the sole goal, leading to code that passes shallow tests but breaks other things.


It was at that moment when Claude wrote 12 tests for an authentication function, and they all passed. However, the authentication logic in the production environment was broken. Those tests were merely validating that the function "returned something," not whether it returned the correct thing. The function was able to pass the tests because it was returning a constant.
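The constant-returning failure is worth seeing concretely. A hypothetical sketch (function and inputs invented for illustration, assuming users 1 and 2 have different names): the shallow "returns something" test passes on a broken function, while a test that encodes the behavior catches it.

```python
# Hypothetical sketch of Rule 9: shallow vs. behavioral tests.
def get_user_name(user_id):
    return "John"   # broken: ignores user_id entirely, returns a constant

# Shallow test: passes even though the function is wrong.
shallow_pass = get_user_name(1) is not None

# Behavioral test: distinct users should yield distinct names
# (assuming users 1 and 2 differ). This one catches the bug.
behavioral_pass = get_user_name(1) != get_user_name(2)
```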


Rule 10: Long-running operations require checkpoints


In multi-step tasks, after completing each step, summarize what has been done, what has been validated, and what is left to do. Do not continue from a state that you cannot clearly recount to me. If you find yourself lost, stop, and restate the current state.


Karpathy's default template for interaction is one-shot. However, real-world Claude Code work often involves multi-step processes: refactoring across 20 files, building a feature in a session, debugging across multiple commits. Without checkpoints, a misstep in one phase could jeopardize all the progress made so far.


It was a moment like this: a 6-step refactoring went wrong at step 4. By the time I noticed, Claude had already completed steps 5 and 6 on top of the faulty state. Undoing and redoing took longer than redoing the entire task from scratch. With checkpoints, the step-4 problem would have been caught immediately.
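The checkpoint discipline can be expressed as a tiny driver loop. This is a hypothetical sketch (step names and the validation interface are mine): after each step, record what was validated, and stop rather than continue from an unverified state.

```python
# Hypothetical sketch of Rule 10: validate and record a checkpoint after
# each step, so a failure at step 4 cannot poison steps 5 and 6.
def run_with_checkpoints(steps):
    """steps: list of (name, fn); fn returns True if the step validated."""
    checkpoints = []
    for name, fn in steps:
        ok = fn()
        checkpoints.append({"step": name, "validated": ok})
        if not ok:
            return checkpoints  # stop: never continue from an unverified state
    return checkpoints

log = run_with_checkpoints([
    ("extract", lambda: True),
    ("transform", lambda: False),  # fails validation here
    ("load", lambda: True),        # never runs
])
```

The checkpoint log doubles as the "restate the current state" summary the rule asks for.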


Rule 11: Convention Over Configuration


If the codebase uses snake_case, and you prefer camelCase: use snake_case. If the codebase uses class-based components, and you prefer hooks: use class-based components. Differences of opinion are a separate discussion. Within a codebase, consistency takes precedence over individual preferences. If you genuinely believe a certain convention is harmful, speak up clearly. Do not silently fork the path.


In a codebase with an established pattern, Claude likes to introduce its own style. Even if that style is "better," introducing a second pattern is worse than sticking with any single one.


There was a moment when Claude introduced hooks in a React codebase based on class components. It did work. However, it also disrupted the codebase's existing testing pattern because those tests relied on componentDidMount. It took half a day to remove it and rewrite the code.


Rule 12: Fail Explicitly, Not Silently


If you are not sure if something has succeeded, say so explicitly. If 30 records were silently skipped, you cannot say "migration complete." If you skipped any tests, you cannot say "tests pass." If you didn't validate the edge cases I asked for, you cannot say "feature is available." Default to exposing uncertainty rather than hiding it.


Claude's most costly failures are often those that appear to be successful failures. A function "runs," but returns incorrect data; a migration "completes," but skips 30 records; a test "passes," but only because the assertion itself is wrong.


There was a moment when Claude declared a database migration "successfully completed." In reality, it had silently skipped the 14% of records that triggered a constraint conflict. The skips were logged but never surfaced. We only discovered the problem 11 days later, when report data started looking anomalous.
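Rule 12 amounts to a reporting contract: partial success must be named as partial, in the return value, not buried in a log. A hypothetical sketch (the migration and constraint are toys invented for illustration):

```python
# Hypothetical sketch of Rule 12: a migration that surfaces skipped
# records loudly instead of claiming blanket success.
def migrate(records, insert):
    migrated, skipped = 0, []
    for r in records:
        try:
            insert(r)
            migrated += 1
        except ValueError as e:       # e.g. a constraint conflict
            skipped.append((r, str(e)))
    status = "complete" if not skipped else "partial"
    # The caller sees the failure in the result, not just in a log file.
    return {"status": status, "migrated": migrated, "skipped": len(skipped)}

seen = set()
def insert_unique(r):
    """Toy insert that raises on duplicates, like a UNIQUE constraint."""
    if r in seen:
        raise ValueError("constraint conflict")
    seen.add(r)

report = migrate([1, 2, 2, 3], insert_unique)
```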


Data Results


Over a period of 6 weeks, I tracked the same set of 50 representative tasks, covering 30 code repositories, and tested three configurations.
The three configurations, with the figures as quoted in the text:

| Configuration | Error rate | Rule compliance |
| --- | --- | --- |
| No CLAUDE.md | 41% | n/a |
| Karpathy's 4 rules | 11% | 78% |
| Full 12 rules | 3% | 76% |



The error rate is the percentage of tasks that required correction or rewriting to match the original intent. Errors counted include: silent assumptions, over-engineering, unrelated breakage, silent failures, convention violations, unresolved pattern conflicts, and missed checkpoints.


The compliance rate refers to the probability that when a rule applies, Claude will explicitly apply that rule.


The truly interesting result is not just the error rate dropping from 41% to 3%. More importantly, expanding from 4 rules to 12 rules, with almost no increase in compliance burden. The compliance rate only dropped from 78% to 76%, but the error rate decreased by 8 percentage points. The new rules cover failure patterns that the original 4 rules did not address, and they do not compete for the same attention budget.



Where the Karpathy Template Quietly Breaks Down


Even without adding new rules, the original 4-rule template falls short in at least 4 areas.


First, long-running Agent tasks.
Karpathy's rules are mainly focused on the moment Claude is writing code. But what happens when Claude is running a multi-step pipeline? The original template lacks budget rules, checkpoint rules, and "loud failure" rules. As a result, the pipeline slowly drifts.


Second, multi-codebase consistency.
By default, "match existing style" only assumes one style. But in a monorepo with 12 services, Claude must decide which style to match. The original rules do not tell it how to choose. So it either selects randomly or evenly blends several styles.


Third, test quality.
"Goal-oriented execution" treats "tests pass" as success but never requires that the tests themselves be meaningful. The result is that Claude writes tests that verify almost nothing, yet those tests make it falsely confident.


Fourth, the difference between production and prototype stages.
The same 4 rules prevent over-engineering in production code but can hinder prototyping, where you sometimes need a 100-line exploratory scaffold just to find the direction. Karpathy's "simplicity first" over-triggers easily on early-stage code.


These 8 new rules are not meant to replace Karpathy's original 4 rules but to fill their gaps: the original template corresponds to a more auto-complete-oriented code writing scenario in January 2026, while by May 2026, Claude Code has entered a more agent-driven, multi-step, multi-codebase collaborative environment, facing different challenges.



Methods That Did Not Work


Before settling on these 12 rules, I tried some other approaches.


Rules I found on Reddit / X.
Most of them either just rephrased Karpathy's original 4 rules or were domain-specific rules that could not be generalized, such as "Always use Tailwind classes." In the end, all of them were removed.


More than 12 rules.
I tested up to 18 rules. Once it went over 14 rules, compliance dropped from 76% to 52%. The 200-line limit is real. Beyond this length, Claude starts pattern-matching for "here is a rule" instead of actually reading each one.


Rules that relied on certain tools.
For example, "Always use eslint" would become ineffective if eslint was not installed in the project, and it would fail silently. Later, I changed it to expressions independent of specific tools, such as changing "Use eslint" to "Follow the style already enforced in the codebase."


Examples in CLAUDE.md instead of rules.
Examples consume more context than rules. The context consumed by three examples is roughly equivalent to that of 10 rules, and Claude tends to overfit examples. Rules are abstract; examples are specific. Therefore, rules should be used.


"Be a little more careful," "Think carefully," "Focus a bit more."
These are all noise. Compliance with such instructions dropped to around 30% because they could not be verified. I later replaced them with more specific imperative rules, such as "Clearly state assumptions."


Telling Claude to act like a "Senior Engineer."
This was ineffective. Claude already believed it was acting like a senior engineer. The real issue is not whether it thinks so but whether it acts that way. Imperative rules can narrow this gap, identity prompts cannot.


Full Version of 12 Rules in CLAUDE.md


Below is the complete version that can be directly copied and pasted for use.


[The full CLAUDE.md file was embedded as a Feishu document in the original post; the 12 rules themselves are given in full in the sections above.]


Save it as CLAUDE.md in the root directory of the repository. Below these 12 rules, add project-specific rules such as tech stack, test commands, error patterns, etc. Keep the overall length under 200 lines; once exceeded, the compliance rate will significantly drop.


How to Install


Just two steps:


1. Append Karpathy's 4 foundational rules to your CLAUDE.md
curl https://raw.githubusercontent.com/forrestchang/andrej-karpathy-skills/main/CLAUDE.md >> CLAUDE.md


2. Copy and paste rules 5–12 from this article below


Save the file in the root directory of your repository. The >> here is important: it appends to the existing CLAUDE.md instead of overwriting the project-specific rules you've already written.


Mental Model


CLAUDE.md is not a wishlist; it's a behavioral contract designed to plug specific failure modes you have already observed.


Each rule should answer one question: What error does it prevent?


Karpathy's 4 rules guard against the failure modes he saw in January 2026: silent assumptions, overengineering, unrelated breakage, weak success criteria. These are foundational; don't skip them.


My additional 8 rules guard against new failure modes that emerged after May 2026: unconstrained agent loops, checkpoint-less multi-step tasks, tests that superficially test but miss key logic, and the problem of packaging silent failures as silent successes. These are incremental patches.


Of course, the specific impact varies. If you don't run multi-step pipelines, Rule 10 may not be as critical for you. If your codebase already has a single uniform style and is lint-enforced, Rule 11 is redundant. After reading these 12 rules, keep those that truly correspond to the mistakes you have actually made, and delete the rest.


A 6-rule version of CLAUDE.md tailored to real failure modes is better than a version with 12 rules, 6 of which you will never use.


Conclusion


That tweet by Karpathy in January 2026 was essentially a rant. Forrest Chang turned it into 4 rules. Ultimately, 120,000 developers starred that outcome. Yet most people today still only use those 4 rules.


The model has advanced, and the ecosystem has evolved. Concepts like Multi-Step Agents, Hook Chain Triggers, Skill Loading, and Multi-Repository Collaboration were not in existence when Karpathy tweeted that. The original 4 rules did not address these issues. They were not wrong; they were simply incomplete.


Hence the 8 new rules. After 6 weeks of testing across 30 repositories, the error rate dropped from 41% to 3%.


Tonight, bookmark this article and paste these 12 rules into your CLAUDE.md. If it saves you a week of Claude detours, feel free to share.


[Original Article Link]



