Original Title: "The Mirage of the Three Provinces and Six Ministries: Why the 'Virtual Company'-Style Multi-Agent Architecture Doesn't Work in Engineering"
An architecture concept widely circulated in the AI community is leading many teams astray.
Let's Start with the Conclusion
If you are considering naming multiple AI agents as "product managers," "architects," "test engineers," and having them pass documents and collaborate on tasks like company departments, please stop.
This pattern may seem intuitive and logically sound, but it has fundamental engineering flaws. More importantly, none of Anthropic, OpenAI, or Google adopted this pattern when building their own agent systems.
This is not a coincidence.

This metaphor refers to a widely popular family of multi-agent designs, known by different names across frameworks and articles: role-based agents, virtual teams, CrewAI-style division of labor, MetaGPT-style organization. This article calls it "the Three Provinces and Six Ministries" model.
The core pattern is: decompose a complex task by function, with each agent playing a role — the PM owns requirements, the Tech Lead owns architecture, Dev owns implementation, QA owns testing. Tasks flow between agents like on an assembly line.
This pattern looks very appealing in diagrams. It satisfies human intuition for "division of labor" and makes the concept of an "AI team" tangible and explainable. Frameworks like CrewAI have accumulated a large user base due to this.
The issue is that this addresses human bottlenecks, not AI bottlenecks.
Humans need division of labor because:
· An individual's attention is limited and cannot process all information simultaneously
· Humans face professional barriers to entry, and switching specialties is costly
· Coordination and communication are needed between individuals
But LLMs have completely different characteristics:
· The same model can write a PRD and write code; there is no "professional boundary"
· The model's bottleneck is not breadth of attention but depth of reasoning and completeness of information
· Models share no "culture" or tacit "understanding" to compensate for information loss
Labeling an Agent a "product manager" will not make it more professional — it will make it refuse to cross boundaries. An Agent pigeonholed as a "test engineer" may overlook architectural issues because they are "not within my scope of responsibility." The most valuable reasoning often happens at the boundaries, yet the Three Provinces and Six Ministries model blocks this possibility at the system level.
Role-playing creates false boundaries. This is the first issue.

In the Three Provinces and Six Ministries model, Agent A produces a document and passes it to Agent B.
This process conveys conclusions, not reasoning processes.
When B receives the document, it reinterprets the content and rebuilds the context. The original intent degrades, implicit assumptions are lost, and each transfer accumulates errors. The longer the workflow, the more the final output tends to be "locally correct but globally drifted" — each node seems reasonable, but the whole has deviated from the initial goal.
Human organizations rely on meetings, culture, and informal communication to compensate for this information loss. Agents lack these mechanisms.
Here is a common rebuttal: Aren't the solutions of the three vendors (progress.txt, spec file, runbook) also "sending files"? What is the difference?
The difference lies in who is writing, writing to whom, and how updates are done.
The information flow in the Three Provinces and Six Ministries model is a unidirectional handover between roles: A finishes writing and hands off to B, B does not look back, and A never learns how B used the document. Information is compressed into conclusions, the reasoning process is lost, and the handoff is the breakpoint.
An external state file is the incremental log of the same task: the executor appends to the same record at each checkpoint, and the next session reads the full history of the task, not the output conclusions of a previous "colleague." The writer and the reader of the state are the same role at different points in time. The information is not compressed and passed on but continuously accumulated.
This difference determines whether the inference chain can be kept continuous across sessions.
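The contrast above can be sketched in a few lines. The `TaskLog` class below is purely illustrative (the name, the JSONL file format, and the checkpoint fields are assumptions, not any vendor's API); it shows the incremental-log pattern: each session replays the full history first, then appends its own checkpoint, so nothing is ever compressed into a one-way handoff.

```python
# Illustrative sketch of an append-only task log (names and format are
# hypothetical, not taken from Anthropic/OpenAI/Google tooling).
import json
from datetime import datetime, timezone
from pathlib import Path

class TaskLog:
    """Append-only task log: each session reads the FULL history, then
    appends its own checkpoint. Writer and reader are the same role at
    different times; nothing is compressed into a handoff document."""

    def __init__(self, path: Path):
        self.path = path

    def read_history(self) -> list[dict]:
        # The next session starts by replaying everything so far.
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]

    def append(self, entry: dict) -> None:
        # Append, never overwrite: earlier reasoning stays recoverable.
        entry["ts"] = datetime.now(timezone.utc).isoformat()
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")
```

In this pattern the "next session" consumes the same record the "previous session" produced, which is exactly what distinguishes an incremental log from a role-to-role document handover.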
A large number of tokens are wasted on "handoff files" between agents instead of being used for actual reasoning. What you end up with is a system simulating corporate behavior rather than a problem-solving system.
Notably, when Anthropic, OpenAI, and Google actually built their production-grade agent systems, their engineering documentation hardly contains any mention of "role-playing" or "departmental division."
Anthropic: Context Engineering + Explicit State Files
Anthropic internally upgraded "Prompt Engineering" to "Context Engineering": the question is not how to write a good prompt but what kind of token configuration can best generate the desired behavior.
When building the Claude Code and Research system, they faced the core challenge that the agent must work in discrete sessions, with each new session having no memory of what happened before. Their metaphor is that of "shift engineers" — each new shift engineer knows nothing about the previous shift's work.
The solution is not to have the agent play different roles but:
· claude-progress.txt: a cross-session work log that the agent updates at the end of each session, which is then read at the start of the next session
· Git history: serving as a state anchor, recording each incremental change
· Initializer Agent: Runs only in the first session to set up the environment, expand the feature list, and prepare the runbook for use by all subsequent sessions

Key Insight: The continuity of reasoning chains does not rely on the model "remembering," but on anchoring to explicit external states.
They also found that hardcoding the "model capability assumption" into the harness was risky. Sonnet 4.5 exhibited "context anxiety" — it would wrap up early when approaching the context limit, so a context reset was added to the harness.
However, this behavior disappeared in Opus 4.5, and the reset turned into dead weight. This indicates that the harness needs to evolve with model iterations, and any "permanent solution" is merely a current-stage engineering compromise.
In a multi-agent Research system, Anthropic's architecture follows an orchestrator-worker model: a lead agent decomposes tasks, coordinates subagents, and subagents concurrently explore different directions, with results flowing back to the lead agent for synthesis.
They discovered that the token consumption itself explained 80% of the performance delta — the value of multi-agent systems lies not in "division of labor" but in covering a larger search space with more tokens.
There is an easily confused point here: Anthropic's subagents may look like "division of labor," but they are fundamentally different. The Three Provinces and Six Ministries system is Taylorist division of labor — different roles perform different tasks: the PM finishes and passes to Dev, Dev finishes and hands to QA, with each role processing only one segment of the pipeline.
Anthropic's subagents represent functional parallelism — multiple agents of the same nature simultaneously explore different directions, without a "next baton," and all results converge back to the same orchestrator for synthesis. The former is a relay race, and the latter is casting a net to fish simultaneously.
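The "casting a net" shape can be made concrete with a small sketch. Here `call_agent` is a hypothetical stand-in for a real LLM call; the point is the structure: same kind of agent fanned out over independent directions, with every result flowing back to one orchestrator that holds the full intent, and no "next baton."

```python
# Sketch of orchestrator-worker parallelism (not a relay pipeline).
# call_agent is a placeholder for an LLM call; names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def call_agent(direction: str) -> str:
    # Placeholder: a real system would run a subagent exploring
    # this direction and return its findings.
    return f"findings on {direction}"

def orchestrate(goal: str, directions: list[str]) -> str:
    # Fan out: agents of the same nature explore different directions
    # concurrently. No agent hands its result to a "next" agent.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(call_agent, directions))
    # Fan in: the orchestrator, which holds the complete intent,
    # synthesizes all results itself.
    return f"{goal}: " + "; ".join(results)
```

Contrast with a pipeline, where each stage would see only its predecessor's compressed output; here the synthesizer sees everything.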

OpenAI: Spec Files + Runbooks for Enduring Tasks
OpenAI's principle for enduring tasks is more direct: plan for continuity from the start of the task.
In their Codex experiment, engineers gave the agent a spec file (a frozen target to prevent the agent from "doing something very impressive but in the wrong direction"), had it generate a milestone-based plan, and then instructed the agent on how to operate using a runbook file. This runbook also served as a shared memory and audit log.
Result: GPT-5.3-Codex ran continuously for about 25 hours, completing a full design tool while maintaining coherence throughout.
Server-side compaction is the default primitive, not an emergency fallback. In multi-step tasks, the `previous_response_id` allows the model to continue working in the same thread, rather than rebuilding the context each time.
They also introduced the concept of Skills — reusable, versioned instruction sets mounted in a container, providing the agent with a stable operational framework for specific tasks. This is not a "role"; it's tools and operational procedures, which are fundamentally different.
Google: 1M Context + Context-driven Development
Google's direction is wide window scaling: Gemini's 1M token context is a clear differentiation strategy. Their reasoning is that previously forced techniques like RAG slicing, discarding old messages, etc., can be replaced by "directly putting them in" with a large enough window.
But they also admit that this is not sufficient. Google introduced the Conductor extension in Gemini CLI, with a core idea similar to Anthropic's: moving project intent out of chat windows and into persistent Markdown files in the codebase. The philosophy is to "not rely on unstable chat records but on formal spec and plan files."
Gemini 3 also introduced the Thought Signatures mechanism: saving key nodes of the reasoning chain during a long session to prevent "reasoning drift" — the issue of inconsistency in logic over a long context.
From the engineering practices of the three companies, several common principles can be distilled:
The reasoning chain cannot break; it can only fork and then merge. The correct use of multiple Agents is not a pipeline but rather a main agent holding the complete intent, where sub-calls are for deep diving into a specific sub-issue, with results flowing back to the main agent instead of being passed to the next agent.
Explicit external state, not relying on the model to remember. progress.txt, Git history, spec files, databases — the form is not important; the principle is that key nodes of the reasoning chain must be externalized into persistent storage and not rely on the model to "remember" within the context window.
The value of multiple Agents lies in parallel coverage, not division of labor. Anthropic Research's system conclusion is clear: performance improvement mainly comes from "spending more tokens" rather than from "more reasonable division of labor." Multiple Agents are suitable for breadth-first type tasks — scenarios that require simultaneously exploring multiple independent directions. They are not suitable for scenarios that require continuous reasoning and deep contextual dependencies.

The validating Agent is a challenger, not a successor. If multiple Agents are used for quality control, the correct design is for one Agent to specifically look for another Agent's issues, rather than "passing on work results." Adversarial inspection, not pipeline transmission.
Tools are tools, not roles. It is far more important to give an Agent what tools (bash, file I/O, search, code execution) it has than to label it a specific role. The tools determine what an Agent can do; role labels only constrain what it is willing to do.
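A minimal sketch of "tools, not roles": the agent below is characterized entirely by the tool functions it can call, with no persona label constraining what it is willing to do. All names (`run_tests`, `read_file`, `make_agent`) are illustrative placeholders, not a real framework's API.

```python
# Sketch: an agent defined by its capability surface (tools), not a role.
# Tool functions here are trivial placeholders for real implementations.
from typing import Callable

def run_tests(target: str) -> str:
    return f"tests for {target}: ok"   # placeholder for a real test runner

def read_file(path: str) -> str:
    return f"<contents of {path}>"     # placeholder for real file I/O

def make_agent(tools: dict[str, Callable[[str], str]]):
    # What this agent CAN do is exactly its tool dict; nothing tells it
    # which concerns are "outside its department."
    def agent(tool_name: str, arg: str) -> str:
        return tools[tool_name](arg)
    return agent
```

The same agent can inspect code and run tests; widening its abilities means adding a tool, not renaming its job title.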
Why did the Three Provinces and Six Ministries model become popular?
Because it is easily explainable.
“This Agent is the PM, and that one is the QA” — this sentence is understandable to anyone. It satisfies humans' desire for explainability in AI systems and also fulfills management's imagination of “AI working like a team.”

It also looks good. When drawn in a flowchart with departments, arrows, and handoffs, it is very intuitive.
However, being easy to explain and display nicely is one thing, and whether it is sound in engineering terms is another.
The deeper reason is that most teams adopting this model have not yet truly faced the problem of information loss when passing context among multiple Agents. Their tasks may not be complex enough, or the issue may be masked by other factors. Only when task complexity increases and the system starts exhibiting strange "locally right, globally wrong" behavior does the problem get exposed.
The best multi-Agent system is not like a company. It is more like a thinker's multiple drafts — the same mind engaging in reasoning on different dimensions, ultimately converging into a coherent conclusion.
From this principle:
Do not ask “How many Agents do I need,” ask “What is the information dependency structure of this task.”
If the task requires continuous reasoning, with a high context dependency (such as writing a complex feature design document), a single Agent + good context engineering is usually superior to multiple Agents.
If the task requires exploring multiple independent directions simultaneously (such as researching different modules of 10 competitors at the same time), parallelizing multiple Agents is reasonable — each subagent's task is independent of the others, so the cost of information loss is minimal. This is precisely what the Anthropic Research finding points to, with token volume explaining 80% of the performance difference: it is not division of labor that helps, but broader search coverage.
If the task spans multiple sessions, an external state file is necessary. An effective state file should contain four types of information:
· Task objective (immutable; read at the start of each session to prevent drift)
· Completed steps (append-only, never overwritten, retaining the full history)
· Current status (overwritten in place, reflecting the latest state)
· Known pitfalls (append-only, so the next session avoids repeating missteps)
These four types of information are maintained separately, but together they form the complete context needed for the "next self."
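The four sections can be sketched as a small structure whose update rules differ per field: append-only for history and pitfalls, overwrite-in-place for status, and a frozen objective. The class name, field names, and Markdown layout below are illustrative assumptions, not a vendor format.

```python
# Sketch of the four-section state file (names and layout are hypothetical).
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TaskState:
    objective: str                                        # immutable; read first to prevent drift
    completed: list[str] = field(default_factory=list)    # append-only history
    status: str = ""                                      # overwritten in place
    pitfalls: list[str] = field(default_factory=list)     # append-only warnings

    def checkpoint(self, step: str, status: str, pitfall: str | None = None) -> None:
        self.completed.append(step)   # never rewrite history
        self.status = status          # the latest state replaces the old one
        if pitfall:
            self.pitfalls.append(pitfall)

    def render(self) -> str:
        # What the "next self" reads at session start.
        return "\n".join([
            f"# Objective\n{self.objective}",
            "# Completed\n" + "\n".join(f"- {s}" for s in self.completed),
            f"# Status\n{self.status}",
            "# Known pitfalls\n" + "\n".join(f"- {p}" for p in self.pitfalls),
        ])
```

The per-field update rules are the substance: a status that appended would bloat, and a history that overwrote would lose exactly the reasoning the next session needs.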
If adding a validation step, ensure that the validation agent's sole task is to find issues, not to "take over and continue." Adversarial validation, not an assembly-line handover.
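The design constraint can be encoded in the interface itself: the validator's only possible output is a list of issues, so there is no code path by which it "takes over" the work. The `critique` callable below stands in for an LLM call and is hypothetical.

```python
# Sketch of adversarial validation: the validator can only report issues;
# it has no way to return a revised artifact. critique() is a placeholder
# for an LLM-backed check.
def validate(artifact: str, critique) -> list[str]:
    # The whole contract: a (possibly empty) list of issues.
    return critique(artifact)

def gate(artifact: str, critique) -> str:
    # The ORIGINAL producer must fix any issues; the validator never does.
    issues = validate(artifact, critique)
    if issues:
        raise ValueError(f"validation failed: {issues}")
    return artifact
```

Because `validate` returns issues rather than a rewritten artifact, the reasoning chain stays with the producer, and the validator remains a challenger rather than a successor.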

Final note: Model capabilities are advancing rapidly. Workarounds needed in today's harness may become dead weight six months from now. Anthropic has already confirmed this—Sonnet 4.5's context anxiety disappeared in Opus 4.5, rendering the designed context reset obsolete code. Maintaining the evolvability of the architecture is more important than selecting a "perfect architecture."

The Three Provinces and Six Ministries model is an illusion that feels good but carries a high engineering cost. Its true cost is not outright failure but a system that degrades in hard-to-diagnose ways as complexity increases — each node appears to be "working," but the system as a whole is drifting.
By the time you discover an issue, the assembly line is already quite long.
References:
Anthropic Engineering Blog (Building Effective Agents, Effective Context Engineering, Multi-Agent Research System, Effective Harnesses for Long-Running Agents, Managed Agents); OpenAI Developers Blog (Run Long Horizon Tasks with Codex, Shell + Skills + Compaction); Google Developers Blog (Architecting Efficient Context-Aware Multi-Agent Framework, Conductor: Context-Driven Development for Gemini CLI)