Original Title: Avoiding Death on the Yellow Brick Road
Original Author: Joe Schmidt IV, a16z
Translation: Peggy
Editor's Note: As the capabilities of large models continue to improve, the AI application layer faces a common anxiety: if companies like OpenAI and Anthropic not only control the underlying models but also have distribution channels and brand advantages, what is left for startups at the application layer?
This is precisely the question a16z partner Joe Schmidt attempts to answer in this article. Drawing on the metaphor of the "Yellow Brick Road" from "The Wizard of Oz," he divides AI application opportunities into two categories: one is the main road that large model companies are personally entering, such as code generation, writing, image generation, general-purpose agents, and horizontal office assistants; the other is the "other parts of Oz," referring to those vertical scenarios that delve into industry processes, rely on complex workflows, data accumulation, compliance governance, and system integration capabilities.
In his view, the real opportunity for startups lies in the latter.
From sales to insurance, Joe Schmidt repeatedly emphasizes the same logic: what enterprises are truly willing to pay for is not a smarter chat window but a system that can be accountable for business outcomes. It needs to understand the messy state of customer data, deal with multi-person approvals and edge cases, take on compliance and audit responsibilities, and also, as the model continuously upgrades, help clients with migration, routing, and cost optimization.
This is also the core judgment of this article on the next generation of enterprise software: underlying models will become stronger and more replaceable; however, what is truly irreplaceable is the data, processes, governance capabilities, and operational memory that have been solidified around specific industries and workflows. The opportunity for AI application companies lies not in competing for the "Yellow Brick Road" with model companies, but in entering those more complex, dirtier, slower, but also closer to real business value places.
The following is the original text:
Lately, I have repeatedly heard the same question from founders and potential employees: What else can be done in the AI application layer? Or, in other words, will OpenAI and Anthropic ultimately kill everything?
Behind this question is a very typical AI-era anxiety. Some have concluded that if you don't want to be relegated to the eternal substrate, the only positions of long-term value are inside a large model lab or at a startup in robotics, hard tech, or similar frontier fields: essentially, doing things the labs can't touch. If every type of software is going to be devoured, either absorbed directly into Codex or Claude as they do the work themselves or rendered unnecessary by a future model, then the best choice seems to be: run fast!
I admit, I'm almost an AI maximalist myself, and I think they got half of it right. The large model labs are indeed entering a vast swath of the application layer. But the "application layer" is not a homogeneous set of opportunities. The truly important question is: are you on the "Yellow Brick Road," or somewhere else in Oz?
Note: In "The Wizard of Oz," the "Yellow Brick Road" is the main road leading to the Emerald City, at the heart of Oz, where travelers go to meet the "Wizard."
The so-called "Yellow Brick Road" is what we use to describe the path that large model labs are taking and investing significant resources in. Problems like code generation, writing, and image creation are naturally suited for labs to solve because they get better as the model's raw capabilities improve: every dollar invested in pre-training and fine-tuning directly enhances product quality.
But in other parts of Oz, there are more complex, and usually more vertical, problems. They cannot be solved simply by handing an enterprise user a horizontal tool plugged into standard tooling and compute. The value here comes more from the scaffolding around the model: this scaffolding makes the output trustworthy, compliant within a specific industry, and truly integrated into business processes. The underlying model's raw capabilities are still important, but they are not everything.
We are seeing this in real time. OpenAI and Anthropic are effectively acknowledging to the market that they cannot solve every problem with a generic AI colleague. They have announced large forward-deployment joint ventures, building entire companies focused on configuring and customizing models for enterprises. If they truly believed that the next model release could solve these problems, they would not be investing billions of dollars in such projects.
So, if you want to make money building AI applications, don't take the Yellow Brick Road; go build in other parts of Oz. Here are some lessons we and some founders in our portfolio have learned in practice.
If you are starting a company, the Yellow Brick Road is the most conspicuous but also the most dangerous path. Take a high-performance model, connect it to some off-the-shelf connectors like Google Drive, Slack, Salesforce, Notion, and GitHub, and then build an agent orchestration layer on top. It looks like magic.
The problem is, this is exactly what large model labs are doing through Cowork and Codex. They own the model, which means they have better margins, more control, and pricing power over all downstream participants. But perhaps more importantly, they also make the architectural choices that determine which problems the product is suited to solve. So far, they have very deliberately adopted a "model + tool calling" pattern, which is precisely the pattern needed for the horizontal, low-step-count tasks on the Yellow Brick Road. Even if a startup were somehow to outperform Codex or Claude Code, the large model labs still have massive distribution and the strongest brand halo in the AI space.
If you are an AI application company following the same playbook, with the same connectors, no underlying subagents or configurations, and no distribution channels, then you are most likely on a road to nowhere.
For startups, the situation is not entirely bleak. Beyond the Yellow Brick Road, there are still huge opportunities. Startups can win customers there by solving complex problems.
These companies are building an agent-centric experience: models are woven into a complex web of tools, automations, and integrations; in other words, software. This also makes most of these startups naturally vertical. They can focus on multi-step, multi-stakeholder workflows, design subagents for different roles and vertical scenarios, and tackle problems that are hard for Anthropic's and OpenAI's horizontal platforms to reach: collecting context across systems and routing tasks to multiple approvers at different stages.
This type of work usually involves one or more legacy systems, often requires deterministic outcomes because fuzziness is unacceptable, and is sometimes tied directly to a key business result. Large model labs certainly understand how valuable these problems are: that's why they are building their own outsourced configuration teams, and that's why an entire cohort of enterprise reinforcement learning services companies is emerging.
A counterpoint to the above view is: so far, betting against continued progress by the models or the labs has been a terrible bet. They are likely to keep getting stronger and eventually eat into the markets served by these application-layer companies.
Large model labs will certainly keep advancing. But I believe that companies elsewhere in Oz still have several long-run defenses.
Many things you truly internalize in your business are not present in any training set: unwritten industry conventions, undocumented standards, tribal knowledge residing only in practitioners' minds. They are not on the public internet. No matter how much training compute you pour in, it cannot substitute for truly getting inside these knowledge-rich workflows.
Here, two flywheels stack: one is the cross-customer flywheel, where patterns compound as you see more variants of the same issue; the other is the intra-customer flywheel, where the reasons behind specific decisions, the unsaid exceptions, the company's experiential rules only reveal themselves in real user-system interaction.
Even though customer data cannot be shared across customers, an app company can still leverage pattern recognition of different types of customer issues and use it to guide the architecture design for future issues. If a company has already had its AI handle a hundred legal redline edits, a thousand rounds of insurance underwriting, or ten thousand SDR sales development activities, its understanding of the nature of the problems is not something a newcomer can replicate with a first-time AI launch.
In theory, a horizontal AI could also build the same learning infrastructure. However, the reason it does not do so, apart from lack of focus, is primarily the user experience. Capturing this knowledge entirely depends on what kind of workflow interface you provide to the user. Vertical players can design these interfaces around the information that a specific workflow truly needs to expose, something that horizontal tools cannot achieve. Evaluation sets, annotated outputs, boundary case classification systems can all compound into a vertical domain data flywheel and further support fine-tuning. It is difficult for newcomers to generate such a flywheel without an equally large-scale production environment exposure. The feasibility of this depends on data rights, the accumulated production usage volume, and customer contract structures, but pattern recognition itself continues to accumulate.
Large model labs are already doing routing internally: calling different categories of models for different requests and using model ensembles at the base level. What they cannot do is route across vendors: it is hard for them to evaluate a competitor's model for a specific sub-task, or to use the best-suited open-source fine-tuned model for a narrow segment.
Companies elsewhere in Oz will choose the most suitable model for each sub-task from across the entire model market, not just the models released by a single parent lab. They also take on the work nobody wants to do: re-running evaluations every time a new model is released, retuning prompts for customer edge cases, and rolling out changes without disrupting production. Large model labs do not do these things for customers. They sell you the new model and then tell you to migrate. Companies elsewhere in Oz absorb the migration costs. Customers get the best intelligence available anywhere in the market, and continuity through every upgrade.
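To make the "re-run evaluations before every migration" idea concrete, here is a minimal sketch of a promotion gate. Everything in it is an assumption for illustration: the exact-match scoring, the `should_promote` rule, and the stub functions standing in for real vendor model calls. A production system would use richer metrics and a staged rollout.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]
EvalCase = Tuple[str, str]  # (input, expected output)


def pass_rate(model: Model, cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases where the model output matches exactly."""
    if not cases:
        raise ValueError("eval set must not be empty")
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)


def should_promote(current: Model, candidate: Model,
                   cases: Sequence[EvalCase], min_gain: float = 0.0) -> bool:
    """Promote the candidate only if it does not regress on the eval set."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) + min_gain


# Stub "models" standing in for real vendor calls (purely illustrative).
def current_model(prompt: str) -> str:
    return "approve" if "clean" in prompt else "escalate"


def candidate_model(prompt: str) -> str:
    return "approve"  # a newer model that mishandles the edge case


# Hypothetical customer-specific eval set, including one edge case.
EVAL_SET = [
    ("clean submission", "approve"),
    ("prior losses, coastal property", "escalate"),
]
```

Here `should_promote(current_model, candidate_model, EVAL_SET)` is false, so the newer model would stay out of production until its prompts are retuned for the edge case.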
Sending every query to Opus 4.7 is the fastest way to turn gross margin negative. The best companies elsewhere in Oz will route queries across model tiers: the hardest tasks go to frontier models, most tasks go to mid-tier models, and in proven areas, smaller custom or fine-tuned models are used.
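As a rough sketch of tiered routing, the snippet below picks the cheapest tier expected to handle each task. The tier names, per-call costs, difficulty thresholds, and the set of "proven" task kinds are all hypothetical placeholders, not real prices or anyone's actual routing policy.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical tiers with illustrative per-call costs in dollars.
TIER_COST = {
    "small_finetuned": 0.001,
    "mid": 0.01,
    "frontier": 0.10,
}

# Task kinds where a small custom model has already been validated (assumed).
PROVEN_KINDS = {"lead_enrichment", "email_classification"}


@dataclass
class Task:
    kind: str          # e.g. "lead_enrichment"
    difficulty: float  # 0.0 (easy) to 1.0 (hard), e.g. from your own evals


def route(task: Task) -> str:
    """Return the cheapest tier expected to handle the task."""
    if task.kind in PROVEN_KINDS and task.difficulty < 0.5:
        return "small_finetuned"
    if task.difficulty < 0.8:
        return "mid"
    return "frontier"


def blended_cost(tasks: Iterable[Task]) -> float:
    """Total cost when each task goes to its routed tier."""
    return sum(TIER_COST[route(t)] for t in tasks)


tasks = [
    Task("lead_enrichment", 0.2),
    Task("account_research", 0.6),
    Task("contract_redline", 0.9),
]
```

With these made-up numbers, routing the three tasks costs about 0.111 per batch, versus 0.30 if every task went to the frontier tier. The thresholds only make sense if you know how much intelligence each sub-task actually needs, which is the point of the paragraph above.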
Some companies are now doing their own retraining on top of this, optimizing the model for the narrow slice of work that customers truly care about and serving it at a cost far below that of a frontier API call. Large model labs price at the "floor price": the lowest intelligence level that can be bought for X dollars. What companies elsewhere in Oz sell is the opposite: the lowest dollar cost at the intelligence level a specific workflow truly needs. This is only possible when you know exactly what level of intelligence each sub-task needs. Large model labs structurally cannot understand every task in every vertical industry. Ultimately, this translates directly into lower and more controllable outcome-based pricing.
Becoming the control plane for AI running in a vertical is quite valuable. This control plane is where permissions, auditing, what agents are allowed to do, and what agents actually did come together.
This control plane is built on top of the guardrails of a specific use case, which differ entirely across industries and job types. Because these companies own, end to end, the tools, workflows, and data that agents touch, they can provide deterministic outcomes in a way that horizontal players cannot. They also absorb regulatory complexity for the end buyer: the US Federal Rules of Civil Procedure and rules of professional conduct in law, HIPAA in healthcare, SEC and FINRA rules in finance, state insurance regulations, and so on. Horizontal players can't credibly do this without turning themselves into a hundred verticals. What CIOs need is a partner that can commit, in a contract, to bearing compliance responsibility for the agents it provides.
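One way to picture such a control plane is a single chokepoint that checks what an agent is allowed to do and records what it actually attempted. The sketch below is illustrative only: the `ControlPlane` class, the agent and action names, and the log fields are assumptions, not a description of any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Set


@dataclass
class ControlPlane:
    # agent name -> actions that agent may take (hypothetical policy)
    allowed: Dict[str, Set[str]]
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def execute(self, agent: str, action: str,
                handler: Callable[[], Any]) -> Any:
        """Run handler only if permitted; record the attempt either way."""
        permitted = action in self.allowed.get(agent, set())
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "permitted": permitted,
        })
        if not permitted:
            raise PermissionError(f"{agent} may not perform {action}")
        return handler()


# Example policy: the agent may draft quotes but may not bind policies.
plane = ControlPlane(allowed={"underwriting_agent": {"draft_quote"}})
```

Because denied attempts are logged before the error is raised, the audit trail covers both what agents did and what they were stopped from doing.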
All of this ultimately comes back to one thing: focus.
This focus can be a vertical industry like insurance, law, or accounting, or a function done deeply enough, like sales, customer service, or finance. Whichever it is, the work requires a team to stay embedded in one customer archetype for the long term, understanding its workflows, edge cases, and regulatory requirements. Large model labs aren't built for this. They must serve everyone, everywhere, which is what they were architected for from the start. The same trade-off also makes them less able to venture into other parts of Oz: you can be ubiquitous, or you can be excellent at one thing, but you can't be both.
In practice, what does this mean? Here are some practical tips from 11x CEO Prabhav Jain.
One viable tactical path to building a company that can withstand the large model labs is to start with the outcomes customers truly care about. For us, that outcome is helping companies generate more sales leads and pipeline.
From here, things get very specific: Which activities that truly move the needle on pipeline growth do we want to own end to end? Break each activity down into tasks. Which tasks are suitable for agents, and which are not? Which require complex domain insight, and which do not? Large model labs also roll out workflows, but a better model alone doesn't get the job done when a workflow step is complex, inputs are messy, state is opaque, or real-world constraints exist. This is where the work reverts to traditional software engineering, and at that level, large model labs have no advantage over a focused application company.
For example, some of the tasks we handle include: prospecting based on custom signals, lead enrichment, deep account research, pulling context from the CRM, crafting messages for different channels, AI lead qualification, and email deliverability. Some of these are AI tasks, some are not. These tasks cannot be completed with a single prompt; they require deep engineering capabilities.
The key insight from the Oz analogy is: in any real-world workflow, roughly half of the work consists of non-AI tasks, and on that half the labs have no advantage. Below the model layer, their ability to write deterministic software is no better than yours. The other half, the AI tasks, still requires you to optimize, train, and constrain the model around the desired outcome.
Domain knowledge is often not found in general training data. These capabilities must be built bottom-up from vertical-industry or function-specific expertise and fed to the model at the right moment in the workflow. When our AI qualifies an inbound lead over the phone, it must be trained to understand what constitutes a good sales conversation for a specific industry and user profile. This is the work that application companies have to do, and this capability compounds.
More importantly, these capabilities continuously go out of date as the business itself evolves. Your ability to evolve workflows and context therefore becomes a competitive advantage in its own right. For example, when we first started doing email outreach at scale, "AI-written emails" were just emerging. Today, people have developed a keen sense for which emails are AI-written and which read as human, and crucially, that judgment changes every few months. Our AI must continuously adapt to these market dynamics, and that is exactly where the moat is built. In fact, despite this constant change, our response rate has increased fourfold in the past few months, creating a multi-billion-dollar sales pipeline for customers.
Complex problems are where true business value is unlocked. Otherwise, you easily find yourself just creating a thin veneer.
Breaking down any sufficiently complex business problem quickly reveals chaos. Here's a seemingly simple example from the GTM field: if a company is already your customer, you should not reach out to someone in that company again. But this is far from simple.
Perhaps you have the domain of that company in your CRM. So what about companies with dozens of subsidiaries? What if the CRM records the parent company's domain? What if an outdated mapping field in Salesforce leads you to cold email the Chief Revenue Officer of an existing customer? Real-world data is messy. Humans struggle with it, and models won't magically cross this threshold. To impose order on this chaos, you need to design specialized AI around the specific shape of the problem, rather than just pointing a generic co-pilot at the CRM and calling it a day. In fact, based on the data we have, we found our data quality and freshness to be higher than that of the customers themselves; thus, by default, we anchor to our own data.
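The "don't email an existing customer" rule can be sketched as a suppression check that resolves each domain to its top-level account before comparing. This is an illustrative sketch, not the approach described above: the `PARENT_OF` mapping, the example domains, and the function names are hypothetical, and a real system would also have to handle stale mappings, shared email providers, and acquisitions.

```python
from typing import Dict, Iterable

# Hypothetical subsidiary -> parent mapping. In practice this would come from
# a maintained account hierarchy, not a hard-coded dict.
PARENT_OF: Dict[str, str] = {
    "acme-robotics.example": "acme.example",
    "acme-labs.example": "acme-robotics.example",
}


def normalize(domain: str) -> str:
    """Lowercase, trim, and drop a leading 'www.'."""
    domain = domain.strip().lower()
    return domain[4:] if domain.startswith("www.") else domain


def root_account(domain: str) -> str:
    """Walk up the parent mapping to the top-level account domain."""
    current = normalize(domain)
    seen = set()  # guards against cycles in a bad mapping
    while current in PARENT_OF and current not in seen:
        seen.add(current)
        current = normalize(PARENT_OF[current])
    return current


def should_suppress(email: str, customer_domains: Iterable[str]) -> bool:
    """True if the prospect belongs to the same account as any customer."""
    prospect_root = root_account(email.rsplit("@", 1)[-1])
    return prospect_root in {root_account(d) for d in customer_domains}
```

Under this mapping, a prospect at a subsidiary is suppressed when the CRM only records the parent's domain, and the reverse case is caught too, because both sides are resolved to the same root account.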
Guardrails are severely underrated. Even within the same product, each use case needs its own guardrails. For us, a regulated financial services prospect requires a completely different set of assurances than a mid-sized SaaS customer. And these assurances trickle down to how the agent is written, who can be contacted, what data can be touched, what can be said over the phone, and how each decision is recorded.
A "one-size-fits-all" system will crumble in the face of these differences. Guardrails must be built per use case, configured per customer, and continuously audited, and all of this work falls entirely on the application company. That's why we need forward-deployed engineers and technical deployment strategists to tune the system to each customer's requirements.
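A minimal sketch of "built per use case, configured per customer" is a layered configuration: defaults for each use case, with customer-specific overrides applied on top. The use-case names, fields, and decision labels below are assumptions chosen for illustration, and the audit step is omitted.

```python
from typing import Any, Dict, Optional

# Hypothetical per-use-case defaults.
USE_CASE_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "regulated_finance": {"channels": {"email"}, "require_human_review": True},
    "midmarket_saas": {"channels": {"email", "phone"},
                       "require_human_review": False},
}


def build_guardrails(use_case: str,
                     overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Start from the use-case defaults, then apply customer overrides."""
    rails = dict(USE_CASE_DEFAULTS[use_case])  # copy so defaults stay intact
    rails.update(overrides or {})
    return rails


def check_outreach(rails: Dict[str, Any], channel: str) -> str:
    """Decide whether a planned outreach may proceed on the given channel."""
    if channel not in rails["channels"]:
        return "blocked"
    return "needs_review" if rails["require_human_review"] else "allowed"


finance = build_guardrails("regulated_finance")
# A SaaS customer that has asked for email-only outreach.
saas_email_only = build_guardrails("midmarket_saas", {"channels": {"email"}})
```

For example, `check_outreach(finance, "email")` returns `"needs_review"`, while the same email for `saas_email_only` returns `"allowed"` and a phone call is blocked for both.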
For example, we once worked with a Fortune 1000 institution to conduct opt-in outbound calling to their vast SMB customer base via voice. In the initial rounds, the answer rate was very low. We had to iterate quickly and learn how to engage this specific audience within the first 10 seconds of the call. The behavior of an SMB business owner is completely different from that of a large B2B buyer or consumer. Now, the number of sales opportunities we create for them in a day exceeds what their entire sales team could generate in that segment in a month.
Sales is just one example. Insurance illustrates the same point from a different angle. Here is FurtherAI CEO Aman Gour's take on building off the Yellow Brick Road.
As we began deploying AI into real insurance operations, we repeatedly heard an assumption: the model is the intelligence, and the workflow is just scaffolding built around the model.
But the more insurance companies we partner with, the more we are convinced it is quite the opposite.
In the insurance industry, much of the intelligence lives in the workflow itself. Two insurance companies can route a submission through what looks like the same path: submission, review, quoting, underwriting. The path itself is easy. What truly distinguishes the two companies is everything inside the path: which risks need escalation, which loss signals are crucial, which underwriting preference takes precedence when two conflict, when human sign-off is mandatory, what external data needs to be pulled, and how the final decision is recorded.
This logic does not reside in a clean rule engine. It is scattered across standard operating procedures, manager reviews, underwriting philosophies, insurance company-specific risk preferences, and years of operational experience. Much of it is not written in a form a model can directly read.
Tool products can also generate real revenue, but large model labs are more likely to take that revenue away, because customers do not rely on you as an orchestration layer. A high ACV is often a signal of a system product, since systems replace real human effort and can therefore command corresponding payment. However, it is not an absolute guarantee. You need to ask yourself: if a large model lab launches a product that seems to compete directly with yours, do customers still need your product? If the answer is yes, you are building a system. If the answer is no, you are a tool, even if your ACV is high.
The performance of large model labs is judged by benchmarks; the performance of companies elsewhere in Oz is judged by their customers' P&L.
Customers do not care how your model scored on SWE-Bench or MMLU. They care about whether your agent closed deals, properly redlined contracts, and underwrote the right policies. If customers care about specific workflow outcomes rather than generic capability scores, you are elsewhere in Oz. If customers pay for generic capability, then you are selling what they can get through a Claude or Codex seat.
The best AI companies need to perform like hedge funds: they win on alpha, and alpha is measured in customer P&Ls, not benchmark scores.
We will see massive winners both on the Yellow Brick Road and beyond it. The model labs will continue to win because they own the models and the distribution built for horizontal tools.
Companies elsewhere in Oz can also win, as long as they own working systems: the interfaces where the actual work of the business takes place, along with the data that flows through them and is captured there. These companies have data capture, workflow action systems, and governance. As complex workflows in a vertical mature, they coalesce into a core experience that customers cannot live without. As incumbents and upstarts alike continue to release next-generation models, these companies will be the layer that integrates those models and delivers them to customers. The underlying models are replaceable, but the working systems are not.
The next generation of enterprise software will be built beyond the yellow brick road.