Data Flywheel or Duplicate Samples? Physical AI Bids Farewell to "Hour Worship"

律动BlockBeats

Robotics companies are still chasing hours, but what they really lack is fresh data.

TL;DR
· Roboticist Animesh Garg questions the industry's use of teleoperation hours as a proxy for model capability.
· Robot data collection is costly, and deployment data often comes from narrow scenarios, so redundant samples quickly become expensive.
· Long-tail failures, task coverage, and novel samples may matter more than total operating time.

Animesh Garg, a roboticist currently at Georgia Tech, has likened the data race in embodied intelligence to baseball's "Moneyball" moment in an article titled "Moneyball for Physical AI."

What he aims to challenge is an increasingly common funding narrative: that robotics companies can spin up a data flywheel simply by piling on more teleoperation, more real-world deployment, and more runtime. For investors, this is not an academic skirmish. The cost structure, commercialization speed, and model moat of an embodied intelligence company are often bundled into the phrase "data flywheel." If cumulative hours do not translate into meaningful model progress, the market needs to reconsider how it values these companies' data assets.

The "Data Hour" Metric Might Be a Superstition in the Robotics Industry

Garg borrows the classic "Moneyball" analogy. In 2002, the Oakland Athletics won 103 games with one of the lowest payrolls in the league, not by buying more expensive players but by exploiting market inefficiencies in how players were valued. Traditional scouts prized batting average, stolen bases, and physique, but the statistic that best explained a team's ability to score runs was on-base percentage.

In his view, Physical AI may be in a similar phase. The industry agrees that data is essential on the path to a general-purpose robot model, yet it tends to treat the metrics that are easiest to show off as the ones that matter most: cumulative teleoperation hours, number of demonstration trajectories, number of deployed robots, and runtime in production settings.

Robot data and text data differ in how they are supplied. Large language models can draw on massive amounts of low-cost text from the internet, code repositories, books, and web pages, so their bottlenecks lie more in compute, data cleaning, and training efficiency. Robot models need data that contains physical interaction, action feedback, and environmental change. Every hour of valid data has to be physically generated, and each hour carries costs for equipment, labor, space, sensors, failure handling, and safety.

Roboticist Ken Goldberg once used the term "100,000-year data gap" to describe the disparity between robot and internet-scale AI data. More accurately, the text and image data consumed by contemporary large-scale vision language model training, if converted into human reading or viewing time, equate to about 100,000 years, while robots lack an equivalent scale of real-world interactive data. This statement is not about setting precise thresholds for robot models but about reminding the industry that real-world interactive data cannot be fetched at low cost like web text.

This is also why Garg opposes the "sweatshop-style teleoperation" narrative. A large volume of manual teleoperation can indeed provide training data that includes actions, but if a company evaluates data solely by total hours, money may flow to repetitive, low-difficulty, low-information-density samples rather than to the scenarios that would most effectively reduce failure rates.

Three Types of Data Buy Different Things

In Garg's classification, Physical AI data falls roughly into three categories: observation data, intervention data, and deployment data. All three can be useful, but they differ significantly in cost, constraints, and information density.

The first category is observation data, such as first-person or third-person videos. Its advantage is low cost and wide coverage, which can help models understand objects, space, action outcomes, and environmental distribution. The downside is also clear: the model can see what a person or object is doing, but may not necessarily know what action the robot should take in a certain state.

The second category is intervention data, which includes trajectories from teleoperation, demonstration, and human-in-the-loop correction. This type of data is more directly useful for robot training because it contains the chain of "what is seen, how to move, and what happens after the move." The trade-off is that high-quality trajectories are costly to acquire, and labor and equipment costs are unlikely to fall as quickly as the cost of software data.

The third category is deployment data, which is telemetry data generated when a robot operates in a real-world commercial setting. It sounds closest to a business flywheel: the robot works, makes money, and generates training data simultaneously. However, there is a statistical trap here.

Today, the first robot deployments usually occur in environments with minimal variation, tightly structured processes, and well-controlled risks, such as warehouses, factories, or single-task settings. The volume of this production data may be large, but its distribution is narrow and its repetition is high. Once the model has learned the local patterns, each additional hour of operation yields less new information.
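
The saturation argument can be illustrated with a toy model. It is an assumption made here for illustration, not something taken from Garg's article: suppose a deployment site exposes only k distinct situations, each equally likely in any episode. The expected number of distinct situations seen after n episodes is then k(1 - (1 - 1/k)^n), and the expected number of new situations contributed by the next episode is (1 - 1/k)^n, which shrinks geometrically.

```python
def expected_distinct(k: int, n: int) -> float:
    """Expected number of distinct situations seen after n uniform draws from k."""
    return k * (1.0 - (1.0 - 1.0 / k) ** n)


def marginal_new(k: int, n: int) -> float:
    """Expected number of new situations contributed by episode n + 1."""
    return (1.0 - 1.0 / k) ** n


if __name__ == "__main__":
    k = 50  # hypothetical number of distinct situations at a narrow site
    for n in (0, 50, 200, 1000):
        print(n, round(expected_distinct(k, n), 2), round(marginal_new(k, n), 4))
```

Real sites are not uniform, and rare situations take far longer to surface than this model suggests, which only strengthens the point about tail samples.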

Deployment data is not without value. What is truly valuable is often not the numerous "task success" routine segments but the failures, stalls, anomalous objects, edge cases, and rare perturbations. The challenge is that these tail samples do not appear at a stable pace as desired by the company, and the costs of discovery, filtering, and post-mortems are higher.

More Data is Useful, but Duplicate Samples Quickly Become Expensive

Garg borrows cautiously from language model scaling laws: more data usually lowers model loss, but with diminishing returns. If samples are repetitive, nearly identical, or drawn from the same narrow distribution, the benefit of new data shrinks even faster.
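
One common way to write this down is a power-law fit of loss against dataset size. The form below is an illustrative assumption, not a formula from Garg's article; the constants A > 0, α > 0, and the irreducible loss L_∞ are hypothetical.

```latex
L(D) = L_\infty + A\,D^{-\alpha}, \qquad
\frac{dL}{dD} = -\alpha A\, D^{-(\alpha + 1)}
```

The derivative shrinks in magnitude as D grows, so each extra sample buys less loss reduction. Under this reading, if only a fraction of new samples is effectively novel, D is better interpreted as the deduplicated sample count than as the raw count.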

In robotics, this issue is even more apparent. A robot learning to pick fixed packages from fixed shelves may find its first few thousand attempts, failures, and corrections highly valuable. Once the actions, objects, lighting, and paths have been extensively captured, additional data mostly replays local experience the model has already learned.

Language model training has produced similar experience: repetitive and near-duplicate data wastes training budget, and excessive repetition may also harm generalization. Garg does not apply these conclusions directly to robot training, but uses them to point in a direction: the value of data should be measured not only by quantity but also by how much the samples differ from one another.
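
A minimal sketch of measuring "how much the samples differ", assuming each trajectory has already been summarized as a numeric feature vector. The embedding step, the cosine-distance threshold, and the function names are all assumptions for illustration, not a method described in Garg's article.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat a zero vector as maximally different
    return 1.0 - dot / (na * nb)


def select_novel(samples: List[Sequence[float]], min_dist: float = 0.1) -> List[int]:
    """Greedy filter: keep a sample only if it is at least min_dist away
    from every sample already kept. Returns the kept indices."""
    kept: List[int] = []
    for i, s in enumerate(samples):
        if all(cosine_distance(s, samples[j]) >= min_dist for j in kept):
            kept.append(i)
    return kept


# Hypothetical trajectory embeddings: three near-identical picks, one unusual case.
samples = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.98, 0.02, 0.0],
    [0.0, 1.0, 0.0],
]
kept = select_novel(samples)
novelty_ratio = len(kept) / len(samples)
```

The greedy pass compares each sample against everything kept so far, so it is quadratic in the worst case; it is meant to show the idea, not to scale to a production log.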

For Physical AI, diversity has at least two meanings. The first is to expose the model to more objects, spaces, materials, lighting conditions, occlusions, and manipulation methods. The second is to keep the model from performing well on an overly simple task distribution and then failing in slightly different scenarios.

As a result, tail-end failure cases become crucial. The physical world is not uniformly distributed, and low-frequency anomalies often determine commercial viability: slightly misaligned objects, deformed packaging, surface reflections, gripper slippage, human interference, missed sensor readings, and changes in floor friction. No matter how well the model performs on routine samples, if it cannot handle these tail events, deployment will still be held back by occasional failures.

Establishing a Deployment Flywheel Requires Early Scenarios to be Sufficiently "New"

What Garg's article truly challenges is a common commercialization route for embodied AI companies: deploy robots in narrow scenarios first, keep them available through human teleoperation, collect production data along the way, and then use that data to train stronger models and expand into more scenarios.

Garg refers to this type of path as a "neo-integrator" approach. It tries to avoid paying for pure data collection by putting robots into commercial production and letting operational revenue offset data costs. Compared with setting up a dedicated teleoperation factory, this path sounds more efficient.

But the flywheel has one prerequisite: the data generated in early commercial scenarios must be novel and diverse enough to help the model transfer to more tasks. If the deployment scenario is low in variance, low in entropy, and heavily engineered for a narrow task, the data will saturate quickly. The company may end up not with a general-purpose capability flywheel but with a set of custom projects that require continuous integration, maintenance, and exception handling.

This incurs two types of costs. First, every new scenario requires investment in environmental modifications, process adaptation, failure fallbacks, and safety mechanisms. Second, if the deployment itself has not yet broken even, scaling up does not necessarily mean collecting data cheaply; it may mean trading losses for a large volume of low-novelty samples.

Early deployment is therefore not useless, but it needs closer scrutiny: how much new task coverage it has added, how many failures and outlier samples it has generated, whether those samples transfer to other scenarios, and how much model improvement each dollar buys after deducting hardware, labor, maintenance, and integration costs.

The Valuation Narrative Shouldn’t Just Ask How Many Hours Have Been Accumulated

Garg's suggestion is not to stop data collection, but to switch the evaluation focus. Cumulative running hours, teleoperation hours, and trajectory counts can serve as operational metrics but should not be directly equated with model progress.

More insightful questions include: when data saturates for a single task; how much engineering integration a new task costs; how well the data covers different scenarios and action clusters; how much of the production data truly comes from distribution shift and outlier samples; and how many routine successful segments in the deployment stream should be filtered out rather than continuously fed to the model.

Capital allocation should differ across the three types of data. Observation data should prioritize low cost, diversity, and broad coverage to expand the boundary of foundational capabilities. For high-cost teleoperation and demonstration data, once a single task saturates, the budget should shift to more tasks instead of repeating the same action. Deployment data should focus on screening for failures, edge cases, and out-of-distribution samples, discarding the large volume of routine operation records with low information density.
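
The screening step for deployment data can be sketched as a simple filter. The record fields, the out-of-distribution score, and the threshold below are hypothetical and chosen only for illustration; the article does not specify how such screening should be implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One deployment episode. All fields are hypothetical."""
    episode_id: str
    success: bool
    human_intervened: bool
    ood_score: float  # higher = further from the training distribution (assumed given)


def keep_for_training(ep: Episode, ood_threshold: float = 0.8) -> bool:
    """Keep failures, interventions, and out-of-distribution episodes;
    drop routine successes."""
    return (not ep.success) or ep.human_intervened or ep.ood_score >= ood_threshold


def triage(log: List[Episode], ood_threshold: float = 0.8) -> List[Episode]:
    """Return only the episodes worth keeping, in their original order."""
    return [ep for ep in log if keep_for_training(ep, ood_threshold)]


log = [
    Episode("a", success=True, human_intervened=False, ood_score=0.1),
    Episode("b", success=False, human_intervened=False, ood_score=0.2),
    Episode("c", success=True, human_intervened=True, ood_score=0.3),
    Episode("d", success=True, human_intervened=False, ood_score=0.9),
]
selected = [ep.episode_id for ep in triage(log)]
```

Here the routine success "a" is dropped, while the failure, the intervention, and the out-of-distribution episode are kept.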

These views have practical implications for the valuation narrative of Physical AI. More robots, longer runtime, and a larger teleoperation team do not automatically mean a stronger model moat. The capability that is harder to replicate may lie in consistently finding high-value long-tail data, recognizing when a given kind of data has saturated, and covering more task distributions at lower cost.

However, this is still a capital allocation perspective, not yet an industry consensus. Whether robot models will see scaling benefits similar to those of language models, whether deployment data can keep generating new information in certain high-dimensional scenarios, and how efficiently learning transfers between tasks all require more empirical results to answer.

Garg's reminder comes down to a more specific point: the "golden metric" of Physical AI might not be the number of data hours but the number of novel samples acquired per dollar. For robot companies still telling the data flywheel story, what the market may eventually look at is not how long the cumulative runtime is but how much new information was actually generated during that time.
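
As a worked illustration of "novel samples per dollar", the sketch below compares two hypothetical data sources. Every number, field name, and the definition of "novel" are assumptions made up for the example; the article proposes the metric only in general terms.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """A data collection channel. All figures used below are made up."""
    name: str
    hours: float
    cost_per_hour: float     # fully loaded: hardware, labor, maintenance, integration
    samples_per_hour: float
    novelty_rate: float      # fraction of samples judged novel, between 0 and 1


def novel_samples_per_dollar(src: DataSource) -> float:
    """Novel samples collected divided by total spend."""
    total_cost = src.hours * src.cost_per_hour
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    novel = src.hours * src.samples_per_hour * src.novelty_rate
    return novel / total_cost


warehouse = DataSource("narrow warehouse deployment", hours=10_000,
                       cost_per_hour=20.0, samples_per_hour=60.0, novelty_rate=0.01)
teleop = DataSource("multi-task teleoperation", hours=500,
                    cost_per_hour=100.0, samples_per_hour=30.0, novelty_rate=0.40)

for src in (warehouse, teleop):
    print(src.name, round(novel_samples_per_dollar(src), 4))
```

With these invented figures, the source with twenty times fewer hours comes out ahead per dollar. Hours cancel out of the ratio when the novelty rate is held fixed; in practice the novelty rate would be expected to fall as hours accumulate in a narrow scenario, which is the effect the article describes.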
