
ARC-AGI-3 Releases Largest-Ever Human Test Data: Humans Pass Every Environment, AI Still Lags

According to WatchBeat Monitoring, the ARC Prize Foundation has released the human performance dataset for ARC-AGI-3, the largest-scale human testing study in the ARC-AGI series to date, with a total of 458 participants. The dataset includes 342 complete human operation replay records, covering 25 public environments, all of which have been open-sourced.

ARC-AGI-3 consists of 135 abstract reasoning environments. Testers received no gameplay instructions and had to explore, infer the rules, and develop strategies on their own. The tests were conducted in person at a testing center in San Francisco, with each test lasting 90 minutes. Participants received base pay of around $130 plus a $5 reward for each environment passed. All tests were run under a "first-pass" condition: each person saw an environment only once and attempted it only once, so the results measure learning and adaptation when facing completely new problems. Humans and AIs received exactly the same information.

Key Findings: Every environment in ARC-AGI-3 was passed by humans. At least two independent participants completed each environment, and most environments were passed five or more times. The ARC Prize Foundation stated, "We have not yet achieved AGI, and this dataset is the evidence for that."

Since the preview of ARC-AGI-3, the public environments have received nearly 1 million AI evaluation submissions. Based on this data, the foundation has announced two scoring rule adjustments simultaneously: first, changing the human benchmark from the "second-best player" to the "median player" per level to reduce the impact of luck on scores; second, raising the single-level score ceiling from 100% to 115% to prevent poor performance on one level from dragging down the overall score. The net effect of these two adjustments is a slight increase of about 0.5 percentage points in both human and AI scores.
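To make the two adjustments concrete, here is a minimal Python sketch. It is not the foundation's actual scoring code. It assumes, purely for illustration, that a level's score is the ratio of the human baseline's action count to the agent's action count (fewer actions being better), clipped at a ceiling, and that the overall score is the mean across levels. The function names and all numbers are hypothetical.

```python
from statistics import median


def human_baseline(action_counts, rule):
    """Pick the per-level human reference from human action counts.

    Assumption for illustration: fewer actions means a better run.
    """
    ranked = sorted(action_counts)
    if rule == "second_best":
        # Second-lowest action count, falling back to the only run if needed.
        return ranked[1] if len(ranked) > 1 else ranked[0]
    if rule == "median":
        return median(ranked)
    raise ValueError(f"unknown rule: {rule}")


def level_score(baseline_actions, agent_actions, cap):
    """Hypothetical per-level score: efficiency ratio clipped at `cap`."""
    return min(cap, baseline_actions / agent_actions)


def overall_score(levels, rule, cap):
    """Mean of capped per-level scores.

    `levels` is a list of (human_action_counts, agent_action_count) pairs.
    """
    scores = [
        level_score(human_baseline(humans, rule), agent, cap)
        for humans, agent in levels
    ]
    return sum(scores) / len(scores)


# Made-up numbers, not taken from the released dataset.
levels = [
    ([20, 30, 40, 50, 60], 32),  # agent is more efficient than the median human
    ([10, 25, 30, 35, 90], 60),  # agent is well behind on this level
]

old = overall_score(levels, rule="second_best", cap=1.00)
new = overall_score(levels, rule="median", cap=1.15)
print(f"old rules: {old:.3f}, new rules: {new:.3f}")
```

Under these assumptions, the median baseline is less sensitive to a single unusually efficient human run, and the higher ceiling lets a strong level partly offset a weak one. The toy numbers exaggerate the shift; the article reports a net change of only about 0.5 percentage points.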
