According to Dynamic Beating monitoring, Anthropic disclosed in the system card for Claude Fable 5 and Claude Mythos 5 that Mythos 5 showed strong expert-assistance capabilities in biosafety assessments. In a plant pathology red-team exercise, six biology PhDs were paired with large-model experts and used Mythos 5 to design end-to-end biological resistance solutions against a hypothetical engineered agricultural pathogen. Three of the teams included plant pathology experts; the other three consisted of general microbiology PhDs.
Within 16 hours, two of the three generalist PhD teams outperformed all three expert teams on scientific quality and feasibility. Expert reviewers estimated that, without AI tools, producing these strategies and implementation protocols would typically take 40 to 95 working days, with an average of about 72.5 working days. Anthropic considers this one of the strongest single pieces of evidence that Mythos 5 is approaching the CB-2 risk threshold: on some tasks, the model already gives general researchers domain-knowledge support close to that of world-class experts.
This does not mean, however, that Mythos 5 can already conduct cutting-edge research autonomously. Anthropic also notes that the model still depends on human experts to filter its ideas, is weak at open-ended conceptual work, tends to remix existing literature into complex solutions, and rarely proposes genuinely novel pathways. It also tends to keep building on a flawed framework supplied by the user, even when flaws in the proposed solution have been detected.
The CUSP scientific prediction benchmark echoes this assessment. CUSP covers 4,760 scientific events and evaluates models on feasibility assessment of research progress, mechanism identification, solution generation, and time prediction. GPT-5.4 scored 81.9% on four-choice mechanism identification questions and Claude S4.5 scored 72.4%. On the binary task of judging whether a scientific advance will actually materialize, however, each model's accuracy fell between 45.3% and 51.9%, close to random guessing. In other words, current large models are already very good at filling in individual scientific steps but still cannot reliably judge which scientific paths will ultimately succeed.
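To make the "close to random guessing" comparison concrete, the reported accuracies can be rescaled against the chance baseline of each task format. The sketch below is illustrative only: it assumes uniform guessing over balanced answer options (25% for four-choice, 50% for binary), which the report does not state about CUSP, and the function name chance_adjusted is a hypothetical helper, not part of the benchmark.

```python
def chance_adjusted(accuracy: float, n_choices: int) -> float:
    """Rescale accuracy so that chance level maps to 0 and a perfect score to 1.

    Assumes uniform guessing over n_choices balanced options; this is an
    assumption of the sketch, not something the CUSP report states.
    """
    chance = 1.0 / n_choices
    return (accuracy - chance) / (1.0 - chance)


# Accuracies as quoted in the report, paired with the number of answer options.
reported = {
    "GPT-5.4, four-choice mechanism identification": (0.819, 4),
    "Claude S4.5, four-choice mechanism identification": (0.724, 4),
    "Binary materialization task, lowest model": (0.453, 2),
    "Binary materialization task, highest model": (0.519, 2),
}

for label, (accuracy, n_choices) in reported.items():
    score = chance_adjusted(accuracy, n_choices)
    print(f"{label}: raw {accuracy:.1%}, chance-adjusted {score:+.3f}")
```

Under these assumptions the four-choice scores sit well above chance (roughly 0.76 and 0.63 on the rescaled axis), while the binary scores straddle zero (about -0.09 to +0.04), which is what "close to random guessing" amounts to numerically.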
