According to Perceive Beating monitoring, Tencent Hyaline and SSV Digital Culture Lab, in collaboration with institutes such as the CAS Institute of Information Engineering, have officially launched Chronicles-OCR, the first ancient character perception benchmark covering the "Seven Transformations". The benchmark contains 2,800 images cross-annotated by experts and, for the first time, quantifies the recognition difficulty of seven scripts, from Oracle Bone Script to Cursive Script.
The research team evaluated 28 mainstream multimodal large language models, and the results show near-total failure on ancient scripts. In the cross-era character detection task, the core metrics of GPT-5 and Gemini 2.5 Pro approached 0, and the best-performing model scored only 16.5. Even when bounding boxes were drawn directly on the image so that the models could skip the localization step, the highest accuracy was only 27.1%, and Gemini 3.1 Pro reached just 14.0% on Oracle Bone Script.
This confirms that modern models rely heavily on the structured priors of modern layouts. Faced with ancient physical media that are unconstrained and highly noisy, the models' text segmentation mechanisms fail outright. The script classification results further indicate that the models often recognize the texture of the carrier medium (such as turtle shell or bronze patina) rather than the actual character strokes.
The experiment also revealed a counterintuitive phenomenon: enabling thinking mode actually lowers ancient character recognition rates. In a side-by-side comparison, almost all models that support this mode performed worse with it turned on. When the underlying visual perception is missing, the chain of thought not only fails to correct errors but becomes a hallucination amplifier, producing incorrect answers with high confidence.
