Why Can't a Large Language Model Write "Ma Jiaqi"? MiniMax's Full-Vocabulary Scan Reveals Nearly 5% of Tokens Forgotten in Post-Training

According to monitoring by Perceive Beating, MiniMax has published a technical blog detailing the root-cause analysis behind its M2 series large model's inability to output the name "Ma Jiaqi." What began as an investigation into a single failure case ultimately uncovered systemic degradation affecting the entire vocabulary.

The root cause traces back to the tokenizer (the component that splits text into the units a model processes), which merged "Jiaqi" into a single token when it was built. During pre-training, the model encountered this token across large volumes of internet text and learned it; in the dialogue data later used for fine-tuning, however, fewer than five samples contained "Jiaqi." Throughout fine-tuning, high-frequency tokens such as tool_call markers and code symbols kept updating the surrounding vector space, pushing low-frequency tokens like "Jiaqi" in the wrong direction. The model still "recognizes" Ma Jiaqi and can answer related questions accurately; all it lost was the ability to output this one token.
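To make the failure mode concrete, here is a minimal sketch of how such a diagnosis might be reproduced with the Hugging Face transformers API. The checkpoint id "MiniMax/M2" and the prompt template are placeholders, not artifacts MiniMax published:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id; substitute any causal LM you have access to.
tok = AutoTokenizer.from_pretrained("MiniMax/M2")
model = AutoModelForCausalLM.from_pretrained("MiniMax/M2")

# If this returns a single id, the name was merged into one token by the
# tokenizer's merge rules, exactly the situation described in the blog.
ids = tok.encode("嘉祺", add_special_tokens=False)
print(ids)

# Probe the probability of emitting that token after an echo-style prompt
# (assumed template). A "forgotten" token shows near-zero probability here
# even though the model can still discuss the person it denotes.
inputs = tok("Repeat exactly: 嘉祺\n", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)
print(probs[ids[0]].item())
```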

The team then scanned the full vocabulary of roughly 200,000 tokens and found that about 4.9% had degraded significantly. Japanese was hit hardest: 29.7% of Japanese tokens showed significant degradation, far above Korean at 3.3%, Russian at 3.7%, Chinese at 3.9%, and English at 3.5%. Internet SEO spam terms such as "Legend private servers" and "painless abortion" also ranked high, degraded by exactly the same mechanism as "Jiaqi."
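A scan of this kind can be approximated by comparing each token's embedding before and after post-training. The sketch below assumes the two embedding matrices have been exported to disk; the file names and the 0.9 cutoff are illustrative, not MiniMax's actual criterion:

```python
import unicodedata
from collections import Counter

import torch
import torch.nn.functional as F

# Assumed exports: the token-embedding matrices of the pre-trained and
# post-trained checkpoints, both shaped [vocab_size, hidden_dim].
emb_pre = torch.load("embeddings_pretrain.pt")
emb_post = torch.load("embeddings_posttrain.pt")

# One cosine-similarity score per token; low scores mean the token's
# parameters drifted heavily during post-training.
cos = F.cosine_similarity(emb_pre, emb_post, dim=-1)
degraded = (cos < 0.9).nonzero(as_tuple=True)[0]  # illustrative threshold
print(f"{degraded.numel() / emb_pre.size(0):.1%} of tokens below threshold")

def script_of(piece: str) -> str:
    """Crude language bucket based on Unicode character names."""
    for ch in piece:
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name:
            return "Japanese"
        if "HANGUL" in name:
            return "Korean"
        if "CYRILLIC" in name:
            return "Russian"
        if name.startswith("CJK"):
            return "Chinese"  # kanji-only pieces land here too
        if "LATIN" in name:
            return "English/Latin"
    return "other"

# Per-language breakdown, reusing the tokenizer from the earlier sketch.
print(Counter(script_of(tok.decode([int(i)])) for i in degraded))
```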

The severe Japanese degradation also cleared up an old mystery. The model would occasionally mix Russian or Korean characters into Japanese conversations, and the cause had never been found. This analysis showed that once the Japanese token parameters drifted, they became entangled with tokens of other languages in the vector space, which both caused the wrong tokens to fire during Japanese generation (language mixing) and pushed neighboring low-frequency Chinese tokens out of the normal probability range (token forgetting).
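That entanglement can be probed with a nearest-neighbor lookup in the post-training embedding space. In this sketch (reusing emb_post and tok from the sketches above, with a made-up token id), a drifted Japanese token whose closest neighbors are Cyrillic or Hangul pieces would reproduce exactly the mixing the blog describes:

```python
import torch
import torch.nn.functional as F

def nearest_tokens(token_id: int, emb: torch.Tensor, k: int = 5) -> list[str]:
    # Cosine similarity of one token's embedding against the whole matrix.
    sims = F.cosine_similarity(emb[token_id].unsqueeze(0), emb, dim=-1)
    top = sims.topk(k + 1).indices[1:]  # drop the token itself
    return [tok.decode([int(i)]) for i in top]

# For a healthy Japanese token the neighbors are other Japanese pieces;
# for a drifted one they may be Russian or Korean pieces instead.
print(nearest_tokens(12345, emb_post))  # 12345 is an illustrative token id
```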

The fix was to construct a synthetic dataset covering the entire vocabulary and train the model on a simple echo task for each token. The results were immediate: the share of Russian characters mixed into Japanese responses dropped from 47% to 1%, and the stability of the token output parameters (measured as cosine similarity before and after training) rose from a minimum of 0.329 to above 0.97 for every token.
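MiniMax did not publish the exact training format, but the echo task itself is easy to sketch: one "repeat this token" sample per vocabulary entry, with the prompt template below an assumed placeholder:

```python
def build_echo_dataset(tok) -> list[dict]:
    """One 'repeat this token' sample for every vocabulary entry,
    so even the rarest tokens receive a gradient during fine-tuning."""
    samples = []
    for token_id in range(tok.vocab_size):
        piece = tok.decode([token_id])
        if not piece.strip():
            continue  # skip whitespace-only and control pieces
        samples.append({
            "prompt": f"Repeat exactly: {piece}",  # assumed template
            "response": piece,
        })
    return samples

echo_data = build_echo_dataset(tok)  # reuses the tokenizer from above
print(len(echo_data), echo_data[0])
```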
