According to reporting from Perceive Beating, Tilde Research has discovered a hidden flaw in the Muon optimizer, used by top models such as DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of the neurons in the MLP layers to die permanently early in training. Building on this finding, the team designed an alternative optimizer called Aurora and open-sourced it. Trained with Aurora, a 1.1B model using only about 100B tokens matched models trained on 36T tokens, including Qwen3-1.7B, on language-understanding benchmarks such as HellaSwag and Winogrande.
The issue lies in a mathematical property of how Muon handles MLP weight matrices. Early in training, some neurons happen to receive weaker gradient signals. Traditional optimizers like AdamW normalize per parameter, which naturally flattens this disparity; Muon's orthogonalization step, however, passes the weak signals through unchanged. Weak neurons keep receiving weak updates and fall increasingly silent, a "rich get richer" feedback loop. By step 500, over a quarter of the neurons have effectively died, wasting parameter capacity.
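The effect is easy to reproduce. Below is a minimal sketch (not Tilde's code) using the quintic Newton-Schulz iteration from the public Muon implementation; the function name `newton_schulz` and the synthetic gradient are illustrative. Rows of the gradient that start weak stay weak after orthogonalization, whereas a per-parameter normalizer would flatten the gap:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic Newton-Schulz iteration used by Muon to approximately
    # orthogonalize a gradient matrix (coefficients from the public Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

torch.manual_seed(0)
G = torch.randn(1024, 256)   # rows = neurons of an MLP weight matrix
G[::4] *= 1e-3               # a quarter of the neurons get weak gradients
U = newton_schulz(G)
row_norms = U.norm(dim=1)
print(f"weak rows:   {row_norms[::4].mean():.5f}")
print(f"strong rows: {row_norms[1::4].mean():.5f}")
# The weak rows' updates come out orders of magnitude smaller, so the
# disparity persists step after step instead of being corrected.
```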
An earlier improvement, NorMuon, mitigated this by forcibly equalizing the update magnitude of each row, but at the cost of breaking the orthogonality of the update matrix (orthogonalization is what makes each update step maximally efficient, and is a core advantage of Muon), sacrificing optimization precision. Aurora instead imposes "uniform updates" and "orthogonality" as a joint constraint, satisfying both simultaneously through alternating iterations: every neuron gets a fair learning opportunity, and update precision is not given up.
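The article does not publish Aurora's exact algorithm, but the scheme it describes, alternating between an orthogonality projection and row-norm equalization, can be sketched as follows. This is a hypothetical reconstruction, not the released Aurora code; it reuses the `newton_schulz` helper from the sketch above, and `outer_steps` is an assumed knob:

```python
def aurora_like_update(G: torch.Tensor, outer_steps: int = 3,
                       eps: float = 1e-7) -> torch.Tensor:
    # Hypothetical reconstruction of the alternating scheme described above:
    # repeatedly (1) project the update toward an orthogonal matrix, then
    # (2) rescale every row (neuron) to a common norm so no neuron starves.
    X = G
    for _ in range(outer_steps):
        X = newton_schulz(X)                    # restore orthogonality
        target = X.norm() / X.shape[0] ** 0.5   # common per-row norm
        X = X * (target / (X.norm(dim=1, keepdim=True) + eps))
    return X

update = aurora_like_update(G)
print(f"row-norm spread: {update.norm(dim=1).std():.5f}")  # near-uniform rows
```

Each projection only mildly disturbs the other constraint, so a few alternations land close to satisfying both at once, which would be consistent with the small overhead reported below.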
Even untuned, Aurora incurs only about 6% more computational overhead than Muon and can be swapped in directly. On the optimized-nanoGPT benchmark, Aurora set a new record of 3175 steps. Aurora's advantage grows with MLP width: the higher the expansion factor, the more pronounced the improvement.
The code and the 1.1B pre-trained model have both been open-sourced.
