According to Watchful AI monitoring, as Mixture-of-Experts (MoE) architectures scale up, training large models on domestic Ascend chips has become a key direction for building autonomous and controllable AI computing power. Most mainstream large-model frameworks, however, are built on the NVIDIA CUDA ecosystem, and porting them directly to the Ascend platform can lead to unbalanced hardware queue scheduling and low compute utilization. The University of Science and Technology of China, Huawei, and Peking University have jointly released HyperParallel-MoE, a compilation and scheduling framework. It applies tile-level control to the hardware queues specific to the Ascend A3, aiming to overcome the energy efficiency bottleneck that arises when heterogeneous compute units are scheduled in parallel.
The Ascend A3 has two types of cores: AIC cores handle matrix multiplication, while AIV cores handle vector computation and communication. Under conventional operator-by-operator serial scheduling, the two core types can only work in alternation, so each sits idle while the other runs. Test data show that when a 671B-parameter DeepSeek-style model runs on a 256-node cluster, AIC utilization is only 67%, and 39% of expert-routing communication latency is exposed on the critical computation path.
HyperParallel-MoE makes three key changes. First, it designs an AIV-driven one-sided write primitive, so that computation can start as soon as a data tile arrives instead of waiting for the entire batch to be assembled. Second, it introduces dependency-aware tile task generation, which places communication and computation operators under a single common abstraction. Third, a static scheduler pre-generates task sequences that drive both core types in parallel within a single kernel, and intermediate results are shared through the high-speed L2 cache, which avoids the latency of writing them back to and reading them from slower HBM.
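The effect of tile-level scheduling can be illustrated with a toy simulation. The sketch below is not the framework's actual scheduler: the task names (recv, gemm, act), the costs, and the greedy earliest-start heuristic are all assumptions made for illustration. It models each tile as a receive step on AIV, a matrix multiply on AIC, and a vector step on AIV, then pre-computes one in-order queue per core type and compares the resulting makespan with running every step back to back.

```python
from dataclasses import dataclass, field


@dataclass
class TileTask:
    name: str
    unit: str    # "AIC" (matrix) or "AIV" (vector and communication)
    cost: float  # hypothetical duration in arbitrary time units
    deps: list = field(default_factory=list)


def build_tasks(num_tiles, recv_cost=1.0, gemm_cost=2.0, act_cost=1.0):
    """Hypothetical per-tile pipeline: receive (AIV) -> GEMM (AIC) -> activation (AIV)."""
    tasks = []
    for i in range(num_tiles):
        tasks.append(TileTask(f"recv{i}", "AIV", recv_cost))
        tasks.append(TileTask(f"gemm{i}", "AIC", gemm_cost, [f"recv{i}"]))
        tasks.append(TileTask(f"act{i}", "AIV", act_cost, [f"gemm{i}"]))
    return tasks


def static_schedule(tasks):
    """Greedy list scheduling: repeatedly place the dependency-ready task
    that can start earliest on its own unit. Returns (plan, makespan)."""
    unit_free = {"AIC": 0.0, "AIV": 0.0}  # time at which each unit is next idle
    finish = {}                            # task name -> finish time
    plan = []                              # (name, unit, start, finish)
    pending = list(tasks)
    while pending:
        best = None
        for t in pending:
            if all(d in finish for d in t.deps):
                ready = max((finish[d] for d in t.deps), default=0.0)
                start = max(ready, unit_free[t.unit])
                if best is None or start < best[0]:
                    best = (start, t)
        if best is None:
            raise ValueError("cyclic or unsatisfiable dependencies")
        start, task = best
        finish[task.name] = start + task.cost
        unit_free[task.unit] = finish[task.name]
        plan.append((task.name, task.unit, start, finish[task.name]))
        pending.remove(task)
    return plan, max(finish.values())


def serial_makespan(tasks):
    """Baseline: one task at a time, so the two units never overlap."""
    return sum(t.cost for t in tasks)


if __name__ == "__main__":
    tasks = build_tasks(num_tiles=4)
    plan, overlapped = static_schedule(tasks)
    for name, unit, start, end in plan:
        print(f"{unit} {name}: {start:.1f} -> {end:.1f}")
    print("serial:", serial_makespan(tasks), "overlapped:", overlapped)
```

Because the plan is computed before execution, each core type only has to walk its own queue in order. The simulation leaves out the real costs of synchronization, cache capacity, and routing imbalance.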
Tests show that in a 64-node balanced-routing scenario, the core module responsible for expert computation (MoE-FFN) ran 1.49 to 1.58 times faster, which corresponds to a latency reduction of up to approximately 36% and a data processing speed increase of up to 58%. End to end, single-step training speed improved by 8% to 9%. This indicates that the real-world efficiency of Ascend depends not only on hardware specifications but also on whether the compiler and runtime can schedule the AIC and AIV cores efficiently.
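The relationship between the reported speedup factors and the latency and throughput percentages can be checked with a short calculation. This is a minimal sketch, assuming that "speedup" means old latency divided by new latency for a fixed amount of work; the function names are illustrative only.

```python
def latency_reduction(speedup: float) -> float:
    """Fraction of latency removed when the same work runs `speedup` times faster."""
    return 1.0 - 1.0 / speedup


def throughput_gain(speedup: float) -> float:
    """Fractional increase in work completed per unit time."""
    return speedup - 1.0


if __name__ == "__main__":
    for s in (1.49, 1.58):  # speedup range reported for MoE-FFN
        print(f"{s:.2f}x: latency -{latency_reduction(s):.1%}, "
              f"throughput +{throughput_gain(s):.1%}")
```

Under this assumption, the upper end of the range (1.58 times) gives a latency reduction a little under 37%, in line with the reported figure of approximately 36%, and a 58% throughput gain.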
