According to monitoring by Dynaction Beating, Sakana AI, in collaboration with NVIDIA, has open-sourced a sparse data format named TwELL together with an accompanying acceleration kernel. The innovation lets GPUs skip the near-zero computations that contribute nothing when running large models. Without compromising model accuracy, the approach speeds up inference on the H100 by up to 30% and training by up to 24%, while significantly reducing peak memory usage.
The feed-forward network (FFN) layers of a large model account for the majority of its parameters and compute. In practice, however, during text generation more than 80% of FFN neurons are "dormant" (their activation values are close to zero) and contribute nothing to the final output. Skipping these neurons could save substantial compute, but modern GPUs are built to process dense matrices uniformly: with traditional sparse methods, gathering the scattered useful values incurs heavy overhead from repeated index lookups and memory reads, wiping out the computational savings.
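To make the idea concrete, here is a minimal PyTorch sketch (not Sakana AI's code; the layer sizes and the near-zero threshold are illustrative assumptions) that measures how many FFN activations end up effectively zero after ReLU and verifies that skipping the dormant neurons leaves the output unchanged. The per-token gather loop also hints at why naive skipping is slow on a GPU.

```python
import torch

# Hypothetical illustration: activation sparsity in a single FFN layer.
torch.manual_seed(0)

d_model, d_ff = 1024, 4096          # FFN typically expands the hidden size ~4x
W_in = torch.randn(d_ff, d_model) / d_model**0.5
W_out = torch.randn(d_model, d_ff) / d_ff**0.5
x = torch.randn(32, d_model)        # a batch of 32 token representations

# First projection plus ReLU; with random weights roughly half the outputs are
# zero, and the article reports over 80% near-zero in trained models.
h = torch.relu(x @ W_in.T)
active = h.abs() > 1e-6             # neurons that actually contribute
print(f"active neurons: {active.float().mean():.1%}")

# Dense path: multiply the full activation matrix regardless of zeros.
y_dense = h @ W_out.T

# Sparse path (per token): use only the columns of W_out for active neurons.
# The result is identical, but the scattered gathers are exactly the overhead
# that a GPU-friendly format is meant to remove.
y_sparse = torch.zeros_like(y_dense)
for i in range(h.shape[0]):
    idx = active[i].nonzero(as_tuple=True)[0]
    y_sparse[i] = h[i, idx] @ W_out[:, idx].T

print("max difference:", (y_dense - y_sparse).abs().max().item())
```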
The TwELL format is designed to break this hardware bottleneck. It matches the GPU's parallel execution model: instead of piecing together non-zero data scattered across memory, as traditional methods do, it divides the data into small tiles of the kind GPUs process best. Each GPU compute unit can then pack the useful data locally, avoiding time-consuming global memory reads and writes and slotting cleanly into the acceleration pipeline of modern chips.
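The report does not spell out TwELL's exact layout, so the NumPy sketch below only illustrates the general idea behind tile-wise, ELL-style packing: within each tile of rows, the non-zero columns are gathered into a small dense block plus a column-index array, so a thread block can work on contiguous data instead of chasing scattered entries. Tile size and names here are assumptions.

```python
import numpy as np

# Hypothetical sketch of tile-wise ELL-style packing (not the actual TwELL code).
TILE = 4  # rows handled together by one GPU thread block in this toy example

def pack_tiled_ell(mat: np.ndarray, tile: int = TILE):
    """Pack each tile of rows into a dense value block plus column indices."""
    tiles = []
    for r0 in range(0, mat.shape[0], tile):
        block = mat[r0:r0 + tile]
        # Columns that are non-zero anywhere in this tile are kept together,
        # so every row in the tile reads the same contiguous set of columns.
        cols = np.flatnonzero(np.any(block != 0, axis=0))
        tiles.append((cols, block[:, cols].copy()))
    return tiles

def tiled_ell_matvec(tiles, x: np.ndarray, tile: int = TILE):
    """Compute y = A @ x using only the packed non-zero columns of each tile."""
    y = np.zeros(len(tiles) * tile)
    for t, (cols, vals) in enumerate(tiles):
        y[t * tile: t * tile + vals.shape[0]] = vals @ x[cols]
    return y

# A sparse matrix standing in for an FFN weight slice with mostly dormant columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16)) * (rng.random((8, 16)) < 0.2)
x = rng.standard_normal(16)

tiles = pack_tiled_ell(A)
print(np.allclose(A @ x, tiled_ell_matvec(tiles, x)))  # True: same result, less data
```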
In a practical test on a 1.5-billion-parameter model, adding a light regularization term during training cut the proportion of neurons that actually need to be computed to under 2%, with no degradation on seven downstream tasks. The data also revealed a pattern: the larger the model, the more dormant neurons it contains (the non-zero ratio of a 2-billion-parameter model is 38% lower than that of a 500-million-parameter model). This suggests that as models continue to scale up, this hardware-centric optimization will deliver even greater performance gains.
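The report does not say which regularizer was used. One standard way to encourage dormant neurons, sketched below purely as an assumption, is to add an L1 penalty on the FFN activations to the training loss; the coefficient and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: an L1 activation penalty as one possible "slight
# regularization" that pushes more FFN activations toward zero during training.
L1_COEFF = 1e-4  # illustrative assumption, not a value from the report

class SparseFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        h = torch.relu(self.up(x))
        # Return the activations so the training loop can penalize their L1 norm.
        return self.down(h), h

model = SparseFFN()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(64, 512)
target = torch.randn(64, 512)

for _ in range(10):  # dummy training steps on synthetic data
    out, h = model(x)
    task_loss = nn.functional.mse_loss(out, target)
    sparsity_loss = L1_COEFF * h.abs().mean()   # nudges activations toward zero
    (task_loss + sparsity_loss).backward()
    opt.step()
    opt.zero_grad()

print(f"fraction of near-zero activations: {(h.abs() < 1e-3).float().mean():.1%}")
```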
