According to monitoring by Dongcha Beating, Nous Research has open-sourced Lighthouse Attention, a long-context pre-training mechanism. When processing 512K-length text on a single B200 GPU, its attention computation is about 17 times faster than the standard mechanism, and it delivers a 1.4 to 1.7 times end-to-end training speedup at a 98K context length.
The standard attention mechanism computes pairwise relationships between all tokens, so the compute required grows quadratically as the text gets longer. Lighthouse Attention takes a different approach: screen first, then refine. It quickly skims compressed summaries of the text at several levels of granularity, scores them to pick out the core segments, concatenates those segments into a much shorter sequence, and then processes that sequence directly with the efficient FlashAttention operator. Because the selection logic is fully decoupled from the kernel, developers neither have to write low-level kernel code nor add extra training objectives.
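To make the "screen first, then refine" idea concrete, below is a minimal single-level sketch in PyTorch. It is an illustration under assumptions, not Nous Research's implementation: the real mechanism works over multiple summary levels, and the function name, mean-pooling summaries, block size, and top-k value here are all placeholders. Context blocks are summarized by pooling, the summaries are scored against the queries, and dense attention runs only over the tokens of the top-scoring blocks, with PyTorch's built-in scaled_dot_product_attention standing in for the FlashAttention kernel.

```python
import torch
import torch.nn.functional as F

def select_then_attend(q, k, v, block_size=64, top_k=4):
    """Sketch of screen-first-then-refine attention (single level, non-causal).

    q: (heads, q_len, dim) queries for the current chunk
    k, v: (heads, kv_len, dim) full-context keys/values
    """
    heads, kv_len, dim = k.shape
    n_blocks = kv_len // block_size

    # 1) Compressed summaries: one pooled key per context block.
    k_blocks = k[:, : n_blocks * block_size].reshape(heads, n_blocks, block_size, dim)
    summaries = k_blocks.mean(dim=2)                                  # (heads, n_blocks, dim)

    # 2) Score the summaries against the (mean) query and keep the best blocks.
    q_mean = q.mean(dim=1, keepdim=True)                              # (heads, 1, dim)
    scores = (q_mean @ summaries.transpose(-1, -2)).squeeze(1)        # (heads, n_blocks)
    top_idx = scores.topk(min(top_k, n_blocks), dim=-1).indices       # (heads, top_k)

    # 3) Gather the selected blocks' tokens and run standard dense attention on
    #    the much shorter sequence; in the real system this step would call an
    #    optimized FlashAttention kernel.
    token_idx = (top_idx.unsqueeze(-1) * block_size +
                 torch.arange(block_size)).reshape(heads, -1)         # (heads, top_k*block)
    gather = token_idx.unsqueeze(-1).expand(-1, -1, dim)
    k_sel = torch.gather(k, 1, gather)
    v_sel = torch.gather(v, 1, gather)
    return F.scaled_dot_product_attention(q, k_sel, v_sel)

# Toy usage: 8 heads, 4K-token context, 128-token query chunk, 64-dim heads.
q = torch.randn(8, 128, 64)
k = torch.randn(8, 4096, 64)
v = torch.randn(8, 4096, 64)
out = select_then_attend(q, k, v)        # (8, 128, 64)
```

Because the selection happens entirely in this kind of framework-level code, the dense kernel it hands off to never needs to know that screening took place, which is what lets the approach reuse FlashAttention unchanged.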
Past acceleration schemes that took a similar approach often came with a side effect: once a model grows used to skimming, it can lose its original ability to read word by word. To avoid this pitfall, the research team runs most of pre-training in the accelerated mode and switches back to traditional full attention only for a short adaptation phase at the very end. In tests on a 5.3-billion-parameter model trained on 500 billion tokens, the model trained this way not only cut training time significantly but ultimately matched or even outperformed a baseline trained with the traditional method throughout.
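A rough sketch of that two-phase schedule is shown below. Everything in it is an assumption for illustration, including the helper names (set_attention_mode, train_step) and the length of the final full-attention window; the released recipe may divide the phases differently.

```python
# Two-phase schedule: sparse attention for most of training, then a brief
# full-attention adaptation window at the end.

def set_attention_mode(model, mode):
    # Placeholder: a real implementation would flip every attention layer
    # between the sparse "select-then-attend" path and dense full attention.
    model["attention_mode"] = mode

def train_step(model, batch):
    pass  # placeholder for forward/backward/optimizer update

model = {"attention_mode": "sparse"}
total_steps = 10_000
full_attn_steps = 200  # short adaptation window at the very end (assumed value)

for step in range(total_steps):
    mode = "sparse" if step < total_steps - full_attn_steps else "full"
    set_attention_mode(model, mode)
    train_step(model, batch=None)
```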
