According to 1M AI News monitoring, Apple machine learning research scientist Shuangfei Zhai has published a paper proposing "Exclusive Self-Attention" (XSA). The change is simple: in a standard Transformer, each token includes its own information when computing attention; XSA forcibly excludes the contribution from the token's own position, so it extracts information only from the surrounding context. The intuition is that a token already knows itself, and the value of the attention mechanism lies in informing it about its surroundings.
In experiments at scales up to 27 billion parameters, XSA consistently outperforms standard self-attention, and its advantage becomes more pronounced as sequence length increases. Zhai also authored the Attention Free Transformer (AFT) and continues to explore alternatives to the attention mechanism.
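The core idea as described above can be sketched in a few lines: mask out the diagonal of the attention score matrix before the softmax, so each token assigns zero weight to its own position. The following is a minimal illustrative sketch based only on the description in this article, not the paper's actual implementation; the function name and all variable names are assumptions.

```python
import numpy as np

def exclusive_self_attention(x, w_q, w_k, w_v):
    """Single-head attention with each token's own position masked out.

    x:   (seq_len, d_model) input sequence
    w_q, w_k, w_v: (d_model, d_head) projection matrices

    Illustrative sketch of the "exclusive" idea: set diagonal scores to
    -inf so the softmax gives zero weight to the token's own position.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (seq_len, seq_len)
    np.fill_diagonal(scores, -np.inf)              # exclude self-contribution
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1, diag is 0
    return attn @ v, attn

# Toy usage: 5 tokens, model width 8, head width 4
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
wq, wk, wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = exclusive_self_attention(x, wq, wk, wv)
```

In standard self-attention the `fill_diagonal` line would simply be absent; everything else is the usual scaled dot-product computation.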
