According to DynaBeat monitoring, a team from MIT and Hugging Face has released the language diffusion model ELF (Embedded Language Flows). Instead of the GPT-style autoregressive "predict the next token" approach, it carries out text generation in a continuous embedding space, converting back to discrete tokens only at the final step.
Diffusion models are well established in image generation, but applying them to text has always been awkward: images are naturally continuous signals, while language is composed of discrete tokens. Previous continuous diffusion text models have mostly either re-introduced token-level supervision along the generation trajectory or required a separate, independently trained decoder. ELF's approach is cleaner: every step except the last denoises purely in continuous vector space, and only the final step discretizes, using a shared-weight network, as sketched below.
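To make the mechanism concrete, here is a minimal sketch of such a sampling loop, assuming a variance-preserving linear noise schedule and a weight-tied embedding table for the final discretization. The denoiser, shapes, schedule, and the nearest-neighbor readout are all illustrative stand-ins, not ELF's actual architecture.

```python
# Minimal sketch of continuous-space diffusion sampling for text.
# All names (Denoiser stand-in, embed, sample) and hyperparameters
# are assumptions for illustration, not ELF's real code.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ, STEPS = 32000, 256, 64, 32

embed = nn.Embedding(VOCAB, DIM)   # token <-> vector interface (tied weights)
denoiser = nn.Sequential(          # toy stand-in for the real denoising network
    nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM)
)

@torch.no_grad()
def sample():
    x = torch.randn(1, SEQ, DIM)               # start from pure Gaussian noise
    for t in range(STEPS, 0, -1):
        x0_hat = denoiser(x)                   # predict the clean embeddings
        sigma = (t - 1) / STEPS                # toy linear noise schedule
        # variance-preserving re-noising to the next (less noisy) level;
        # at t == 1, sigma == 0 and x collapses to the clean prediction
        x = (1 - sigma ** 2) ** 0.5 * x0_hat + sigma * torch.randn_like(x)
    # final step only: discretize against the shared embedding table
    logits = x @ embed.weight.T                # (1, SEQ, VOCAB)
    return logits.argmax(-1)                   # discrete token ids

tokens = sample()
print(tokens.shape)  # torch.Size([1, 64])
```

The property the sketch illustrates is that token identities never appear inside the loop: the vocabulary is consulted exactly once, at the very end.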
The experimental results are also impressive. On the OpenWebText unconstrained generation evaluation, the 105M-parameter ELF-B reaches a generative perplexity (Gen. PPL) of about 24.1 with 32-step sampling, outperforming a range of discrete and continuous diffusion language model baselines. More importantly, ELF-B used only about 45B training tokens, whereas the compared methods typically exceed 500B, roughly an order of magnitude fewer. At the very least, this suggests that continuous diffusion for language modeling is not fundamentally blocked by the "discreteness of language"; earlier difficulties more likely stemmed from the modeling interface and sampling design.
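For context on the headline metric: Gen. PPL is conventionally computed by scoring a model's unconditional samples with a separate pretrained language model, with lower values meaning more fluent samples. The sketch below uses GPT-2 as the judge, which is an assumption; ELF's exact judge model and evaluation protocol are not specified here.

```python
# Conventional Gen. PPL: perplexity that an independent pretrained LM
# assigns to generated samples. GPT-2 as judge is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
judge = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def gen_ppl(samples):
    """Average per-token perplexity of `samples` under the judge LM."""
    losses = []
    for text in samples:
        ids = tok(text, return_tensors="pt").input_ids
        # labels=input_ids -> next-token cross-entropy over the sample
        losses.append(judge(ids, labels=ids).loss)
    return torch.exp(torch.stack(losses).mean()).item()

print(gen_ppl(["The quick brown fox jumps over the lazy dog."]))
```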
