Turning BERT’s MLM Into a Text Diffusion Generator

Added Oct 20, 2025

The author shows that text diffusion is a natural extension of BERT’s masked language modeling, in which variable-rate masking and iterative denoising together form a generative procedure. By fine-tuning RoBERTa with a 10-step masking schedule and preserving a prompt prefix, the model generates coherent text after only a short training run. Results still lag GPT-2 in coherence and speed, but they demonstrate viability and leave room for further optimization.
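For concreteness, here is a minimal sketch of the variable-rate masking described above, assuming a HuggingFace-style RoBERTa setup in PyTorch. The function name mask_batch, the NUM_STEPS constant, and the use of -100 ignore labels are illustrative choices, not the author's exact code.

```python
import torch

MASK_ID = 50264          # <mask> id in roberta-base; verify against your tokenizer
NUM_STEPS = 10           # illustrative 10-step masking schedule

def mask_batch(input_ids: torch.Tensor, step: int, prefix_len: int = 0):
    """Corrupt a batch at rate step/NUM_STEPS, leaving the prompt prefix intact.

    Returns (corrupted_ids, labels); labels are -100 at unmasked positions so
    the standard MLM cross-entropy loss only scores the masked tokens.
    """
    rate = step / NUM_STEPS
    noise = torch.rand_like(input_ids, dtype=torch.float)
    mask = noise < rate
    mask[:, :prefix_len] = False              # never corrupt the prompt prefix

    labels = input_ids.clone()
    labels[~mask] = -100                      # ignore unmasked positions in the loss

    corrupted = input_ids.clone()
    corrupted[mask] = MASK_ID
    return corrupted, labels
```

Training across steps 1..NUM_STEPS exposes the model to every masking rate, which is the "variable-rate" generalization of BERT's single fixed-rate MLM objective.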

Key Points

  • Discrete language diffusion can be realized by variable-rate masking and iterative denoising, which generalizes BERT’s masked language modeling objective.
  • BERT/MLM is effectively a single diffusion step (fixed masking rate); training across multiple masking rates yields a full generative procedure.
  • A simple fine-tuning of RoBERTa with a 10-step masking schedule and a prompt-preserving prefix can generate coherent text blocks.
  • Inference proceeds by iteratively predicting masked tokens with top-k/top-p sampling and re-masking, analogous to diffusion reverse steps (see the sketch after this list).
  • While GPT-2 remains more coherent and faster, optimizations (e.g., AR-Diffusion, Skip-Step Diffusion) could narrow the gap; related prior work (DiffusionBERT) supports the approach.
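The iterative denoising loop referenced in the key points could look roughly like the sketch below, assuming a roberta-base checkpoint from the transformers library (fine-tuned as described). Only top-k sampling is shown for brevity; the generate function, GEN_LEN, and the linear re-masking schedule are assumptions for illustration, not the article's actual implementation.

```python
import torch
import torch.nn.functional as F
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base").eval()

NUM_STEPS = 10   # must match the training schedule
GEN_LEN = 48     # number of tokens to generate after the prompt

@torch.no_grad()
def generate(prompt: str, top_k: int = 50) -> str:
    prefix = tok(prompt, return_tensors="pt").input_ids[0, :-1]   # drop trailing </s>
    seq = torch.cat([prefix, torch.full((GEN_LEN,), tok.mask_token_id)])
    ids = seq.unsqueeze(0)
    gen_slice = slice(len(prefix), len(prefix) + GEN_LEN)

    for step in reversed(range(1, NUM_STEPS + 1)):
        logits = model(ids).logits[0, gen_slice]          # predict every generated position
        topk_vals, topk_idx = logits.topk(top_k, dim=-1)  # top-k sampling per position
        probs = F.softmax(topk_vals, dim=-1)
        picks = topk_idx.gather(-1, torch.multinomial(probs, 1)).squeeze(-1)
        ids[0, gen_slice] = picks

        # Re-mask a shrinking fraction of the generated span, mimicking reverse diffusion.
        remask_rate = (step - 1) / NUM_STEPS
        remask = torch.rand(GEN_LEN) < remask_rate
        ids[0, gen_slice][remask] = tok.mask_token_id

    return tok.decode(ids[0], skip_special_tokens=True)
```

Each pass fills in all masked positions, then re-masks fewer and fewer of them, so the prompt prefix stays fixed while the generated span is progressively denoised.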

Sentiment

The overall sentiment of the Hacker News discussion is highly positive and appreciative of the article's insightful connection between BERT-style MLM and text diffusion. While acknowledging prior related work, the community largely agrees with the core premise and finds the parallel to be clear and intuitive, contributing additional context rather than direct opposition.

In Agreement

  • The core idea that discrete text diffusion is essentially a generalized form of MLM, or that BERT is a single diffusion step, is a 'very cool parallel' that 'makes complete sense.'
  • Many commenters independently had similar thoughts about the connection between text diffusion and MLM when such models first emerged.
  • The flexibility of a transformer architecture to adapt to different generative objectives is seen as 'amazing.'
  • The write-up itself is appreciated as 'fun' and insightful.

Opposed

  • A stricter definition of 'text diffusion' would have the model learn to replace semantically incorrect tokens with correct ones, rather than only filling in masked tokens, which would better mirror the noise-resistance property of continuous diffusion models.
  • The connection between masked language modeling and diffusion, and generative uses of MLMs, has been noted in earlier academic work (1994, 2019, 2021, e.g. DiffusionBERT); this contextualizes the article's observation rather than directly opposing its technical findings.