Mask entire words, or just the word-parts (subword tokens)?
It seems like the word-parts?
Use dynamic subword masking by default. Whole-word masking (WWM) is a minor lever for English BPE/WordPiece encoders during topic-specific continued pretraining. Gains are task-dependent and usually small. Span masking often delivers clearer wins on span-centric tasks. Most improvement comes from more in-domain text and adequate training steps. (ar5iv)
What WWM changes
- WWM does not change the loss. It only changes which tokens get hidden. With BERT-WWM, all subpieces of a word are masked together, yet each subpiece is still predicted independently. This is a selection policy, not a new objective. (huggingface.co)
- The HF `DataCollatorForWholeWordMask` depends on BERT's `##` WordPiece convention. With other tokenizers it collapses to standard token masking unless you provide word boundaries yourself (see the sketch below). (huggingface.co)
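A minimal sketch of the built-in collator with a WordPiece tokenizer; the checkpoint name and example sentence are just illustrations.

```python
# Sketch: HF's built-in WWM collator with a WordPiece tokenizer. It relies on
# the "##" continuation prefix to group subpieces into words before sampling.
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# A rare word like "cardiomyopathy" typically splits into "##" pieces;
# WWM masks all of them together.
example = tokenizer("cardiomyopathy weakens the heart muscle")
batch = collator([example])
print(batch["input_ids"])  # masked inputs
print(batch["labels"])     # -100 everywhere except the masked positions
```

With a byte-level BPE tokenizer such as RoBERTa's there is no `##` marker, so this collator degrades to per-token masking unless you group pieces yourself.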
Why it can help
- It removes "leakage." Predicting a suffix is trivial if the stem is visible. Masking the whole word increases difficulty and pushes the model to use surrounding context. SpanBERT generalized this idea by masking contiguous spans and showed consistent gains on QA and coref. (ACL Anthology)
- WWM is proven viable. Google released English BERT checkpoints trained with WWM; they use the standard MLM objective with a different masking selector. (huggingface.co)
Why it is often low-impact for English BPE/WordPiece
- RoBERTa's improvements came from scale and dynamic masking, not from WWM. The authors also report they tried WWM and did not see benefit, and suggested exploring span-based masking instead. (arXiv)
- For domain adaptation, continued pretraining on in-domain text (DAPT/TAPT) explains most of the downstream gains. Masking strategy is a second-order knob. (arXiv)
- Effect sizes grow in languages with heavy multi-character words or explicit segmentation issues. Chinese WWM and MacBERT report stronger improvements than typical English setups. Recent Chinese encoders still adopt WWM variants. (arXiv)
Your three points, assessed
- "Most words are single tokens under RoBERTa BPE." Often true for frequent words. Hence WWM affects mainly rare or compound domain terms. RoBERTa itself relied on dynamic masking and scale. (ar5iv)
- "Masking half a word still teaches something." Correct, but the unmasked half leaks the answer. Span or WWM reduces this shortcut and forces contextual reasoning. (ACL Anthology)
- "The model thinks in tokens, not words." Correct. WWM only alters the token sampling pattern. The objective and architecture stay the same. (huggingface.co)
What to do on a topic-specific corpus
- Default recipe. Run DAPT/TAPT with dynamic masking (see the sketch after this list). Track downstream dev metrics, not only MLM loss. This yields most of the gain. (ACL Anthology)
- Try WWM when your jargon often splits into multiple subpieces, or when your task cares about exact spans (NER, extractive QA, coref). Expect modest gains. Compare directly. (ACL Anthology)
- Prefer span masking for span tasks. It targets the same leakage problem and shows stronger, repeatable improvements. (ACL Anthology)
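A minimal DAPT sketch under assumptions not in the original: a `roberta-base` starting checkpoint, a toy two-sentence corpus standing in for your in-domain text, and illustrative hyperparameters rather than a tuned recipe.

```python
# Sketch: continued pretraining (DAPT/TAPT-style) with dynamic masking.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Toy in-domain corpus; replace with your real dataset.
texts = [
    "Cardiomyopathy weakens the heart muscle.",
    "Stent thrombosis is a rare but serious complication.",
]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# Dynamic masking: masked positions are re-sampled every time a batch is built,
# so repeated passes over the same text see different masks.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt-roberta",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=1,
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=collator).train()
```

Evaluate on your downstream dev set after each run; MLM loss can improve while task metrics stay flat.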
Implementation notes for RoBERTa and other BPE tokenizers
- HF's built-in WWM collator is BERT-specific. For BPE tokenizers, supply word boundaries via `word_ids()` or offset mapping, or write a custom collator (see the sketch after this list). (GitHub)
- Keep word boundaries in the batch. Set `remove_unused_columns=False` so `word_ids` reach the collator when using `Trainer`. This is a common pitfall. (Hugging Face Forums)
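A sketch of one way to do WWM over a byte-level BPE tokenizer. Everything here is illustrative: it assumes each feature carries an extra `word_ids` column saved at tokenization time (e.g. `feature["word_ids"] = encoding.word_ids()`), and it omits BERT's 80/10/10 mask/random/keep split for brevity.

```python
# Sketch: whole-word masking collator for a BPE tokenizer, driven by word_ids.
import random
import torch

def wwm_collator(features, tokenizer, mlm_probability=0.15):
    batch_inputs, batch_labels = [], []
    for feat in features:
        input_ids = list(feat["input_ids"])
        word_ids = feat["word_ids"]  # saved at tokenization time; None for special tokens

        # Group token positions by the word they belong to.
        words = {}
        for pos, wid in enumerate(word_ids):
            if wid is not None:
                words.setdefault(wid, []).append(pos)

        # Mask whole words until roughly 15% of the tokens are covered.
        budget = max(1, round(len(input_ids) * mlm_probability))
        labels = [-100] * len(input_ids)
        covered = 0
        for wid in random.sample(list(words), len(words)):
            if covered >= budget:
                break
            for pos in words[wid]:
                labels[pos] = input_ids[pos]
                input_ids[pos] = tokenizer.mask_token_id  # 80/10/10 split omitted
                covered += 1

        batch_inputs.append(torch.tensor(input_ids))
        batch_labels.append(torch.tensor(labels))

    return {
        "input_ids": torch.nn.utils.rnn.pad_sequence(
            batch_inputs, batch_first=True, padding_value=tokenizer.pad_token_id),
        "labels": torch.nn.utils.rnn.pad_sequence(
            batch_labels, batch_first=True, padding_value=-100),
    }
```

With `Trainer`, pass it as `data_collator=lambda feats: wwm_collator(feats, tokenizer)` and set `remove_unused_columns=False` so the `word_ids` column actually reaches the collator.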
Pitfalls and checks
- Verify dynamic masking is enabled at training time. RoBERTa's recipe depends on it. Do not precompute static masks. (arXiv)
- Unit-test edge tokens such as numbers and hyphenated forms when grouping BPE pieces into words. Byte-level BPE can split punctuation in non-intuitive ways (a quick check is sketched below). (huggingface.co)
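A quick check along these lines, assuming `roberta-base`; swap in strings from your own corpus.

```python
# Print how tricky surface forms split and which word_ids each piece receives.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for text in ["state-of-the-art", "COVID-19", "3.5 mg/dL", "anti-TNF therapy"]:
    enc = tokenizer(text)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
    print(text, "->", list(zip(tokens, enc.word_ids())))
```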
Minimal ablation plan
- Baseline: DAPT with dynamic masking.
- Add WWM: group subpieces by `word_ids`. Keep the same overall 15% mask rate.
- Add span masking: contiguous spans with a similar token-level budget (see the sketch below).
- Fine-tune on your task. Compare F1 or EM, not just MLM loss.
This isolates WWM’s value on your corpus and task. (ACL Anthology)
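For the span-masking arm, here is a sketch following SpanBERT's sampling scheme (span lengths drawn from a geometric distribution with p = 0.2, clipped at 10), applied at the token level here for simplicity; the parameters are illustrative and should match the baseline's ~15% budget.

```python
# Sketch: sample contiguous spans to mask under a ~15% token budget.
import numpy as np

def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    rng = rng or np.random.default_rng()
    budget = max(1, round(seq_len * mask_budget))
    is_masked = np.zeros(seq_len, dtype=bool)
    # Keep adding spans until the budget is met (may overshoot slightly).
    while is_masked.sum() < budget:
        span_len = min(int(rng.geometric(p)), max_span)
        start = int(rng.integers(0, seq_len))
        is_masked[start:start + span_len] = True
    return is_masked

print(sample_span_mask(seq_len=64).astype(int))
```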
Curated references with context
Core papers
- RoBERTa. Dynamic masking and scale drive gains. Baseline for English encoders. (arXiv)
- SpanBERT. Span masking reduces leakage and improves QA and coref. Use when spans matter. (ACL Anthology)
- Don't Stop Pretraining. Most domain gains come from continued in-domain pretraining. Do this first. (ACL Anthology)
- Chinese WWM and MacBERT family. Clearer WWM benefits. Useful contrast if your domain has many multi-piece "words." (arXiv)
HF ecosystem and issues
- BERT-WWM model card. Confirms identical MLM loss and WWM as selection policy. (huggingface.co)
- Data collator docs. Note about the BERT `##` dependency and fallback behavior. (huggingface.co)
- HF forum tip. Keep `remove_unused_columns=False` to preserve `word_ids`. (Hugging Face Forums)
- Fairseq RoBERTa issue. Authors: WWM "didn't seem to help"; span masking suggested. (GitHub)
Bottom line
Treat WWM as an ablation, not a default. Use it when your domain terms split often or when spans are central. Otherwise stick to dynamic masking and invest effort in data, steps, and evaluation. (ar5iv)