Mask entire words, or just the word-parts (subword tokens)?
It seems like the word-parts?
Use dynamic subword masking by default. Whole-word masking (WWM) is a minor lever for English BPE/WordPiece encoders during topic-specific continued pretraining. Gains are task-dependent and usually small. Span masking often delivers clearer wins on span-centric tasks. Most improvement comes from more in-domain text and adequate training steps. (ar5iv)
What WWM changes
- WWM does not change the loss. It only changes which tokens get hidden. With BERT-WWM, all subpieces of a word are masked together, yet each subpiece is still predicted independently. This is a selection policy, not a new objective. (huggingface.co)
- The HF `DataCollatorForWholeWordMask` depends on BERT's `##` WordPiece convention. With other tokenizers it collapses to standard token masking unless you provide word boundaries yourself (see the sketch below). (huggingface.co)
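A minimal sketch of the built-in collator with a WordPiece tokenizer; the checkpoint name and example sentence are just illustrations.

```python
# Sketch: HF's built-in WWM collator with a WordPiece tokenizer. It relies on
# the "##" continuation prefix to group subpieces into words before sampling.
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# A rare word like "cardiomyopathy" typically splits into "##" pieces;
# WWM masks all of them together.
example = tokenizer("cardiomyopathy weakens the heart muscle")
batch = collator([example])
print(batch["input_ids"])  # masked inputs
print(batch["labels"])     # -100 everywhere except the masked positions
```

With a byte-level BPE tokenizer such as RoBERTa's there is no `##` marker, so this collator degrades to per-token masking unless you group pieces yourself.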
Why it can help
- It removes "leakage." Predicting a suffix is trivial if the stem is visible. Masking the whole word increases difficulty and pushes the model to use surrounding context. SpanBERT generalized this idea by masking contiguous spans and showed consistent gains on QA and coref. (ACL Anthology)
- WWM is proven viable. Google released English BERT checkpoints trained with WWM; they use the standard MLM objective with a different masking selector. (huggingface.co)
Why it is often low-impact for English BPE/WordPiece
- RoBERTa's improvements came from scale and dynamic masking, not from WWM. The authors also report they tried WWM and did not see benefit, and suggested exploring span-based masking instead. (arXiv)
- For domain adaptation, continued pretraining on in-domain text (DAPT/TAPT) explains most of the downstream gains. Masking strategy is a second-order knob. (arXiv)
- Effect sizes grow in languages with heavy multi-character words or explicit segmentation issues. Chinese WWM and MacBERT report stronger improvements than typical English setups. Recent Chinese encoders still adopt WWM variants. (arXiv)
Your three points, assessed
- "Most words are single tokens under RoBERTa BPE." Often true for frequent words. Hence WWM affects mainly rare or compound domain terms. RoBERTa itself relied on dynamic masking and scale. (ar5iv)
- "Masking half a word still teaches something." Correct, but the unmasked half leaks the answer. Span or WWM reduces this shortcut and forces contextual reasoning. (ACL Anthology)
- "The model thinks in tokens, not words." Correct. WWM only alters the token sampling pattern. The objective and architecture stay the same. (huggingface.co)
What to do on a topic-specific corpus
- Default recipe. Run DAPT/TAPT with dynamic masking (see the sketch after this list). Track downstream dev metrics, not only MLM loss. This yields most of the gain. (ACL Anthology)
- Try WWM when your jargon often splits into multiple subpieces, or when your task cares about exact spans (NER, extractive QA, coref). Expect modest gains. Compare directly. (ACL Anthology)
- Prefer span masking for span tasks. It targets the same leakage problem and shows stronger, repeatable improvements. (ACL Anthology)
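A minimal DAPT sketch under assumptions not in the original: a `roberta-base` starting checkpoint, a toy two-sentence corpus standing in for your in-domain text, and illustrative hyperparameters rather than a tuned recipe.

```python
# Sketch: continued pretraining (DAPT/TAPT-style) with dynamic masking.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Toy in-domain corpus; replace with your real dataset.
texts = [
    "Cardiomyopathy weakens the heart muscle.",
    "Stent thrombosis is a rare but serious complication.",
]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# Dynamic masking: masked positions are re-sampled every time a batch is built,
# so repeated passes over the same text see different masks.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt-roberta",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    logging_steps=1,
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=train_dataset,
        data_collator=collator).train()
```

Evaluate on your downstream dev set after each run; MLM loss can improve while task metrics stay flat.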
Implementation notes for RoBERTa and other BPE tokenizers
- HF's built-in WWM collator is BERT-specific. For BPE tokenizers, supply word boundaries via `word_ids()` or offset mapping, or write a custom collator (see the sketch after this list). (GitHub)
- Keep word boundaries in the batch. Set `remove_unused_columns=False` so `word_ids` reach the collator when using `Trainer`. This is a common pitfall. (Hugging Face Forums)
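A sketch of one way to do WWM over a byte-level BPE tokenizer. Everything here is illustrative: it assumes each feature carries an extra `word_ids` column saved at tokenization time (e.g. `feature["word_ids"] = encoding.word_ids()`), and it omits BERT's 80/10/10 mask/random/keep split for brevity.

```python
# Sketch: whole-word masking collator for a BPE tokenizer, driven by word_ids.
import random
import torch

def wwm_collator(features, tokenizer, mlm_probability=0.15):
    batch_inputs, batch_labels = [], []
    for feat in features:
        input_ids = list(feat["input_ids"])
        word_ids = feat["word_ids"]  # saved at tokenization time; None for special tokens

        # Group token positions by the word they belong to.
        words = {}
        for pos, wid in enumerate(word_ids):
            if wid is not None:
                words.setdefault(wid, []).append(pos)

        # Mask whole words until roughly 15% of the tokens are covered.
        budget = max(1, round(len(input_ids) * mlm_probability))
        labels = [-100] * len(input_ids)
        covered = 0
        for wid in random.sample(list(words), len(words)):
            if covered >= budget:
                break
            for pos in words[wid]:
                labels[pos] = input_ids[pos]
                input_ids[pos] = tokenizer.mask_token_id  # 80/10/10 split omitted
                covered += 1

        batch_inputs.append(torch.tensor(input_ids))
        batch_labels.append(torch.tensor(labels))

    return {
        "input_ids": torch.nn.utils.rnn.pad_sequence(
            batch_inputs, batch_first=True, padding_value=tokenizer.pad_token_id),
        "labels": torch.nn.utils.rnn.pad_sequence(
            batch_labels, batch_first=True, padding_value=-100),
    }
```

With `Trainer`, pass it as `data_collator=lambda feats: wwm_collator(feats, tokenizer)` and set `remove_unused_columns=False` so the `word_ids` column actually reaches the collator.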
Pitfalls and checks
- Verify dynamic masking is enabled at training time. RoBERTa's recipe depends on it. Do not precompute static masks. (arXiv)
- Unit-test edge tokens such as numbers and hyphenated forms when grouping BPE pieces into words. Byte-level BPE can split punctuation in non-intuitive ways (a quick check is sketched below). (huggingface.co)
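A quick check along these lines, assuming `roberta-base`; swap in strings from your own corpus.

```python
# Print how tricky surface forms split and which word_ids each piece receives.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

for text in ["state-of-the-art", "COVID-19", "3.5 mg/dL", "anti-TNF therapy"]:
    enc = tokenizer(text)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
    print(text, "->", list(zip(tokens, enc.word_ids())))
```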
Minimal ablation plan
- Baseline: DAPT with dynamic masking.
- Add WWM: group subpieces by `word_ids`. Keep the same overall 15% mask rate.
- Add span masking: contiguous spans with a similar token-level budget (see the sketch below).
- Fine-tune on your task. Compare F1 or EM, not just MLM loss.
This isolates WWM’s value on your corpus and task. (ACL Anthology)
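For the span-masking arm, here is a sketch following SpanBERT's sampling scheme (span lengths drawn from a geometric distribution with p = 0.2, clipped at 10), applied at the token level here for simplicity; the parameters are illustrative and should match the baseline's ~15% budget.

```python
# Sketch: sample contiguous spans to mask under a ~15% token budget.
import numpy as np

def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    rng = rng or np.random.default_rng()
    budget = max(1, round(seq_len * mask_budget))
    is_masked = np.zeros(seq_len, dtype=bool)
    # Keep adding spans until the budget is met (may overshoot slightly).
    while is_masked.sum() < budget:
        span_len = min(int(rng.geometric(p)), max_span)
        start = int(rng.integers(0, seq_len))
        is_masked[start:start + span_len] = True
    return is_masked

print(sample_span_mask(seq_len=64).astype(int))
```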
Curated references with context
Core papers
- RoBERTa. Dynamic masking and scale drive gains. Baseline for English encoders. (arXiv)
- SpanBERT. Span masking reduces leakage and improves QA and coref. Use when spans matter. (ACL Anthology)
- Don't Stop Pretraining. Most domain gains come from continued in-domain pretraining. Do this first. (ACL Anthology)
- Chinese WWM and MacBERT family. Clearer WWM benefits. Useful contrast if your domain has many multi-piece "words." (arXiv)
HF ecosystem and issues
- BERT-WWM model card. Confirms identical MLM loss and WWM as selection policy. (huggingface.co)
- Data collator docs. Note about the BERT `##` dependency and fallback behavior. (huggingface.co)
- HF forum tip. Keep `remove_unused_columns=False` to preserve `word_ids`. (Hugging Face Forums)
- Fairseq RoBERTa issue. Authors: WWM "didn't seem to help"; span masking suggested. (GitHub)
Bottom line
Treat WWM as an ablation, not a default. Use it when your domain terms split often or when spans are central. Otherwise stick to dynamic masking and invest effort in data, steps, and evaluation. (ar5iv)