Hi @smalltoken, what is the issue with https://huggingface.co/blog/how-to-train ?
This colab should help you. It walks you through,
- How to to train tokenizer from scratch
- Create
RobertaModelusing the config - use the
DataCollatorForLanguageModeling, which handle the masking - and train using
Trainer.