Hi all,
I tried to use the run_mlm.py script to reproduce the results of the bert-base-uncased model. However, my reproduced results are consistently lower than the ones reported by the Hugging Face team on this website.
| Task | Metric | Reported by Huggingface | Our reproduced result |
|---|---|---|---|
| CoLA | Matthews corr. | 56.53 | 47.92 |
| SST-2 | Accuracy | 92.32 | 87.56 |
| MRPC | F1/Accuracy | 88.85/84.07 | 82.03/80.97 |
| STS-B | Pearson/Spearman corr. | 88.64/88.48 | 82.45/82.76 |
| QQP | Accuracy/F1 | 90.71/87.49 | 88.23/86.12 |
| MNLI | Matched acc./Mismatched acc. | 83.91/84.10 | 82.34/83.01 |
| QNLI | Accuracy | 90.66 | 85.45 |
| RTE | Accuracy | 65.70 | 56.95 |
I suspect there is a problem with my setup. My experimental settings were:
(1) I used the code in run_mlm.py without any changes.
(2) I loaded the BookCorpus and Wikipedia datasets directly from the datasets library; the text is chunked into blocks of 512 tokens.
(3) I trained with batch size 256 for 1M steps, and also with batch size 8K for 50K steps. Both results are worse than the reported numbers.
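To make point (2) concrete, here is roughly how I chunk the tokenized text into 512-token blocks. This is a minimal sketch following the group_texts pattern in run_mlm.py, not the script's exact code; the function and variable names are my own.

```python
from itertools import chain

MAX_SEQ_LENGTH = 512  # block size I used for pretraining

def group_texts(examples):
    """Concatenate tokenized examples, then split into fixed-size blocks.

    `examples` maps column names (e.g. "input_ids") to lists of token-id lists,
    as produced by a batched tokenizer map. The trailing remainder shorter than
    MAX_SEQ_LENGTH is dropped, so every block has exactly MAX_SEQ_LENGTH tokens.
    """
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    total_length = (total_length // MAX_SEQ_LENGTH) * MAX_SEQ_LENGTH
    return {
        k: [t[i : i + MAX_SEQ_LENGTH] for i in range(0, total_length, MAX_SEQ_LENGTH)]
        for k, t in concatenated.items()
    }

# Tiny demo with fake token ids: 300 + 600 = 900 tokens -> one 512-token block
batch = {"input_ids": [list(range(300)), list(range(600))]}
blocks = group_texts(batch)
print(len(blocks["input_ids"]), len(blocks["input_ids"][0]))  # 1 512
```

In my runs this function is applied with `datasets.Dataset.map(..., batched=True)` over the concatenated BookCorpus and Wikipedia corpora.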
I would really appreciate it if you could provide a script I can use to reproduce BERT or RoBERTa. Thank you very much!