wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. Paper (Findings of EMNLP 2020): "What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Pre-training code #7

Closed jvdzwaan closed 4 years ago

jvdzwaan commented 4 years ago

Thank you for creating BERTje and making it available! In the paper I read that you adjusted the pre-training tasks (Sentence Order Prediction, and masking whole words instead of individual word pieces). Would you be willing to share your pre-training code? I would like to continue pre-training on my own corpus.

wietsedv commented 4 years ago

For the pre-training data generation (SOP + whole-word masking) I used the ALBERT code without any significant modifications, as far as I remember: https://github.com/google-research/albert/blob/master/create_pretraining_data.py
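For reference, a minimal invocation sketch (not from the original post): all file paths and hyperparameter values are placeholders, and the flag names are the ones shared with the BERT version of the script, so check the ALBERT script's flag definitions (it adds SentencePiece-related options of its own) before running.

```bash
# Sketch: generate pre-training examples (sentence-pair + masked LM targets)
# with ALBERT's create_pretraining_data.py.
# Input: plain text, one sentence per line, blank line between documents.
# Paths, vocab, and hyperparameter values below are placeholders.
python create_pretraining_data.py \
  --input_file=./corpus/*.txt \
  --output_file=./tfrecords/pretrain.tfrecord \
  --vocab_file=./vocab.txt \
  --do_lower_case=False \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=10 \
  --random_seed=12345
```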

For the pre-training itself I used the code of the original BERT. The data produced by the ALBERT generation script is compatible with the BERT pre-training code, so this does not require modifications: https://github.com/google-research/bert/blob/master/run_pretraining.py
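And a corresponding sketch for the pre-training step, using the documented flags of `run_pretraining.py` from google-research/bert. Since you want to continue pre-training on your own corpus, you would point `--init_checkpoint` at an existing BERTje/BERT checkpoint; again, all paths and values are placeholders.

```bash
# Sketch: (continue) pre-training from an existing checkpoint on the
# TFRecords produced above. --max_seq_length and --max_predictions_per_seq
# must match the values used during data generation.
python run_pretraining.py \
  --input_file=./tfrecords/pretrain.tfrecord \
  --output_dir=./pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=./bert_config.json \
  --init_checkpoint=./bertje_checkpoint/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --learning_rate=2e-5
```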

Hope this helps!