For the pre-training data generation (SOP + whole-word masking) I used the ALBERT code without any significant modifications, as far as I remember: https://github.com/google-research/albert/blob/master/create_pretraining_data.py
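A minimal sketch of how that step could be driven from Python. The paths and hyperparameter values below are hypothetical placeholders, and the exact set of flags (e.g. anything controlling whole-word masking) should be checked against the flag definitions at the top of the ALBERT script before running:

```python
# Sketch only: invoke ALBERT's create_pretraining_data.py to build TFRecords.
# All file paths and values are hypothetical placeholders.
import subprocess

cmd = [
    "python", "albert/create_pretraining_data.py",
    "--input_file=data/corpus.txt",           # hypothetical: one sentence per line, blank line between documents
    "--output_file=data/pretrain.tfrecord",   # hypothetical output path
    "--vocab_file=vocab.txt",                 # hypothetical vocabulary file
    "--do_lower_case=False",
    "--max_seq_length=512",
    "--max_predictions_per_seq=20",
    "--masked_lm_prob=0.15",
    "--dupe_factor=10",
]
subprocess.run(cmd, check=True)
```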
For the pre-training itself I used the original BERT code. The pre-training data generation and the BERT pre-training code are compatible, so this does not require modifications: https://github.com/google-research/bert/blob/master/run_pretraining.py
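And a corresponding sketch for launching BERT's run_pretraining.py on the records produced above. Again, paths and hyperparameter values are hypothetical placeholders; note that `max_seq_length` and `max_predictions_per_seq` should match the values used during data generation:

```python
# Sketch only: run BERT pre-training on the generated TFRecords.
# All file paths and values are hypothetical placeholders.
import subprocess

cmd = [
    "python", "bert/run_pretraining.py",
    "--input_file=data/pretrain.tfrecord",          # hypothetical: output of the data-generation step
    "--output_dir=models/pretrained",               # hypothetical checkpoint directory
    "--bert_config_file=config/bert_config.json",   # hypothetical model config
    "--do_train=True",
    "--do_eval=True",
    "--train_batch_size=256",
    "--max_seq_length=512",
    "--max_predictions_per_seq=20",
    "--num_train_steps=1000000",
    "--num_warmup_steps=10000",
    "--learning_rate=1e-4",
]
subprocess.run(cmd, check=True)
```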
Hope this helps!
Thank you for creating BERTje and making it available! In the paper I read that you adjusted the pre-training tasks (Sentence Order Prediction and whole-word masking instead of masking individual word pieces). Would you be willing to share your pre-training code? I would like to continue pre-training on my own corpus.