wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020) "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Pretraining on a different domain #16

Closed · gevezex closed this issue 3 years ago

gevezex commented 3 years ago

Hi Wietse,

Thanks for BERTje and the great amount of work that went into it. My question is: how hard would it be to pretrain BERTje on, for example, Dutch legal documents? To my understanding, fine-tuning is what you do when you have a downstream task (for example, classification). Would it be beneficial to pretrain BERTje on these legal documents first, or is it sufficient to fine-tune only for the downstream task? I can imagine that the sentences and tokens could be different in a different domain, and that it would therefore also make sense to do some pretraining on this set of documents.

wietsedv commented 3 years ago

Further pre-training in domain is indeed a good approach. You can take BERTje and use the example code from Hugging Face to do additional masked language modeling pre-training on data in your domain, and then fine-tune on your task. This is an especially useful method if you have a relatively large amount of unlabeled in-domain (legal) documents.
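
As a rough illustration (not an official recipe), continued MLM pretraining with the Transformers `Trainer` could look like the sketch below. The corpus file `legal_corpus.txt` and the hyperparameters are placeholders; the model identifier `GroNLP/bert-base-dutch-cased` is BERTje's checkpoint on the Hugging Face Hub.

```python
# Minimal sketch: continued masked-language-model pretraining of BERTje
# on an unlabeled in-domain corpus, using Hugging Face Transformers/Datasets.
# "legal_corpus.txt" is a hypothetical plain-text file, one document per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "GroNLP/bert-base-dutch-cased"  # BERTje on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenize the unlabeled in-domain corpus.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator applies dynamic masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="bertje-legal",      # placeholder output directory
    per_device_train_batch_size=8,  # adjust to your GPU memory
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The checkpoint saved in `bertje-legal` can then be fine-tuned on the downstream (e.g. classification) task like any other BERT model.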