Closed · gevezex closed this issue 3 years ago
Further pre-training in domain is indeed a good approach. You can take BERTje and use the example code from Hugging Face to do additional masked language modeling pre-training on data from your domain, and then fine-tune on your task. This is an especially useful method if you have a relatively large amount of unlabeled in-domain (legal) documents.
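A minimal sketch of that continued pre-training step, using the masked language modeling (MLM) machinery in Hugging Face Transformers. In a real run you would load the actual BERTje checkpoint with `BertForMaskedLM.from_pretrained("GroNLP/bert-base-dutch-cased")` together with its tokenizer, and feed in your legal corpus; here a tiny randomly initialised BERT and a toy vocabulary stand in so the sketch runs self-contained, without a download. The hyperparameters and placeholder sentences are illustrative, not recommendations.

```python
import os
import tempfile

import torch
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling)

torch.manual_seed(0)

# Toy WordPiece vocabulary; a real run reuses BERTje's own tokenizer.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "de", "het", "wet", "artikel", "lid", "recht", "overeenkomst"]
vocab_file = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(vocab_file, "w") as f:
    f.write("\n".join(vocab))
tokenizer = BertTokenizerFast(vocab_file=vocab_file)

# Tiny config instead of the full 12-layer BERTje, purely to keep this fast.
config = BertConfig(vocab_size=len(vocab), hidden_size=32,
                    num_hidden_layers=2, num_attention_heads=2,
                    intermediate_size=64, max_position_embeddings=64)
model = BertForMaskedLM(config)

# Unlabeled in-domain sentences (placeholder legal-Dutch word salad).
sentences = ["de wet artikel lid", "het recht de overeenkomst",
             "artikel lid de wet", "de overeenkomst het recht"]
enc = tokenizer(sentences, padding=True, truncation=True, max_length=16)
features = [{"input_ids": i, "attention_mask": m}
            for i, m in zip(enc["input_ids"], enc["attention_mask"])]

# The collator randomly masks tokens and builds the MLM labels for us;
# 0.5 is exaggerated for this toy corpus (0.15 is the usual BERT value).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.5)
batch = collator(features)

# A few optimizer steps on the usual MLM objective. On real data you would
# run this for many epochs (e.g. via the Trainer or the run_mlm.py example
# script), then save with model.save_pretrained(...) before fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
model.train()
for _ in range(5):
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(float(loss))
```

After this stage, you fine-tune the saved checkpoint on the labeled downstream task exactly as you would fine-tune BERTje itself, just pointing `from_pretrained` at the domain-adapted weights instead.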
Hi Wietse,
Thanks for BERTje and the great amount of work. My question is: how hard would it be to further pre-train BERTje on, for example, Dutch legal documents? To my understanding, fine-tuning is what you do when you have a downstream task (for example classification). Would it be beneficial to pre-train BERTje on these legal documents first, or is it sufficient to fine-tune only for the downstream task? I can imagine that the sentences and tokens in a different domain could be quite different, and that it would therefore also help to do some pre-training on this set of documents.