wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020) "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Pretraining on a different domain #16

Closed · gevezex closed this issue 3 years ago

gevezex commented 3 years ago

Hi Wietse,

Thanks for BERTje and the great amount of work that went into it. My question is: how hard would it be to pretrain BERTje on, for example, Dutch legal documents? To my understanding, fine-tuning is what you do when you have a downstream task (for example, classification). Would it be beneficial to pretrain BERTje on these legal documents first, or is it sufficient to fine-tune only for the downstream task? I can imagine that the sentences and tokens could be different in a different domain, and that it would therefore also make sense to do some pretraining on this set of documents.

wietsedv commented 3 years ago

Further pre-training in domain is indeed a good approach. You can take BERTje and use the example code from Hugging Face to do additional masked language modeling pre-training on data in your domain, and then fine-tune on your task. This is an especially useful method if you have a relatively large amount of unlabeled in-domain (legal) documents.
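
As a rough illustration (not an official recipe), continued MLM pretraining with the Transformers `Trainer` could look like the sketch below. The corpus file `legal_corpus.txt` and the hyperparameters are placeholders; the model identifier `GroNLP/bert-base-dutch-cased` is BERTje's checkpoint on the Hugging Face Hub.

```python
# Minimal sketch: continued masked-language-model pretraining of BERTje
# on an unlabeled in-domain corpus, using Hugging Face Transformers/Datasets.
# "legal_corpus.txt" is a hypothetical plain-text file, one document per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "GroNLP/bert-base-dutch-cased"  # BERTje on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenize the unlabeled in-domain corpus.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator applies dynamic masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="bertje-legal",      # placeholder output directory
    per_device_train_batch_size=8,  # adjust to your GPU memory
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

The checkpoint saved in `bertje-legal` can then be fine-tuned on the downstream (e.g. classification) task like any other BERT model.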