wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
"What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models" (EMNLP Findings 2020)
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Training data availability #22

Closed: Kaleidophon closed this issue 3 years ago

Kaleidophon commented 3 years ago

Hey!

I wanted to ask whether the training data for BERTje is available anywhere; I didn't see it in the repo. Dankjewel for any help!

wietsedv commented 3 years ago

Sorry for the extremely slow reply. (GitHub should send reminders of unaddressed issues.)

We indeed did not release the training data, for both practical and legal reasons. As described in the paper, we used the following sources:

- Books
- TwNC (Twente News Corpus)
- SoNaR-500
- Web news
- Wikipedia

If you are just experimenting with large amounts of unlabeled data, I recommend starting with Wikipedia and SoNaR-500. If you want to train a model like BERTje from scratch, make sure you actually need to do so: for 99% of use cases, fine-tuning the off-the-shelf BERTje model is the better solution. See the sketches below.
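As a minimal sketch of the "just get unlabeled Dutch text" route, here is one way to load a Dutch Wikipedia dump via the Hugging Face `datasets` library; the `wikimedia/wikipedia` hub dataset and the `20231101.nl` snapshot name are assumptions, not something from this repo:

```python
# Sketch: loading a large unlabeled Dutch corpus for experimentation.
# Assumes the Hugging Face hub dataset "wikimedia/wikipedia" with the
# (assumed) Dutch snapshot config "20231101.nl".
from datasets import load_dataset

wiki_nl = load_dataset("wikimedia/wikipedia", "20231101.nl", split="train")
print(wiki_nl[0]["text"][:200])  # first 200 characters of the first article
```

And a minimal sketch of the fine-tuning route, assuming the published BERTje checkpoint on the Hugging Face hub (`GroNLP/bert-base-dutch-cased`); the two-label classification head is a hypothetical example task, not something prescribed by the authors:

```python
# Sketch: fine-tuning setup for BERTje with Hugging Face transformers.
# The two-label head stands in for a hypothetical downstream task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "GroNLP/bert-base-dutch-cased"  # BERTje on the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a toy Dutch sentence and run a forward pass; from here the model
# can be fine-tuned with any standard training loop or transformers.Trainer.
inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one logit per label
```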

Kaleidophon commented 3 years ago

Dankjewel, better late than never! ;-)