Kaleidophon closed this issue 3 years ago
Sorry for the extremely slow reply. (GitHub should send reminders of unaddressed issues.)
We did indeed not release the training data, for both practical and legal reasons. As you can read in the paper, we used the following sources:
If you are just experimenting with large amounts of unlabeled data, I recommend you first look at Wikipedia and SoNaR-500. If you want to train a model like BERTje from scratch, make sure that you actually need or want to do so. In 99% of cases, fine-tuning the off-the-shelf BERTje is enough.
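For reference, here is a minimal sketch of what loading the released model for fine-tuning could look like with the Hugging Face `transformers` library. It assumes the checkpoint is published on the Hub as `GroNLP/bert-base-dutch-cased` and that your task is sentence classification; `build_label_maps` and `load_bertje_classifier` are hypothetical helper names for illustration, not part of the BERTje repo.

```python
# Assumption: BERTje's Hub ID. Adjust if you use a different checkpoint.
MODEL_ID = "GroNLP/bert-base-dutch-cased"


def build_label_maps(labels):
    """Map each distinct label name to an integer id, and back."""
    unique = sorted(set(labels))
    label2id = {name: idx for idx, name in enumerate(unique)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def load_bertje_classifier(labels, model_id=MODEL_ID):
    """Load the tokenizer and a classification head on top of the pretrained encoder."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
    )
    return tokenizer, model


# Usage (downloads the checkpoint, needs transformers and torch):
# tokenizer, model = load_bertje_classifier(["negatief", "positief"])
# batch = tokenizer(["Wat een mooie dag!"], return_tensors="pt", truncation=True)
# logits = model(**batch).logits
```

The classification head is freshly initialized, so the model still has to be trained on your labeled data (for example with `transformers.Trainer`) before its predictions mean anything.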
Thank you, better late than never! ;-)
Hey!
I wanted to ask whether the training data for BERTje is available anywhere. I didn't see it in the repo. Thank you for any help!