wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
"What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models" (EMNLP Findings 2020)
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

Training data availability #22

Closed: Kaleidophon closed this issue 3 years ago

Kaleidophon commented 3 years ago

Hey!

I wanted to ask whether the training data for BERTje is available anywhere; I didn't see it in the repo. Dankjewel for any help!

wietsedv commented 3 years ago

Sorry for the extremely slow reply. (GitHub should send reminders of unaddressed issues.)

We indeed did not release the training data, for both practical and legal reasons. As described in the paper, we used the following sources:

- Books
- TwNC (Twente News Corpus)
- SoNaR-500
- Web news
- Wikipedia

If you are just experimenting with large amounts of unlabeled data, I recommend starting with Wikipedia and SoNaR-500. If you want to train a model like BERTje from scratch, make sure you actually need to do so: for 99% of use cases, fine-tuning the off-the-shelf BERTje model is the better solution. See the sketches below.
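As a minimal sketch of the "just get unlabeled Dutch text" route, here is one way to load a Dutch Wikipedia dump via the Hugging Face `datasets` library; the `wikimedia/wikipedia` hub dataset and the `20231101.nl` snapshot name are assumptions, not something from this repo:

```python
# Sketch: loading a large unlabeled Dutch corpus for experimentation.
# Assumes the Hugging Face hub dataset "wikimedia/wikipedia" with the
# (assumed) Dutch snapshot config "20231101.nl".
from datasets import load_dataset

wiki_nl = load_dataset("wikimedia/wikipedia", "20231101.nl", split="train")
print(wiki_nl[0]["text"][:200])  # first 200 characters of the first article
```

And a minimal sketch of the fine-tuning route, assuming the published BERTje checkpoint on the Hugging Face hub (`GroNLP/bert-base-dutch-cased`); the two-label classification head is a hypothetical example task, not something prescribed by the authors:

```python
# Sketch: fine-tuning setup for BERTje with Hugging Face transformers.
# The two-label head stands in for a hypothetical downstream task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "GroNLP/bert-base-dutch-cased"  # BERTje on the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a toy Dutch sentence and run a forward pass; from here the model
# can be fine-tuned with any standard training loop or transformers.Trainer.
inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]): one logit per label
```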

Kaleidophon commented 3 years ago

Dankjewel, better late than never! ;-)