wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020: "What's so special about BERT's layers? A closer look at the NLP pipeline in monolingual and multilingual models")
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

bert-base-dutch-uncased #21

Closed davidavdav closed 3 years ago

davidavdav commented 3 years ago

Hello,

Thanks for sharing this work! Not being hindered by too much knowledge about BERT models: would it be difficult (for you) to train a bert-base-dutch-uncased model (more similar to its English counterpart), or is there some trivial way for me to map an uncased tokenizer onto the cased tokenizer?

The use case I have is post-processing automatic speech recognition output. We are used to building cased vocabularies in Dutch ASR (in contrast to the DEFAULT APPROACH IN ENGLISH), but with a BERT model as a backend to ASR, I believe case disambiguation should ultimately be solved there. Hence, on the input side, I am looking for a caseless Dutch tokenizer.

Thanks!

wietsedv commented 3 years ago

It is non-trivial (and quite expensive) to retrain an uncased model, and I do not intend to do this right now. If you want to use BERTje with uncased data, I recommend trying a few different approaches and seeing which one gives the best downstream performance. Things you can try:

  1. Just use BERTje as-is. It should not have too much trouble with fully uncased data.
  2. Heuristically add casing to your data. At least start each sentence with a capital.
  3. Uncase the BERTje vocabulary. You can download the model and modify the vocabulary. Many subword tokens already appear in the vocabulary in both uncased and partially cased versions. You can iterate through the vocabulary and lowercase each token that is not yet present as a fully lowercase token. (Do not lowercase everything, because then you will get unwanted duplicates.) See the sketch below.
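
A rough sketch of option 3 (untested; the file paths are placeholders and the special-token list is the usual BERT default). Rewriting `vocab.txt` line by line keeps the vocabulary size and token ids unchanged, so the file stays aligned with the model's embedding matrix:

```python
SRC = "bert-base-dutch-cased/vocab.txt"    # placeholder paths
DST = "bert-base-dutch-uncased/vocab.txt"

# Special tokens must keep their exact casing.
SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

with open(SRC, encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

present = set(vocab)
rewritten = []
for token in vocab:
    lower = token.lower()
    if token in SPECIAL or token.startswith("[unused"):
        # Never touch special or reserved tokens.
        rewritten.append(token)
    elif lower != token and lower not in present:
        # The lowercase form is missing: replace the cased token with it.
        rewritten.append(lower)
        present.add(lower)
    else:
        # Already lowercase, or the lowercase form exists elsewhere in the
        # vocabulary: keep the token as-is to avoid duplicate entries.
        rewritten.append(token)

with open(DST, "w", encoding="utf-8") as f:
    f.write("\n".join(rewritten) + "\n")
```

After rewriting the file, load the tokenizer with `do_lower_case=True` (e.g. `BertTokenizer.from_pretrained(path, do_lower_case=True)`) so that inputs are lowercased to match, possibly with `strip_accents=False`, since by default lowercasing also strips accents. Cased tokens whose lowercase form already existed keep their entries; they simply become unreachable for lowercased input.
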
davidavdav commented 3 years ago

Thanks,

Option 3 might be the thing to do in my case.