Closed davidavdav closed 3 years ago
It is non-trivial (and quite expensive) to retrain an uncased model, and I do not intend to do this right now. If you want to use BERTje with uncased data, I recommend you try some different approaches and see which one results in the best downstream performance. Things you can try:
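(The concrete list of options from this reply is not reproduced here. As a rough, hypothetical sketch of the general idea of reusing the cased model with lowercased input — assuming the GroNLP/bert-base-dutch-cased checkpoint on the Hugging Face hub and the transformers library — one could lowercase the text before, or as part of, tokenization and inspect how the cased WordPiece vocabulary copes with it:)

```python
# Hypothetical sketch, not one of the options listed above: feed lowercased
# input to the existing cased BERTje tokenizer and check how much the
# tokenization degrades. The model name assumes the Hugging Face hub checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

cased = "De Nederlandse Spoorwegen rijden vandaag niet."
uncased = cased.lower()

print(tokenizer.tokenize(cased))    # cased words tend to match whole vocab entries
print(tokenizer.tokenize(uncased))  # lowercased words may fall apart into more
                                    # word pieces, since the cased vocabulary
                                    # contains fewer lowercase variants

# Alternatively, let the tokenizer lowercase the input for you (do_lower_case
# is a standard BertTokenizer option; the vocabulary itself stays cased):
lc_tokenizer = BertTokenizer.from_pretrained(
    "GroNLP/bert-base-dutch-cased", do_lower_case=True
)
print(lc_tokenizer.tokenize(cased))
```

(Whichever variant you pick, the advice above still applies: the real test is downstream performance on the task at hand.)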
Thanks,
Option 3 might be the thing to do in my case.
Hello,
Thanks for sharing this work! Not being hindered by too much knowledge about BERT models: would it be difficult (for you) to train a bert-base-dutch-uncased model (more similar to its English counterpart), or is it in some way trivial for me to map an uncased tokenizer onto the cased tokenizer? The use case I have is post-processing automatic speech recognition output. We are used to building case-sensitive vocabularies in Dutch ASR (in contrast to the default approach in English), but with a BERT model as a back end to ASR, I believe case disambiguation should ultimately be solved there. Hence, on the input side, I am looking for a caseless Dutch tokenizer.
Thanks!