wietsedv / bertje

BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020) "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models"
https://aclanthology.org/2020.findings-emnlp.389/
Apache License 2.0

prepare-ud.py #18

Closed jwijffels closed 3 years ago

jwijffels commented 3 years ago

I'm looking into this model in order to finetune it for a NER task on 18th-19th century Dutch texts. While preparing my data (I do data preparation on Windows; finetuning will happen on Google Colab) and looking at the input structure your finetuning script requires, I ran the prepare-ud script. It gave the following error on Windows; I had to spin up an Ubuntu machine, where the code did work. I'm just putting this information here so you are aware of the encoding issue.

$ python finetuning/prepare/prepare-ud.py -i "C:\Users\Jan\Dropbox\Work\RForgeBNOSAC\OpenSource\UD_Dutch-LassySmall" -o "data"
C:\Users\Jan\Dropbox\Work\RForgeBNOSAC\OpenSource\UD_Dutch-LassySmall
data
 > Preparing NER data
Traceback (most recent call last):
  File "finetuning/prepare/prepare-ud.py", line 104, in <module>
    main()
  File "finetuning/prepare/prepare-ud.py", line 100, in main
    save_data(prepare_ud(args.in_path), args.out_path)
  File "finetuning/prepare/prepare-ud.py", line 36, in prepare_ud
    train = read_conllu(train_path)
  File "finetuning/prepare/prepare-ud.py", line 9, in read_conllu
    for line in f:
  File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 1287: character maps to <undefined>

wietsedv commented 3 years ago

Thanks for letting me know. I do not use Windows and the code is mainly here for reference and not literal usage, so I do not really intend to fix it.

But if the fix is obvious, I can fix it. What Python version are you using?

jwijffels commented 3 years ago

I ran this on Windows with the following Python version. It's really not crucial; I just wanted to understand the data structure the finetuning script expects as input.

$ python --version
Python 3.6.10 :: Anaconda, Inc.

wietsedv commented 3 years ago

Python 3.6 should work. I will close this issue, but if this is a real problem for someone else, feel free to reopen it.

andreasvc commented 3 years ago

The fix is to specify UTF-8 encoding everywhere a file is opened, or to set it as the default. Windows has a different default (cp1252) because compatibility is more important to them than Unicode. Another solution is not to use Windows: install Ubuntu with WSL, etc.
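
For anyone who hits this later, here is a minimal sketch of the explicit-encoding fix. The traceback only shows that `read_conllu` iterates over an opened file, so `read_lines` and `demo` below are hypothetical stand-ins, not the actual code in `prepare-ud.py`. The sample line is also made up: it contains "č", whose UTF-8 encoding ends in byte 0x8d, the same byte the error complains about.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Hypothetical reader: opens the file with an explicit encoding
    instead of relying on the platform default (cp1252 on Windows)."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


def demo():
    # Illustrative tab-separated line; "\u010d" is "č" (bytes C4 8D in UTF-8).
    text = "1\t\u010d\t_\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.conllu")
        with open(path, "wb") as f:
            f.write(text.encode("utf-8"))

        # Simulate the Windows default: byte 0x8d has no mapping in cp1252.
        try:
            read_lines(path, encoding="cp1252")
            cp1252_failed = False
        except UnicodeDecodeError:
            cp1252_failed = True

        # The same file reads cleanly once UTF-8 is requested explicitly.
        utf8_lines = read_lines(path)
    return cp1252_failed, utf8_lines
```

On Python 3.7 and later, UTF-8 mode (`PYTHONUTF8=1` or `python -X utf8`) is a way to change the default without touching the code. It is not available on the Python 3.6 used above, so passing `encoding="utf-8"` to `open` is the more portable option.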