jwijffels closed this issue 3 years ago
Thanks for letting me know. I do not use Windows and the code is mainly here for reference and not literal usage, so I do not really intend to fix it.
But if the fix is obvious, I can fix it. What python version are you using?
I ran this on Windows with the Python version below. It's really not crucial. I just wanted to understand the data structure of the input to the finetuning training.
$ python --version
Python 3.6.10 :: Anaconda, Inc.
Python 3.6 should work. I will close this issue, but if this turns out to be a real problem for someone else, feel free to reopen it.
The fix is to specify UTF-8 encoding explicitly everywhere a file is opened, or to set it as the default. Windows has a different default (cp1252 on Western European locales) because backward compatibility is more important to them than Unicode. Another solution is to not use Windows: install Ubuntu under WSL, for example.
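For anyone who wants to patch this locally, here is a minimal sketch of what "specify the encoding everywhere" could look like. It is not taken from the repository's prepare-ud script; the helper names `read_text_utf8` and `write_text_utf8` and the file name `sample.conllu` are hypothetical, and the sample sentence is made up for illustration.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text as UTF-8 instead of relying on the platform default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text_utf8(path):
    """Read text as UTF-8 instead of relying on the platform default."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical sample with non-ASCII characters, as found in Dutch text.
sample = "Eén geïllustreerd voorbeeld: café, coördinatie"

# Round-trip through a temporary file (hypothetical file name).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.conllu")
write_text_utf8(sample_path, sample)
round_trip = read_text_utf8(sample_path)
```

Without the `encoding="utf-8"` argument, `open()` falls back to the locale's preferred encoding, which is where the Windows behaviour comes from. The `PYTHONUTF8=1` environment variable can change that default without touching the code, but it was only added in Python 3.7, so it would not help on 3.6.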
I'm looking into this model in order to finetune it for a NER task on 18th-19th century Dutch texts. I'm on Windows for data preparation; the finetuning itself will happen on Google Colab. While preparing my data and looking at the structure your finetuning script requires as input, I ran the prepare-ud script. That gave me the following error on Windows, and I had to spin up an Ubuntu machine, where the code did work. Just putting this information here so you are aware of the encoding issue.