wrye-bash / wrye-bash

A swiss army knife for modding Bethesda games.
https://wrye-bash.github.io
GNU General Public License v3.0
455 stars 79 forks source link

Consider replacing chardet with charset-normalizer #651

Closed Infernio closed 1 year ago

Infernio commented 1 year ago

charset-normalizer claims to be more accurate and faster than chardet. It's used by some notable big projects already (e.g. requests).

Advantages:

Disadvantages:

Infernio commented 1 year ago

I'm not sure we actually want to do this after all:

  1. I had to specifically tell it to avoid 'big5' and 'big5hkscs', otherwise it would detect pretty much everything encoded in Windows-1251 as Big5. Seeing as Windows-1251 is very relevant to us (it's what the Russian version of the game uses), and Big5 isn't relevant at all, that's a problem.
  2. Most of the unit test failures resulting from this are easily fixed and not a concern at all - down to it just doing a better job of detecting 'weird' encodings like Windows-932 - but this one is:

    image

    That demonstrates exactly the concern I had in the 'Disadvantages' section. For decoder, this library probably works pretty well, but for encode this doesn't really work at all since it'll happily write out nonsense like 'Russian encoded in Windows-932', then correctly detect the weird encoding, which leads encode to think this is definitely the right encoding for the job.

I'll just shelve this for now. Maybe once we've completely rethought our encoding handling for plugins - see #42 and #46 - we can take another look at this.

Ideally we'd have per-plugin encodings, with a command that explicitly performs the autodetection once and stores the result as the per-plugin encoding. That option could then read every encoded string in the plugin with an unknown encoding, join them all with newlines into one long bytestring, then call our charset detection library on that one long string. That would give the library much more data to work with for determing the encoding - which must be the same between all those plugin strings.

Utumno commented 1 year ago

Yep pretty much - another line of attack is PluginStr - working on that after 647 and setDefault (which spawned 647)