Closed by Infernio 1 year ago
I'm not sure we actually want to do this after all:
Most of the unit test failures resulting from this are easily fixed and not a concern at all - they come down to the library simply doing a better job of detecting 'weird' encodings like Windows-932 - but this one is:
That demonstrates exactly the concern I had in the 'Disadvantages' section. For `decoder`, this library probably works pretty well, but for `encode` this doesn't really work at all, since it will happily write out nonsense like 'Russian encoded in Windows-932' and then correctly detect that weird encoding, which leads `encode` to think this is definitely the right encoding for the job.
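To make that failure mode concrete: Windows-932 (Shift-JIS) actually includes the Cyrillic block from JIS X 0208, so Russian text genuinely round-trips through it without error - meaning a detector that only sees the resulting bytes has no reason to object. A stdlib-only sketch (using Python's `cp932` codec alias for Windows-932):

```python
# Windows-932 / Shift-JIS carries the Cyrillic characters from
# JIS X 0208, so Russian text encodes to it without complaint.
russian = 'привет'  # "hello" in Russian

cp932_bytes = russian.encode('cp932')  # succeeds - no UnicodeEncodeError

# The bytes are perfectly valid cp932, so a charset detector looking
# at them alone can reasonably report cp932 - reinforcing the wrong
# choice the next time the file is read.
assert cp932_bytes.decode('cp932') == russian  # round-trips cleanly
```

So nothing in the byte stream itself flags 'Russian in cp932' as nonsense; only knowledge of what encoding the file *should* use would catch it.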
I'll just shelve this for now. Maybe once we've completely rethought our encoding handling for plugins - see #42 and #46 - we can take another look at this.
Ideally we'd have per-plugin encodings, with a command that explicitly performs the autodetection once and stores the result as the per-plugin encoding. That command could read every encoded string in the plugin with an unknown encoding, join them all with newlines into one long bytestring, and then call our charset detection library on that one long string. That would give the library much more data to work with for determining the encoding - which must be the same across all those plugin strings.
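As a rough sketch of that 'join everything, detect once, store the result' idea - here with a simple try-each-candidate loop standing in for the real detection library, and with the function name and candidate list being made up for illustration:

```python
# Hypothetical sketch: gather every undecoded string from a plugin,
# join them into one long bytestring, and pick the encoding once.
# A try-each-candidate fallback stands in for the detection library.
CANDIDATES = ('ascii', 'utf-8', 'cp1251', 'cp932')

def detect_plugin_encoding(raw_strings):
    """Return one encoding that decodes *all* of the plugin's strings."""
    blob = b'\n'.join(raw_strings)  # more data -> more reliable detection
    for enc in CANDIDATES:
        try:
            blob.decode(enc)
            return enc  # store this as the per-plugin encoding
        except UnicodeDecodeError:
            continue
    return None  # fall back to whatever default we use today
```

With the real library you'd replace the loop with a single detection call on `blob`; the point is just that concatenating all of a plugin's strings hands the detector far more material than any single record would, and guarantees the stored answer is consistent for the whole plugin.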
Yep, pretty much - another line of attack is PluginStr; working on that after #647 and setDefault (which spawned #647).
charset-normalizer claims to be more accurate and faster than chardet. It's used by some notable big projects already (e.g. requests).
Advantages:
Disadvantages:
Quote from the readme:
That might not gel well with our use case, especially when we then use the found encoding to write with afterwards. Would need to do some tests to see for sure.