zufuliu / notepad4

Notepad4 (Notepad2⨯2, Notepad2++) is a light-weight Scintilla based text editor for Windows with syntax highlighting, code folding, auto-completion and API list for many programming languages and documents, bundled with file browser plugin matepath.

Change Encoding-Detector to UCHARDET #334

Open ghost opened 3 years ago

ghost commented 3 years ago

Mozilla's (u)chardet would generate better results, so I would like the editor to switch to this encoding detector.

Uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible. Uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation. https://www.freedesktop.org/wiki/Software/uchardet/ https://github.com/PyYoshi/uchardet

I experienced the excellent recognition rate of UCHARDET in TextPro. I submitted this to Notepad3 in March 2019 and it was soon accepted: https://github.com/rizonesoft/Notepad3/issues/973 "With its training ability and its detection parameter in "%", UCHARDET is really superior !"

Thank you!

zufuliu commented 3 years ago

I'm not interested in an encoding detection library:

  1. UTF-8 is getting more and more popular, and is the default encoding used by Notepad2 and many other applications (including Windows Notepad on Win10). We already use very fast UTF-8 validation code (from https://github.com/zwegner/faster-utf8-validator and https://bjoern.hoehrmann.de/utf-8/decoder/dfa/).
  2. UTF-16 and UTF-32 (not supported, rarely used as storage format/encoding) files must begin with a BOM, so no detection is needed.
  3. For other legacy encodings, we currently default to the Windows ANSI code page (GetACP()), which is most likely the correct one.
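
The order described in the three points above can be sketched roughly as follows. This is an illustration in Python for brevity, not Notepad2's actual C/C++ code: the function name `guess_encoding` and its `ansi_codepage` parameter are hypothetical, and the `cp1252` default merely stands in for whatever GetACP() would return on a given machine.

```python
import codecs

# Hypothetical helper for illustration only; not taken from Notepad2.
def guess_encoding(data: bytes, ansi_codepage: str = "cp1252") -> str:
    """Return a Python codec name for `data` using BOM, then UTF-8, then ANSI."""
    # 1. A BOM settles the question without any statistical detection.
    #    (UTF-32 is ignored here, matching "not supported" above.)
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's utf-16 codec reads the BOM to pick byte order

    # 2. Strict UTF-8 validation: legacy-encoded text with non-ASCII bytes
    #    rarely happens to form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Otherwise assume the system ANSI code page (assumed value passed in).
    return ansi_codepage
```

For example, `guess_encoding("中文".encode("gbk"), ansi_codepage="gbk")` falls through to step 3, because those GBK bytes are not valid UTF-8. The fallback is only as good as the assumption that the file was written on a system with the same code page.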

ghost commented 3 years ago

Good!

zufuliu commented 3 years ago

The article at https://hsivonen.fi/chardetng/ says Firefox has used https://github.com/hsivonen/chardetng (Rust) since Firefox 73, and https://github.com/PyYoshi/uchardet has had no updates in the past year.

ghost commented 3 years ago

Thanks for reminding me! UCHARDET is probably the best Chinese character detector I have used. This is the reason I recommend it to you.

zufuliu commented 3 years ago

Because of its GPL licensing, the updated uchardet cannot be used with Notepad2 unless it is built as a DLL.

levicki commented 2 years ago

Mozilla's (u)chardet would generate better results...

Here is just one example where it wouldn't:

test.txt