Charset confusion in prerelease

mgedmin commented 6 years ago

I was cutting the 2.17.0 release of irclog2html, and saw this:

$ prerelease
...
Changelog entries for version 2.17.0:

2.17.0 (unreleased)
-------------------

- Support `ii <https://tools.suckless.org/ii/>`_ logs which use Unix
  timestamps (https://github.com/mgedmin/irclog2html/pull/21).
  Pull request by CÃ©dric Krier.
Enter version [2.17.0]:

Note how the changelog text was misidentified as Latin-1 and displayed wrong. The file itself is UTF-8:

$ head -n 10 CHANGES.rst 
Changelog
=========

2.17.0 (2018-02-21)
-------------------

- Support `ii <https://tools.suckless.org/ii/>`_ logs which use Unix
  timestamps (https://github.com/mgedmin/irclog2html/pull/21).
  Pull request by Cédric Krier.

Luckily the text is only displayed wrong, not actually written wrong into the file.

My locale settings are correct (locale charmap prints UTF-8). I'm using Python 2. prerelease --version prints prerelease: error: unrecognized arguments: --version, but anyway I just did a pip install -U zest.releaser so it's the latest.

mauritsvanrees commented 6 years ago

To detect the encoding, on Python 2 we use chardet and on Python 3 we use tokenize. See https://github.com/zestsoftware/zest.releaser/blob/6.13.5/zest/releaser/utils.py#L93-L108

In debug mode, I see this with Python 2:

$ prerelease -v
...
DEBUG: Checking CHANGES.rst
DEBUG: EUC-TW Taiwan prober hit error at byte 213
DEBUG: utf-8  confidence = 0.505
DEBUG: SHIFT_JIS Japanese confidence = 0.01
DEBUG: EUC-JP Japanese confidence = 0.01
DEBUG: GB2312 Chinese confidence = 0.01
DEBUG: EUC-KR Korean confidence = 0.01
DEBUG: CP949 Korean confidence = 0.01
DEBUG: Big5 Chinese confidence = 0.01
DEBUG: EUC-TW not active
DEBUG: windows-1251 Russian confidence = 0.01
DEBUG: KOI8-R Russian confidence = 0.01
DEBUG: ISO-8859-5 Russian confidence = 0.01
DEBUG: MacCyrillic Russian confidence = 0.01
DEBUG: IBM866 Russian confidence = 0.01
DEBUG: IBM855 Russian confidence = 0.01
DEBUG: ISO-8859-7 Greek confidence = 0.01
DEBUG: windows-1253 Greek confidence = 0.01
DEBUG: ISO-8859-5 Bulgairan confidence = 0.01
DEBUG: windows-1251 Bulgarian confidence = 0.01
DEBUG: TIS-620 Thai confidence = 0.0
DEBUG: ISO-8859-9 Turkish confidence = 0.512703226996
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: Detected encoding of CHANGES.rst with chardet: ISO-8859-1

Okay, that looks a bit weird. I see utf-8 at 0.505 and ISO-8859-9 Turkish a tiny bit higher at 0.512703226996. But I don't see the reported ISO-8859-1. I don't know where that one comes from. But at least utf-8 is not the best scoring encoding here, so I can imagine that this fails.

chardet also comes with a command line utility. Let's see:

$ chardetect CHANGES.rst 
CHANGES.rst: ISO-8859-1 with confidence 0.73

So chardet really wrongly detects it. Sounds like this bug: https://github.com/chardet/chardet/issues/138

When I try prerelease -v with Python 3, I get this:

DEBUG: Detected encoding of CHANGES.rst with tokenize: utf-8

mauritsvanrees commented 6 years ago

I wonder if it makes sense to reorder our detection code and first try looking for encoding hints before using chardet or tokenize.

mauritsvanrees commented 6 years ago

I wonder if it makes sense to reorder our detection code and first try looking for encoding hints before using chardet or tokenize.

I made PR https://github.com/zestsoftware/zest.releaser/pull/272 that does this. If you add a marker for the encoding at the top of CHANGES.rst it will work:

# -*- coding:utf-8

Wait, that gives a ReStructuredText error: 'Inline emphasis start-string without end-string.' And the line would appear literally in the rendered html if it would work... Try this:

.. # coding=utf-8

I think that is the best we can do, given that chardet gives us the wrong encoding. Does this seem workable?

mgedmin commented 6 years ago

Personally I'd prefer a [zest.releaser] config option in setup.cfg for overriding the magic charset detection bits. changelog_encoding = UTF-8 or something like that.

reinout commented 6 years ago

That would be a useful safety valve!

mauritsvanrees commented 6 years ago

That could help. It would not be only for the changelog though, but also for setup.py or whatever other file we read or write. So in our read_text_file function, the best order would probably be:

Look for encoding hints in the file itself (# -*- coding:utf-8).
Look for an encoding setting in setup.cfg/.pypirc.
Try tokenize/chardet.

I'll see what I can do.

mauritsvanrees commented 6 years ago

Done in PR #275. When I put this in setup.cfg (or ~/.pypirc) it works:

[zest.releaser]
encoding = utf-8

mauritsvanrees commented 6 years ago

Released with several other fixes in 6.14.0.

zestsoftware / zest.releaser

Charset confusion in prerelease #264