Closed mgedmin closed 6 years ago
To detect the encoding, on Python 2 we use chardet and on Python 3 we use tokenize. See https://github.com/zestsoftware/zest.releaser/blob/6.13.5/zest/releaser/utils.py#L93-L108
In debug mode, I see this with Python 2:
$ prerelease -v
...
DEBUG: Checking CHANGES.rst
DEBUG: EUC-TW Taiwan prober hit error at byte 213
DEBUG: utf-8 confidence = 0.505
DEBUG: SHIFT_JIS Japanese confidence = 0.01
DEBUG: EUC-JP Japanese confidence = 0.01
DEBUG: GB2312 Chinese confidence = 0.01
DEBUG: EUC-KR Korean confidence = 0.01
DEBUG: CP949 Korean confidence = 0.01
DEBUG: Big5 Chinese confidence = 0.01
DEBUG: EUC-TW not active
DEBUG: windows-1251 Russian confidence = 0.01
DEBUG: KOI8-R Russian confidence = 0.01
DEBUG: ISO-8859-5 Russian confidence = 0.01
DEBUG: MacCyrillic Russian confidence = 0.01
DEBUG: IBM866 Russian confidence = 0.01
DEBUG: IBM855 Russian confidence = 0.01
DEBUG: ISO-8859-7 Greek confidence = 0.01
DEBUG: windows-1253 Greek confidence = 0.01
DEBUG: ISO-8859-5 Bulgairan confidence = 0.01
DEBUG: windows-1251 Bulgarian confidence = 0.01
DEBUG: TIS-620 Thai confidence = 0.0
DEBUG: ISO-8859-9 Turkish confidence = 0.512703226996
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: Detected encoding of CHANGES.rst with chardet: ISO-8859-1
Okay, that looks a bit weird. I see utf-8 at 0.505 and ISO-8859-9 Turkish a tiny bit higher at 0.512703226996. But I don't see the reported ISO-8859-1. I don't know where that one comes from. But at least utf-8 is not the best scoring encoding here, so I can imagine that this fails.
chardet also comes with a command line utility. Let's see:
$ chardetect CHANGES.rst
CHANGES.rst: ISO-8859-1 with confidence 0.73
So chardet really does detect it wrongly. Sounds like this bug: https://github.com/chardet/chardet/issues/138
When I try prerelease -v with Python 3, I get this:
DEBUG: Detected encoding of CHANGES.rst with tokenize: utf-8
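For context, here is a minimal sketch of why Python 3 reports utf-8 here. `tokenize.detect_encoding` looks for a BOM or a coding cookie in the first two lines and otherwise falls back to utf-8, so it never guesses from the byte statistics the way chardet does. The sample bytes are made up, and this is not the zest.releaser code itself.

```python
import io
import tokenize

# Hypothetical changelog content: UTF-8 bytes, no BOM, no coding cookie.
data = "Changelog\n=========\n\n- Fix naïve parsing.\n".encode("utf-8")

# detect_encoding() takes a readline callable and returns
# (encoding, lines_it_consumed).
encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
print(encoding)
```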
I wonder if it makes sense to reorder our detection code and first try looking for encoding hints before using chardet or tokenize.
I made PR https://github.com/zestsoftware/zest.releaser/pull/272 that does this. If you add a marker for the encoding at the top of CHANGES.rst it will work:
# -*- coding:utf-8
Wait, that gives a reStructuredText error: 'Inline emphasis start-string without end-string.' And the line would appear literally in the rendered HTML even if it worked... Try this:
.. # coding=utf-8
I think that is the best we can do, given that chardet gives us the wrong encoding.
Does this seem workable?
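A hint like the ones above could be picked up with something along these lines. This is only a sketch: the pattern and the `find_encoding_hint` name are assumptions loosely modelled on PEP 263, and the regex in the actual PR may differ.

```python
import re

# Hypothetical pattern, loosely modelled on PEP 263. It is deliberately not
# anchored to a leading '#', so the reST comment form also matches.
CODING_HINT = re.compile(rb"coding[:=]\s*([-\w.]+)")


def find_encoding_hint(data, max_lines=2):
    """Return the encoding named in the first lines of data (bytes), or None."""
    for line in data.splitlines()[:max_lines]:
        match = CODING_HINT.search(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

Because the pattern is so loose, an ordinary sentence containing "coding:" in the first two lines would also be treated as a hint, which is one reason to limit the search to the top of the file.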
Personally I'd prefer a [zest.releaser] config option in setup.cfg for overriding the magic charset detection bits. changelog_encoding = UTF-8 or something like that.
That would be a useful safety valve!
That could help.
It would not be only for the changelog though, but also for setup.py or whatever other file we read or write.
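So in our read_text_file function, the best order would probably be:
1. Look for an encoding hint in the file itself (# -*- coding:utf-8).
2. Look for an encoding setting in setup.cfg/.pypirc.
3. Fall back to tokenize/chardet.
I'll see what I can do.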
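The order described above could be sketched roughly like this. The `choose_encoding` name, its signature, and the hint pattern are assumptions for illustration, not the actual read_text_file implementation; only the tokenize fallback is shown, where Python 2 would use chardet instead.

```python
import io
import re
import tokenize

CODING_HINT = re.compile(rb"coding[:=]\s*([-\w.]+)")  # hypothetical pattern


def choose_encoding(data, configured=None):
    """Pick an encoding for data (bytes): file hint, then config, then detection."""
    # 1. An explicit hint in the first two lines of the file wins.
    for line in data.splitlines()[:2]:
        match = CODING_HINT.search(line)
        if match:
            return match.group(1).decode("ascii")
    # 2. Otherwise use the value from setup.cfg/.pypirc, passed in by the caller.
    if configured:
        return configured
    # 3. Otherwise fall back to detection (tokenize here; chardet on Python 2).
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding
```

Note that `tokenize.detect_encoding` raises SyntaxError when the first lines are not valid UTF-8 and carry no cookie, so a real implementation would likely need to handle that case.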
Done in PR #275. When I put this in setup.cfg (or ~/.pypirc) it works:
[zest.releaser]
encoding = utf-8
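For illustration, such a setting can be read with the standard library's configparser. This is a sketch under assumptions: `configured_encoding` is a made-up helper, and zest.releaser's own config handling may look different.

```python
import configparser

# Sample contents; in practice this would come from setup.cfg or ~/.pypirc.
SAMPLE_CFG = """\
[zest.releaser]
encoding = utf-8
"""


def configured_encoding(cfg_text):
    """Return the encoding option from the [zest.releaser] section, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # fallback=None covers both a missing section and a missing option.
    return parser.get("zest.releaser", "encoding", fallback=None)


print(configured_encoding(SAMPLE_CFG))
```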
Released with several other fixes in 6.14.0.
I was cutting the 2.17.0 release of irclog2html, and saw this:
Note how the changelog text was misidentified as Latin-1 and displayed wrong. The file itself is UTF-8:
Luckily the text is only displayed wrong, not actually written wrong into the file.
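A small sketch of what "displayed wrong, not written wrong" means in practice: decoding UTF-8 bytes as Latin-1 garbles the display but leaves the bytes untouched, so the text stays recoverable. The sample string is made up.

```python
text = "naïve café"                # made-up sample text
raw = text.encode("utf-8")         # the bytes as stored in the file
shown = raw.decode("latin-1")      # what a reader assuming Latin-1 displays
print(shown)

# The bytes were never modified, so the original text can be recovered.
recovered = shown.encode("latin-1").decode("utf-8")
```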
My locale settings are correct (locale charmap prints UTF-8). I'm using Python 2. prerelease --version prints prerelease: error: unrecognized arguments: --version, but anyway I just did a pip install -U zest.releaser so it's the latest.