Open MerlijnWajer opened 7 years ago
This is code in question (image/toc.py):
value = value.decode('string-escape').decode('latin-1')
This patch fixes it for me:
diff --git a/morituri/image/toc.py b/morituri/image/toc.py
index c83e940..7a5dab4 100644
--- a/morituri/image/toc.py
+++ b/morituri/image/toc.py
@@ -195,6 +195,15 @@ class TocFile(object, log.Loggable):
if m:
key = m.group('key')
value = m.group('value')
+
+ # Occasionally, CDRDAO will contain CD-TEXT that ends with a
+ # slash. This will case value.decode('string-escape') to fail
+ # with a ValueError. We could -catch- that exception, but it
+ # might be more clean to just strip the trailing slash, as that
+ # seems to be the main/only issue right now.
+ while value.endswith('\\'):
+ value = value[:-1]
+
# usually, value is encoded with octal escapes and in latin-1
# FIXME: other encodings are possible, does cdrdao handle
# them ?
Maybe we can just merge this? Seems useful and also harmless.
Maybe the double \
is there because of something related to this??
The 12 payload bytes contain pieces of zero terminated data. When double-byte text is used the zero is a double byte, otherwise it is a single ASCII NUL.
https://www.gnu.org/software/libcdio/cd-text-format.html#Pack-Contents
@MerlijnWajer
Found the explanation:
CD-Text can also use the Shift JIS double byte encoding which, I think, cdrdao supports (and we're not handling in whipper). It's not allowed to mix Shift JIS and IEC 8859-1 (latin-1).
The character code is defined as follows:
$00 = ISO/IEC 8859-1 (modified, see CD EXTRA Specification, appendix 1)
$01 = ISO/IEC 646, ASCII (7 bit)
$02 .. $7F = Reserved
$80 = Music Shift-JIS Kanji
$81 = Korean character code (to be defined)
$82 = Mandarin Chinese character code (to be defined)
$83 .. $FF = Reserved
The character code indicates the character set used to code the character strings of the
PACKs with ID1 = $80 through $85. Other PACKs shall have character code $00 (ISO 8859-1
modified).
What's the purpose of the string-escape
decoding step?
To fix this issue, I think we should just escape the trailing backslash character (if present).
Since cdrdao 1.2.5 (2023-02-03), by default the .toc file encodes strings in UTF-8 and backslash escape octal sequences are no longer used. The old behavior can still be accessed by passing --no-utf8
to cdrdao. However, the old behavior is undesirable because of a bug in the escape sequence generator: https://github.com/cdrdao/cdrdao/issues/32
Even if that bug is eventually fixed and whipper passed --no-utf8
, it must be noted that the no-utf8 mode encodes bytes, not characters. Therefore, this code in whipper is still wrong:
# usually, value is encoded with octal escapes and in latin-1
# FIXME: other encodings are possible, does cdrdao handle
# them ?
value = value.encode().decode('unicode_escape')
The 'unicode_escape' works only for ASCII and Latin-1, and only because their codepoints match exactly into Unicode codepoints. But CD-Text can also be encoded in MS-JIS (Shift-JIS) and treating those byte values as Latin-1 would produce mojibake.
I have several CDs that have possibly invalid CD-Text.
Whipper will raise this error when parsing it:
I added some debug to the toc parsing, and found that this is the value that it cannot parse properly:
(The TOC contains every slash once, not twice, but this is the representation of the string printed)
The simple fix would be to strip trailing slashes. I will do that now and test it.