CD-Text cannot be parsed, raising ValueErrors

whipper-team / whipper

Python CD-DA ripper preferring accuracy over speed

GNU General Public License v3.0

1.14k stars 89 forks source link

CD-Text cannot be parsed, raising ValueErrors #169

Open MerlijnWajer opened 7 years ago

MerlijnWajer commented 7 years ago

I have several CDs that have possibly invalid CD-Text.

Whipper will raise this error when parsing it:

u"(ValueError('Trailing \\ in string',)

I added some debug to the toc parsing, and found that this is the value that it cannot parse properly:

Read: key: u'TITLE', value: u'\267|\246\250\245\'

(The TOC contains every slash once, not twice, but this is the representation of the string printed)

The simple fix would be to strip trailing slashes. I will do that now and test it.

MerlijnWajer commented 7 years ago

This is code in question (image/toc.py):

value = value.decode('string-escape').decode('latin-1')

MerlijnWajer commented 7 years ago

This patch fixes it for me:

diff --git a/morituri/image/toc.py b/morituri/image/toc.py
index c83e940..7a5dab4 100644
--- a/morituri/image/toc.py
+++ b/morituri/image/toc.py
@@ -195,6 +195,15 @@ class TocFile(object, log.Loggable):
             if m:
                 key = m.group('key')
                 value = m.group('value')
+
+                # Occasionally, CDRDAO will contain CD-TEXT that ends with a
+                # slash. This will case value.decode('string-escape') to fail
+                # with a ValueError. We could -catch- that exception, but it
+                # might be more clean to just strip the trailing slash, as that
+                # seems to be the main/only issue right now.
+                while value.endswith('\\'):
+                    value = value[:-1]
+
                 # usually, value is encoded with octal escapes and in latin-1
                 # FIXME: other encodings are possible, does cdrdao handle
                 # them ?

MerlijnWajer commented 6 years ago

Maybe we can just merge this? Seems useful and also harmless.

JoeLametta commented 6 years ago

Maybe the double \ is there because of something related to this??

The 12 payload bytes contain pieces of zero terminated data. When double-byte text is used the zero is a double byte, otherwise it is a single ASCII NUL.

https://www.gnu.org/software/libcdio/cd-text-format.html#Pack-Contents

JoeLametta commented 5 years ago

@MerlijnWajer

Found the explanation:

CD-Text can also use the Shift JIS double byte encoding which, I think, cdrdao supports (and we're not handling in whipper). It's not allowed to mix Shift JIS and IEC 8859-1 (latin-1).

The character code is defined as follows:

$00         = ISO/IEC 8859-1 (modified, see CD EXTRA Specification, appendix 1)
$01         = ISO/IEC 646, ASCII (7 bit)
$02 .. $7F  = Reserved
$80         = Music Shift-JIS Kanji
$81         = Korean character code (to be defined)
$82         = Mandarin Chinese character code (to be defined)
$83 .. $FF  = Reserved

The character code indicates the character set used to code the character strings of the
PACKs with ID1 = $80 through $85. Other PACKs shall have character code $00 (ISO 8859-1
modified).

JoeLametta commented 5 years ago

What's the purpose of the string-escape decoding step? To fix this issue, I think we should just escape the trailing backslash character (if present).

ntrrgc commented 1 week ago

Since cdrdao 1.2.5 (2023-02-03), by default the .toc file encodes strings in UTF-8 and backslash escape octal sequences are no longer used. The old behavior can still be accessed by passing --no-utf8 to cdrdao. However, the old behavior is undesirable because of a bug in the escape sequence generator: https://github.com/cdrdao/cdrdao/issues/32

Even if that bug is eventually fixed and whipper passed --no-utf8, it must be noted that the no-utf8 mode encodes bytes, not characters. Therefore, this code in whipper is still wrong:

# usually, value is encoded with octal escapes and in latin-1
# FIXME: other encodings are possible, does cdrdao handle
# them ?
value = value.encode().decode('unicode_escape')

The 'unicode_escape' works only for ASCII and Latin-1, and only because their codepoints match exactly into Unicode codepoints. But CD-Text can also be encoded in MS-JIS (Shift-JIS) and treating those byte values as Latin-1 would produce mojibake.