zhammer / finnegan-forever

A simple website that reads 40 characters of Finnegans Wake every four seconds, forever.
https://finneganforever.com
MIT License

UnicodeDecodeError on some passages #1

Closed zhammer closed 6 years ago

zhammer commented 6 years ago
>>> read_current_passage(scroll, 40, 4, 1523814743)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhammer/code/finnegan-forever/finnegan_forever/read_current_passage.py", line 14, in read_current_passage
    return scroll.read_passage(passage_offset, passage_size)
  File "/Users/zhammer/code/finnegan-forever/finnegan_forever/gateways/scroll_gateway.py", line 34, in read_passage
    return scroll_file.read(passage_size)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 0: invalid start byte
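The failure can be reproduced in isolation: decoding from a byte offset that lands on a UTF-8 continuation byte raises exactly this error. A minimal sketch, independent of the project code:

```python
# Seeking into the middle of a multi-byte UTF-8 sequence reproduces
# the error: 0xA9 below is a continuation byte, not a start byte.
data = "café".encode("utf-8")  # b'caf\xc3\xa9'
try:
    data[4:].decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte
```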
zhammer commented 6 years ago

passage offset is 692200

zhammer commented 6 years ago

The issue is seeking into a non-start UTF-8 byte. This actually exposes a bigger issue in the ScrollGateway: it reports the byte size of the file as its length in characters, and the two only match for pure-ASCII scrolls.
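For anything outside ASCII, byte length and character length diverge, so a byte offset computed from character arithmetic can land mid-character. A quick illustration (not project code; the sample string just mixes ASCII with two of the non-ASCII chars found below):

```python
text = "riverrun — Œ"          # 12 characters
utf8 = text.encode("utf-8")    # 15 bytes: the em dash and ligature are multi-byte
assert len(text) == 12
assert len(utf8) == 15
# In a fixed-width encoding the two lengths stay proportional:
utf32 = text.encode("utf-32-le")  # BOM-free, exactly 4 bytes per character
assert len(utf32) == 4 * len(text)
```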

I'm going to build a script that takes any file (only really needed for non-ascii files, since ScrollGateway's current implementation handles ascii scrolls fine) and converts it to utf32. The script then normalizes the utf32 data (https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) and removes all combining characters (https://docs.python.org/3/library/unicodedata.html#unicodedata.combining), printing each combining char it removes to stderr.

I want to somehow tie it into https://en.wikipedia.org/wiki/Sefer_Torah

zhammer commented 6 years ago

Info on all non-ascii chars in Finnegans Wake. (Newline accidentally included in the set.)

>>> import unicodedata
>>> from pprint import pprint as pp
>>> pp({c: unicodedata.name(c, 'None') for c in list(utf_orig)})
{'\n': 'None',
 '¤': 'CURRENCY SIGN',
 '·': 'MIDDLE DOT',
 'à': 'LATIN SMALL LETTER A WITH GRAVE',
 'á': 'LATIN SMALL LETTER A WITH ACUTE',
 'ã': 'LATIN SMALL LETTER A WITH TILDE',
 'é': 'LATIN SMALL LETTER E WITH ACUTE',
 'ì': 'LATIN SMALL LETTER I WITH GRAVE',
 'ó': 'LATIN SMALL LETTER O WITH ACUTE',
 'ô': 'LATIN SMALL LETTER O WITH CIRCUMFLEX',
 'þ': 'LATIN SMALL LETTER THORN',
 'Œ': 'LATIN CAPITAL LIGATURE OE',
 'Š': 'LATIN CAPITAL LETTER S WITH CARON',
 'Ÿ': 'LATIN CAPITAL LETTER Y WITH DIAERESIS',
 'ˆ': 'MODIFIER LETTER CIRCUMFLEX ACCENT',
 '\u2003': 'EM SPACE',
 '–': 'EN DASH',
 '—': 'EM DASH',
 '‘': 'LEFT SINGLE QUOTATION MARK',
 '’': 'RIGHT SINGLE QUOTATION MARK',
 '‚': 'SINGLE LOW-9 QUOTATION MARK',
 '“': 'LEFT DOUBLE QUOTATION MARK',
 '”': 'RIGHT DOUBLE QUOTATION MARK',
 '‡': 'DOUBLE DAGGER',
 '…': 'HORIZONTAL ELLIPSIS',
 '‹': 'SINGLE LEFT-POINTING ANGLE QUOTATION MARK'}
>>> pp({c: unicodedata.category(c) for c in list(utf_orig)})
{'\n': 'Cc',
 '¤': 'Sc',
 '·': 'Po',
 'à': 'Ll',
 'á': 'Ll',
 'ã': 'Ll',
 'é': 'Ll',
 'ì': 'Ll',
 'ó': 'Ll',
 'ô': 'Ll',
 'þ': 'Ll',
 'Œ': 'Lu',
 'Š': 'Lu',
 'Ÿ': 'Lu',
 'ˆ': 'Lm',
 '\u2003': 'Zs',
 '–': 'Pd',
 '—': 'Pd',
 '‘': 'Pi',
 '’': 'Pf',
 '‚': 'Ps',
 '“': 'Pi',
 '”': 'Pf',
 '‡': 'Po',
 '…': 'Po',
 '‹': 'Pi'}
>>> any(unicodedata.combining(c) for c in list(utf_orig))
False