zhammer closed this issue 6 years ago
passage offset is 692200
The issue is seeking to a byte that is not the start of a utf-8 sequence. This actually exposes a major issue in the ScrollGateway, which returns the byte size of a file as its length in chars.
I'm going to build a script that takes any file (though it should be non-ascii, since ScrollGateway should already handle ascii scrolls w/ the current implementation) and converts it to utf32. The script then normalizes the utf32 data (https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) and removes all combining characters (https://docs.python.org/3/library/unicodedata.html#unicodedata.combining), printing each combining char that's removed to stderr.
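A minimal sketch of the failure, not ScrollGateway's actual code: the sample string and the helper names `is_utf8_start_byte` and `align_to_char_start` are made up for illustration. It shows that the byte length and the char length diverge as soon as a non-ascii char appears, and that decoding from an offset that lands on a continuation byte fails.

```python
# Sample text with two non-ascii chars (EM DASH and RIGHT SINGLE QUOTATION MARK).
text = "riverrun \u2014 past Eve and Adam\u2019s"
data = text.encode("utf-8")

char_count = len(text)   # number of code points
byte_count = len(data)   # number of bytes on disk; larger for non-ascii text


def is_utf8_start_byte(b: int) -> bool:
    """Continuation bytes look like 10xxxxxx; anything else starts a sequence."""
    return (b & 0xC0) != 0x80


def align_to_char_start(data: bytes, offset: int) -> int:
    """Hypothetical helper: step back until offset sits on a start byte."""
    while 0 < offset < len(data) and not is_utf8_start_byte(data[offset]):
        offset -= 1
    return offset


# Byte 9 is the first byte of the em dash, so byte 10 is a continuation byte.
bad_offset = 10
try:
    data[bad_offset:].decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

safe_offset = align_to_char_start(data, bad_offset)
```

Stepping back to a start byte only papers over the seek; the length mismatch remains as long as the byte size is reported as the char count.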
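A rough sketch of what that script could look like, under some assumptions: the function names `normalize_and_strip` and `to_utf32` are hypothetical, the normalization form defaults to NFC (the plan above doesn't pick one), and the output uses `utf-32-le` so there is no BOM and every char is exactly 4 bytes.

```python
import sys
import unicodedata


def normalize_and_strip(text, form="NFC"):
    """Normalize text, then drop any remaining combining chars.

    Returns (cleaned_text, removed_chars). Each removed char is also
    reported on stderr. The default form "NFC" is an assumption.
    """
    normalized = unicodedata.normalize(form, text)
    kept = []
    removed = []
    for ch in normalized:
        # unicodedata.combining returns 0 for non-combining chars.
        if unicodedata.combining(ch):
            removed.append(ch)
            name = unicodedata.name(ch, "UNKNOWN")
            print(f"removed U+{ord(ch):04X} {name}", file=sys.stderr)
        else:
            kept.append(ch)
    return "".join(kept), removed


def to_utf32(text):
    """Encode as fixed-width utf-32 (little endian, no BOM)."""
    return text.encode("utf-32-le")
```

With a fixed-width encoding, char `i` lives at bytes `4*i` through `4*i + 3`, so byte size divided by 4 is the char count and a seek can't land mid-char as long as the offset is a multiple of 4.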
I want to somehow tie it into https://en.wikipedia.org/wiki/Sefer_Torah
Info on all non-ascii chars in Finnegans Wake. (Newline accidentally included in the set.)
>>> import unicodedata
>>> from pprint import pprint as pp
>>> pp({c: unicodedata.name(c, 'None') for c in list(utf_orig)})
{'\n': 'None',
'¤': 'CURRENCY SIGN',
'·': 'MIDDLE DOT',
'à': 'LATIN SMALL LETTER A WITH GRAVE',
'á': 'LATIN SMALL LETTER A WITH ACUTE',
'ã': 'LATIN SMALL LETTER A WITH TILDE',
'é': 'LATIN SMALL LETTER E WITH ACUTE',
'ì': 'LATIN SMALL LETTER I WITH GRAVE',
'ó': 'LATIN SMALL LETTER O WITH ACUTE',
'ô': 'LATIN SMALL LETTER O WITH CIRCUMFLEX',
'þ': 'LATIN SMALL LETTER THORN',
'Œ': 'LATIN CAPITAL LIGATURE OE',
'Š': 'LATIN CAPITAL LETTER S WITH CARON',
'Ÿ': 'LATIN CAPITAL LETTER Y WITH DIAERESIS',
'ˆ': 'MODIFIER LETTER CIRCUMFLEX ACCENT',
'\u2003': 'EM SPACE',
'–': 'EN DASH',
'—': 'EM DASH',
'‘': 'LEFT SINGLE QUOTATION MARK',
'’': 'RIGHT SINGLE QUOTATION MARK',
'‚': 'SINGLE LOW-9 QUOTATION MARK',
'“': 'LEFT DOUBLE QUOTATION MARK',
'”': 'RIGHT DOUBLE QUOTATION MARK',
'‡': 'DOUBLE DAGGER',
'…': 'HORIZONTAL ELLIPSIS',
'‹': 'SINGLE LEFT-POINTING ANGLE QUOTATION MARK'}
>>> pp({c: unicodedata.category(c) for c in list(utf_orig)})
{'\n': 'Cc',
'¤': 'Sc',
'·': 'Po',
'à': 'Ll',
'á': 'Ll',
'ã': 'Ll',
'é': 'Ll',
'ì': 'Ll',
'ó': 'Ll',
'ô': 'Ll',
'þ': 'Ll',
'Œ': 'Lu',
'Š': 'Lu',
'Ÿ': 'Lu',
'ˆ': 'Lm',
'\u2003': 'Zs',
'–': 'Pd',
'—': 'Pd',
'‘': 'Pi',
'’': 'Pf',
'‚': 'Ps',
'“': 'Pi',
'”': 'Pf',
'‡': 'Po',
'…': 'Po',
'‹': 'Pi'}
>>> any(unicodedata.combining(c) for c in list(utf_orig))
False