whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

test_vector_unicode fails in Python 3.5.3 (Debian) #459

Open fortable1999 opened 7 years ago

fortable1999 commented 7 years ago

Original report by Simon McVittie (Bitbucket: smcv, GitHub: smcv).


test_vector_unicode fails under the current Python version in Debian. I'm fixing this during general QA work, so I don't know anything in particular about Whoosh, only that one of its tests fails:

=================================== FAILURES ===================================
_____ test_vector_unicode __
Traceback (most recent call last):
  File "/<>/.pybuild/pythonX.Y_3.5/build/tests/test_vectors.py", line 80, in test_vector_unicode
    assert vec[0][0] == u"\u13ac\u13ad\u13ae"
AssertionError: assert '\uab7c\uab7d\uab7e' == '\u13ac\u13ad\u13ae'
  - \uab7c\uab7d\uab7e
  + \u13ac\u13ad\u13ae
============================ pytest-warning summary ============================

The particular text used in that test uses Cherokee letters: for example, the first one used is U+13AC CHEROKEE LETTER GV.

Prior to Unicode 8.0, Cherokee was modelled as not having upper or lower case, but this was later decided to have been incorrect. Unicode 8.0 repurposed the existing Cherokee block U+13A0..U+13FF as upper-case Cherokee to reflect their appearance in existing fonts, and introduced new lower-case versions in the range U+AB70..U+ABBF. For example, U+AB7C CHEROKEE SMALL LETTER GV is the lower-case form of U+13AC.

When this test was written, it was presumably run against a Python version with pre-Unicode 8.0 tables, where the default LowercaseFilter leaves U+13AC intact: u"\u13ac".lower() == u"\u13ac". However, Python 3.5 has Unicode 8.0 tables, which result in u"\u13ac".lower() == u"\uab7c".
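To check which behaviour a given interpreter has, something like the following should work. It is only a sketch using the standard unicodedata module; the variable names are mine and nothing here comes from Whoosh itself.

```python
import unicodedata

# The Cherokee word that appears in the failing assertion.
word = u"\u13ac\u13ad\u13ae"
lowered = word.lower()

# unidata_version reports which Unicode database this interpreter ships with.
major = int(unicodedata.unidata_version.split(".")[0])

if major >= 8:
    # Unicode 8.0 and later map U+13AC..U+13AE to U+AB7C..U+AB7E.
    assert lowered == u"\uab7c\uab7d\uab7e"
else:
    # Older tables treat Cherokee as caseless, so lower() is a no-op.
    assert lowered == word

# Print the escaped form so the result is readable without a Cherokee font.
print(unicodedata.unidata_version, lowered.encode("unicode_escape").decode("ascii"))
```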

My proposed patch (attached) makes the test Python-version-independent by asserting that the word found in the frequency analysis is the result of lower(), whatever this Python version thinks that is.
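The idea behind the patch can be sketched roughly as below. This is not the attached patch and does not use Whoosh: lowercase_tokens is a hypothetical stand-in for an analyzer that ends in a lower-casing filter, and tokens[0] plays the role of vec[0][0] in the real test.

```python
def lowercase_tokens(text):
    """Hypothetical stand-in for an analyzer chain ending in a lowercase filter."""
    return [token.lower() for token in text.split()]


text = u"\u13ac\u13ad\u13ae"
tokens = lowercase_tokens(text)

# Brittle form: hard-codes what lower() returned under one Unicode version.
#   assert tokens[0] == u"\u13ac\u13ad\u13ae"

# Version-independent form: compare against whatever this interpreter's
# lower() produces for the same input.
expected = text.lower()
assert tokens[0] == expected
```

The trade-off is that the expected value is now computed by the same lower() the filter relies on, so the check confirms that lower-casing was applied but no longer pins down a specific result.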

fortable1999 commented 7 years ago

Original comment by Simon McVittie (Bitbucket: smcv, GitHub: smcv).


In fact this has already been fixed in 2.7.1 with commit "Fix the analyzer in test_vector_unicode() to not lowercase, since this makes the test fail on some Python versions"; so please ignore this report, unless you think my solution improves test coverage.

fortable1999 commented 7 years ago

Original comment by Simon McVittie (Bitbucket: smcv, GitHub: smcv).


I don't know anything in particular about Cherokee either, I'm getting all this from Google :-)