whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

ISRI Arabic Stemmer broken on Python >= 3.6 #550

b2m closed this issue 4 years ago

b2m commented 4 years ago

The ISRI Arabic Stemmer (src/whoosh/lang/isri.py) does not work on Python >= 3.6.

Exception: re.error: bad escape \u at position 0. Reason: changed behavior of re.sub.

Changed in version 3.6: Unknown escapes in pattern consisting of '\' and an ASCII letter now are errors.

(quoted from https://docs.python.org/3/library/re.html)
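The same error message can be reproduced without Whoosh. This is a minimal sketch, not the code from isri.py: it assumes the failing call hands `re.sub` a raw-string `\u` escape as the replacement, and the pattern and word are made-up stand-ins. On Python 3.7 and later the first call is expected to raise `re.error`, while the non-raw string works because Python decodes the escape before `re` ever sees it.

```python
import re

# Hypothetical stand-ins for illustration; not taken from isri.py.
pattern = re.compile("[\u0623\u0625\u0622]")  # alef variants carrying hamza/madda
word = "\u0623\u0645\u0644"

try:
    # Raw string: re receives a literal backslash followed by "u0627"
    # and tries to interpret "\u" as a replacement-template escape.
    re.sub(pattern, r"\u0627", word)
    raw_error = None
except re.error as exc:
    raw_error = str(exc)

# Non-raw string: the replacement is already the single character U+0627,
# so the regex engine has no escape left to parse.
fixed = pattern.sub("\u0627", word)

print("raw replacement error:", raw_error)
print("fixed result code points:", [hex(ord(ch)) for ch in fixed])
```

If the stemmer's calls look like the first one, dropping the `r` prefix (or writing the literal Arabic character) would be one way to fix them; whether the submitted PR does exactly that is not shown here.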

Code snippet to reproduce:

from whoosh.analysis import LanguageAnalyzer

analyzer = LanguageAnalyzer(lang='ar')
[(token.text, token.stopped) for token in analyzer("This is a test")]

Code samples with bad escape sequences:

stevennic commented 4 years ago

Thanks Benjamin, good catch. I have submitted a PR to fix this.

nijel commented 4 years ago

Should be fixed by https://github.com/whoosh-community/whoosh/pull/557