zopefoundation / Products.ZCatalog

Zope's indexing and search solution.

Inefficient phrase searches in modern versions #131

Open d-maurer opened 2 years ago

d-maurer commented 2 years ago

Phrase searches use "WidCode"s (i.e. "WordInDex" codes). A "WidCode" is a string that represents a sequence of integers. The representation is particularly efficient if the integers are small; large integers may require up to 3 times the space of small integers.
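To illustrate why small integers are cheaper, here is a minimal sketch of a byte-aligned encoding with 7 data bits per byte. It is a simplified stand-in, not the code in `WidCode.py`; the names `encode_wid` and `encode_wids` are made up for this example, and matching a phrase by substring search on the encoded bytes is an assumption about how such codes are typically used.

```python
def encode_wid(wid: int) -> bytes:
    """Encode one non-negative integer using 7 data bits per byte.

    Only the first byte of each encoded integer has its high bit set,
    so the start of every integer is recognisable in a longer string.
    """
    if wid < 0:
        raise ValueError("wid must be non-negative")
    groups = [wid & 0x7F]
    wid >>= 7
    while wid:
        groups.append(wid & 0x7F)
        wid >>= 7
    groups.reverse()       # most significant 7-bit group first
    groups[0] |= 0x80      # mark the first byte
    return bytes(groups)


def encode_wids(wids) -> bytes:
    """Encode a sequence of word ids as one byte string."""
    return b"".join(encode_wid(w) for w in wids)


# Encoded size grows with the magnitude of the integer.
for wid in (5, 0x7F, 0x80, 0x3FFF, 0x4000):
    print(hex(wid), len(encode_wid(wid)))

# Assumed usage: a phrase matches if its code occurs in the document's code.
document = encode_wids([1, 3, 4, 9])
phrase = encode_wids([3, 4])
print(phrase in document)
```

In this sketch, values below 0x80 take one byte, values below 0x4000 take two, and values from 0x4000 up to 21 bits take three, which matches the "up to 3 times" ratio mentioned above.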

In former versions, Lexicon tried hard to assign small integers as word indices. In modern versions, the word index is chosen randomly -- avoiding the values for which the "WidCode" is particularly efficient.

The source comment https://github.com/zopefoundation/Products.ZCatalog/blob/e033d4cc464d0d613485b7914b8873a387e94af7/src/Products/ZCTextIndex/Lexicon.py#L145-L148 may indicate a reason: apparently, the author thought that values below 0x4000 must be avoided. However, https://github.com/zopefoundation/Products.ZCatalog/blob/e033d4cc464d0d613485b7914b8873a387e94af7/src/Products/ZCTextIndex/WidCode.py#L68-L72 shows that values below 0x4000 are precomputed and are therefore particularly efficient to compute.