zopefoundation / Products.ZCatalog

Zope's indexing and search solution.
Other
5 stars 22 forks source link

"CompositeIndex" does not recognize reliably cases it cannot handle -- and performs wrong optimizations #59

Closed d-maurer closed 5 years ago

d-maurer commented 5 years ago

CompositeIndex analyses a query and rewrites it if the index thinks it can optimize the query. Due to the rewriting it is necessary that it recognizes reliably the cases were the rewrite leads to a semantically equivalent query.

Currently, CompositeIndex operates on the query provided keys only and does not access its (true) component indexes. As a consequence, it cannot optimize a subquery when the relevant keys depend on the true component index, as this is the case for "range" queries.

CompositeIndex can currently wrongly optimize queries containing "range" subqueries - as demonstrated by the example below. In #57, I have provided the new method get_combiner_keys_info which preprocesses a query (record) and determines the keys effectively used by the query. If CompositeIndex were ready to access the true composite index, then it could be used to remove the restriction.

CompositeIndex cannot optimize subqueries containing not. It knows this and tries to recognize the case -- but its condition is wrong (actually, it checks for a "pure not", not a "general not"). As a consequence, it usually produces wrong optimizations for subqueries containing "not" (ignoring the "not") -- example below. CompositeIndex could in principle optimize subqueries with "not". What it essentially does is replace "Or(ai) and Or(bj)" with "Or(ai and bj)". It could also replace "(Or(ai) and NA) and (Or(bj) and NB)" (where the "NA" and "NB" represent optional "not" parts) with "Or(ai and bj) and NA and NB". If not missing, "NA" would be a "pure not". Such "not"s are costly. However, in the case above, a "filtering not" would be sufficient (provided it comes after the big "Or") as we know that any relevant document in indexed by "A" (similar for "B"). To support this efficiently, we would need to introduce "filtering queries" and ensure they are applied after the "normal" queries (note: AdvancedQuery supports filtering queries).

The following example demonstrates buggy optimizations by CompositeIndex:

>>> from Products.PluginIndexes.CompositeIndex.CompositeIndex import CompositeIndex
>>> from pprint import pprint as pp
>>> 
>>> ci = CompositeIndex(
...   "ci",
...   extra=[
...     dict(id="fi", meta_type="FieldIndex", attributes=("fi",)),
...     dict(id="ki", meta_type="KeywordIndex", attributes=("ki",)),
...     ]
...   )
>>> 
>>> ci._length.change(1)  # avoid premature return of `make_query`
>>> 
>>> pp(
... ci.make_query(dict(fi=dict(query=(1,2,3)),
...                    ki=dict(query=(10, 11)),
...                    ))
... )
{'ci': {'query': ((('fi', 1), ('ki', 10)),
                  (('fi', 1), ('ki', 11)),
                  (('fi', 2), ('ki', 10)),
                  (('fi', 2), ('ki', 11)),
                  (('fi', 3), ('ki', 10)),
                  (('fi', 3), ('ki', 11)))}}
>>> 
>>> # The query below could be equivalent to the former one (if e.g.
... #  "fi" has keys 1, 2 and 3) but the range is ignored and only
... #  the two values 1 and 3 are taken into account
... pp(
... ci.make_query(dict(fi=dict(query=(1, 3), range="min:max"),
...                    ki=dict(query=(10, 11)),
...                    ))
... )
{'ci': {'query': ((('fi', 1), ('ki', 10)),
                  (('fi', 1), ('ki', 11)),
                  (('fi', 3), ('ki', 10)),
                  (('fi', 3), ('ki', 11)))}}
>>> 
>>> # In the query below, the "not" is ignored
... pp(
... ci.make_query(dict(fi=dict(query=(1,2,3)),
...                    ki={"query":(10, 11), "not":15},
...                    ))
... )
{'ci': {'query': ((('fi', 1), ('ki', 10)),
                  (('fi', 1), ('ki', 11)),
                  (('fi', 2), ('ki', 10)),
                  (('fi', 2), ('ki', 11)),
                  (('fi', 3), ('ki', 10)),
                  (('fi', 3), ('ki', 11)))}}
andbag commented 5 years ago
d-maurer commented 5 years ago

@andbag

* The query parameter "range" cannot be optmized by CompositeIndex and should not rewrite the query. The structure of the "composite key" does not support such queries.

You could have a look at "https://github.com/d-maurer/Products.ZCatalog/blob/resultset_intersection%2355/src/Products/PluginIndexes/unindex.py#L509". For an UnIndex, it transforms (among others) a range query into (the equivalent of) a normal or query. It essentially translates the range specification into the equivalent key set.

I do not pretend that CompositeIndex should optimize range queries - as the resulting key set may be huge. Nevertheless, it may be handy to remember that for an unspecialized index, a range query is simply an or query with a key set specified in an indirect way.

andbag commented 5 years ago

PR https://github.com/zopefoundation/Products.ZCatalog/pull/66 will fix the issue.