Fix catalog plan for performance

mamico commented 2 years ago

In a large Plone website with a ZCatalog containing more than 400K objects, we found poor performance in catalog queries. This is probably not surprising.

But upon further analysis, I saw that many long-running queries were about searching on effectiveRange (type DateRangeIndex) https://github.com/zopefoundation/Products.ZCatalog/blob/5a3272f5e5f0429910ecdc91bf9e2702bc276e7d/src/Products/PluginIndexes/DateRangeIndex/DateRangeIndex.py#L282-L285

The problem is that DateRangIndex (but also other indexes) behaves very differently if it is called as first (with resultset equal to None) or not. The main goal of the catalog plan is to solve the above problem by ordering the queries to the indexes from fastest to slowest and executing the queries in that order. So, I was surprised that sometimes the effectiveRange was executed first, resulting in a large performance penalty.

This is what I hypothesized: suppose we have a search with path (type PathIndex) and effectiveRange (type DateRangeIndex). In the first time, without a plan, the indexes are sorted alphabetically, so effectiveRange is the first one executed. In subsequent searches, the plan, correctly, sorts the indexes and the path query is executed first, followed by the slower effectiveRange.

But if, at some point, the path query returns no results, effectiveRange will not be executed https://github.com/zopefoundation/Products.ZCatalog/blob/5a3272f5e5f0429910ecdc91bf9e2702bc276e7d/src/Products/ZCatalog/Catalog.py#L638-L639 and in the query plan, for an index that does not have a benchmark in the current search, the new benchmark saved will be (0, 0, False) (i.e., 0s duration, 0 hits and the last boolean means that the index does not filter results) https://github.com/zopefoundation/Products.ZCatalog/blob/5a3272f5e5f0429910ecdc91bf9e2702bc276e7d/src/Products/ZCatalog/plan.py#L306-L307 With this benchmark effectiveRange becomes the first index used in the next search, resulting in poor performance.

In the proposed implementation, I changed this last behavior: if the search was not performed for an index, the benchmark used remains the previous one, if it exists.

I think there is still more room for improvement in the code, studying more sophisticated query planners. But, of course, adding these elements will also add more complexity.

In the meantime, in all the real-world situations where I have tested the implementation proposed here, I have experienced a terrific improvement in performance.

p.s. For easily try the code here, I have just released a package https://pypi.org/project/experimental.catalogplan/ with a monkey patch that implements the same changes proposed here.

davisagli commented 2 years ago

@icemac @dataflake Could we have a new release, please?

mamico commented 2 years ago

@icemac @dataflake Could we have a new release, please?

Please, just a sec, I've another PR that, after review, we can add, or not, to the next release. I will submit in the next hours.

icemac commented 2 years ago

I just created a release, see https://pypi.org/project/Products.ZCatalog/6.3/.

zopefoundation / Products.ZCatalog

Fix catalog plan for performance #138