zopefoundation / Products.ZCatalog

Zope's indexing and search solution.

sort-on multiple indexes is broken #81

Open volkerjaenisch opened 5 years ago

volkerjaenisch commented 5 years ago

When sorting on multiple indexes, the _sort_iterate_resultset sort algorithm has to be used (independent of the limit settings).

The limit calculation in Catalog.py (introduced in commit 939c8c4091bbaba8ae283f6e65a22b79de5ddefc)

def _sort_limit_arguments(self, query, sort_index, reverse, limit)

does not respect the possibility that limit is set to None in order to choose _sort_iterate_resultset.

Meanwhile, the algorithm choice code still (now uselessly) checks whether limit is None:

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

The n-best/n-best-reverse algorithms apply the limit on the first index only, and then sort on the additional indexes over the already limited result set, which leads to incorrect results.
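As a minimal sketch of the failure mode (plain Python with heapq, not the actual Catalog code): when there are ties on the first key at the limit boundary, limiting on the first index alone can keep the wrong rows.

    import heapq

    rows = [('att1', 3), ('att1', 1), ('att1', 2)]
    limit = 2

    # What nbest effectively does: limit on the first key only,
    # then sort the already-limited rows on the remaining keys.
    limited = heapq.nsmallest(limit, rows, key=lambda r: r[0])
    print(sorted(limited))               # [('att1', 1), ('att1', 3)] -- wrong

    # Correct: apply the limit to the full tuple key.
    print(heapq.nsmallest(limit, rows))  # [('att1', 1), ('att1', 2)]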

The fix for the issue is quite simple (just check whether we have more than one sort index):

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or second_indexes or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

I also noticed that some Plone packages, in particular plone.app.querystring, no longer even mention (and wrongly document in the code) the possibility of searching and sorting on more than a single index.

https://github.com/plone/plone.app.querystring/issues/92

icemac commented 5 years ago

Thank you for documenting the issue. Could you please create a pull request with your suggested solution?

volkerjaenisch commented 5 years ago

@icemac Thank you for taking this problem seriously.

I would love to send in a simple pull request solving this issue. But... IMHO the tests for ZCatalog, at least those in TestCatalogSortBatch, are complete nonsense. There is already a test case for this particular use case, and it passes.

From debugging the tests I learned the following:

So we have the following index test data (only att1 shown for simplicity), with a fixed value of 'att1':

    Obj1   : num : 73, att1 : 'att1'
    Obj3   : num : 1,  att1 : 'att1'
    Obj2   : num : 51, att1 : 'att1'
    ...
    Obj100 : num : 17, att1 : 'att1'

So it is no wonder that a test routine like

    def testSortLimitViaBatchingArgsRightMiddleSortOnTwoSecond(self):
        catalog = self._make_one()
        query = dict(att1='att1', sort_on=('att1', 'num'),
                     sort_order=('', 'reverse'), b_start=48, b_size=8)
        result = catalog(query)
        self.assertEqual(result.actual_result_count, 100)
        self.assertEqual([r.num for r in result], list(range(51, 43, -1)))

always passes.

query = dict(att1='att1', sort_on=('att1', 'num'), ...)

att1='att1' selects all entries. Sorting on 'att1' is a null operation, since the index holds exactly one value. Sorting on 'num' then yields a quite predictable result: list(range(51, 43, -1)). In fact, there is no real testing of multiple indexes at all.
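Concretely (a minimal illustration, not the test code): with a constant first key, the tuple sort degenerates into a single-key sort, so a bug in the multi-index path stays invisible.

    nums = [73, 1, 51, 17]
    # sort_on=('att1', 'num') with a constant att1 == sort_on='num'
    assert sorted(nums, key=lambda n: ('att1', n)) == sorted(nums)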

I modified the tests to make the problem more visible: https://github.com/Inqbus/Products.ZCatalog/tree/bettertests

I added a Dummy2 class:


    class Dummy2(Dummy):  # assuming the tests' existing Dummy as base class

        def __init__(self, num):
            self.num = num
            if isinstance(num, int) and (self.num % 10) == 0:
                self.ends_in_zero = True
            if isinstance(num, int) and self.num >= 50:
                self.att1 = 'other'

which assigns att1 = 'other' to all objects with num >= 50, so the test data becomes:

    Obj1   : num : 73, att1 : 'other'
    Obj3   : num : 1,  att1 : 'att1'
    Obj2   : num : 51, att1 : 'other'
    ...
    Obj100 : num : 17, att1 : 'att1'

I also introduced two new tests, which make use of this class:

    def test_sort_on_two_limit_10(self):
        catalog = self._make_one(cls=Dummy2)
        upper = self.upper
        a = catalog(sort_on=('att1', 'num'), att1='att1',  b_start=0, b_size=10)
        self.assertEqual(len(a), 10)
        for x in range(10):
            self.assertEqual(a[x].num, x)

    def test_sort_on_two_limit_50(self):
        catalog = self._make_one(cls=Dummy2)
        upper = self.upper
        a = catalog(sort_on=('att1', 'num'), att1='att1',  b_start=0, b_size=50)
        self.assertEqual(len(a), 50)
        for x in range(50):
            self.assertEqual(a[x].num, x)

These tests are identical in structure and differ only in the batch size b_size and the expected results: b_size=10 fails, b_size=50 works.

In both test cases the data is filtered by att1='att1', which gives us the objects with num below 50. The data is then sorted on index 'att1', which is a null operation, and next on 'num', which should give us the objects sorted on num 0..49. Finally, a slice should be taken to give us the appropriate number of objects: in the b_size=10 case we would expect the objects with num 0..9, in the b_size=50 case the objects with num 0..49. But this is only the case for b_size=50.
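The expected reference semantics, written out in plain Python (a sketch over made-up tuples, not the catalog machinery): filter, full tuple sort, then slice.

    # 100 records mirroring Dummy2: att1 flips to 'other' at num >= 50.
    records = [('att1' if n < 50 else 'other', n) for n in range(100)]

    hits = [r for r in records if r[0] == 'att1']   # query att1='att1'
    hits.sort()                                     # sort_on=('att1', 'num')
    batch = hits[0:10]                              # b_start=0, b_size=10

    assert [num for _, num in batch] == list(range(10))  # num 0..9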

I volunteer to fix these problems, but I need some help from the community.

Cheers, Volker

jugmac00 commented 5 years ago

"Never trust a test unless you have seen it fail" Tim Ottinger

Hi Volker,

great to see you want to contribute.

In order to contribute to any Zope or Plone project, you have to sign the contributor agreement: https://www.zope.org/developer/becoming-a-committer.html

Once you have done this (maybe you have already), it is a good idea to create a pull request - especially as you already wrote some code.

Creating a pull request triggers TravisCI to run all tests - so the reviewers already know whether the tests pass :-)

You wrote you need the help of the community. What exactly do you need help with?

I started contributing to the Zope ecosystem not that long ago and wrote a blog post about my experience and took notes of the many unwritten rules - maybe it helps: https://jugmac00.github.io/blog/face-to-face-with-earl-zope/

Best, Jürgen

volkerjaenisch commented 5 years ago

Dear @jugmac00! I sent in the agreement. Thanks for the starter link. I have contributed to Collective for a while, so I hope I can deal with Zope/Plone as well. Concerning the "help of the community": this code is one of the most crucial parts of Zope/Plone, and I do not want to break it more than it is already broken. I suppose the n-best/n-best-reverse algorithms were not designed to be broken for multiple indexes - they simply did not take them into account. So two of the three sort algorithms in ZCatalog are broken (for multiple indexes).

Cheers, Volker

jugmac00 commented 5 years ago

@hannosch is not very active lately, but others - especially @icemac - will have a look, given time and availability, and partly also personal interest.

I did not even know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation: https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html

Updating the documentation is something I can step in for - once this is fixed and I have some spare time.

volkerjaenisch commented 5 years ago

@jugmac00

I did not even know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation: https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html

Yep. That is the core of the problem: information loss. The original ZCatalog supported multiple indexes. This was forgotten, and the developers who evolved the code did not take this feature into account. At the end of the food chain, plone.app.querystring is broken in a way that it no longer supports multiple indexes at all. Therefore I urged a discussion with the originators of the code. Hanno's changes (which are great for the performance and the readability of the code) damaged the multi-index features of ZCatalog.

If we want to fix this mess:

Cheers, Volker

d-maurer commented 5 years ago

Jürgen Gmach wrote at 2019-8-18 15:40 -0700:

... I did not even know that you can sort on multiple indexes. From a quick scan, it does not seem to be mentioned in the documentation: https://zope.readthedocs.io/en/latest/zopebook/SearchingZCatalog.html

ZCatalog has learned this only fairly recently (much more recently than the Zope book).

volkerjaenisch commented 5 years ago

Dear Community!

I need some insight: What is/was the aim of sort_limit, exactly?

According to the Zope book, sort_limit is a hint to the catalog about how many datasets are expected. It may have acted as a shortcut to terminate the query once enough datasets were found. This optimization is correct if the result set is sorted on only one index (which, as @d-maurer stated, was the state of the Zope book: there was only one index to sort on).

Currently we have a more complex case:

  1. There was a time when sort_limit could be None. There are still some now useless if-statements in the code checking for sort_limit is None.
  2. In that case, I suppose, it was expected that all sort optimization would be futile and a full sort over the result set would be needed.
  3. In general, if more than one index is used, ALL result datasets have to be taken into account for sorting.
  4. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1), (2) and (3).

I propose for discussion:

  • The meaning of sort_limit has to be defined anew and well documented.
  • The default for sort_limit=None shall return the correct search/sort behavior for any combination of search or sort indexes.

I am currently working on advanced tests for ZCatalog to find the uncovered cases.

These tests are IMHO more readable

    def test_sort_on_two_reverse(self):
        catalog = self._make_one()

        reference = [
            ('geralt', 19, 'smith'),
            ('geralt', 14, 'meier'),
            ('fran', 9, 'lopez'),
            ('ethel', 18, 'smith'),
            ('ethel', 13, 'smith'),
            ('ethel', 12, 'meier'),
            ('ethel', 6, 'meier'),
            ('ethel', 1, 'lopez'),
            ('dolores', 11, 'lopez'),
            ('dolores', 10, 'lopez'),
            ('dolores', 7, 'smith'),
            ('cinder', 17, 'lopez'),
            ('cinder', 5, 'lopez'),
            ('cinder', 3, 'meier'),
            ('bert', 15, 'smith'),
            ('bert', 8, 'smith'),
            ('bert', 0, 'meier'),
            ('abel', 16, 'lopez'),
            ('abel', 4, 'meier'),
            ('abel', 2, 'meier')
        ]

        a = catalog(sort_on=('first', 'num'), all='all',
                    sort_order='reverse')
        self.check_result_limit(a, reference, self.upper, ('first', 'num'))

        for N in range(1, self.upper):
            a = catalog(sort_on=('first', 'num'), all='all',
                        sort_order='reverse', b_start=0, b_size=N)
            self.check_result_limit(a, reference, N, ('first', 'num'))

since the reference is explicitly given.
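(check_result_limit is a helper from the branch; a hypothetical reconstruction of what it has to verify might look like this:)

    def check_result_limit(self, result, reference, n, sort_keys):
        # Hypothetical sketch: the first n result rows must match the
        # first n reference rows on the given sort keys. Assumes the
        # reference rows are (first, num, last) tuples.
        got = [tuple(getattr(r, k) for k in sort_keys) for r in result]
        want = [ref[:len(sort_keys)] for ref in reference[:n]]
        self.assertEqual(got[:n], want)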

I do not like tests involving randomness. So I followed the second aphorism of the Zen of Python: "Explicit is better than implicit."

Some of these new tests are failing. This gives us a clean start for the work on the code.

I am not done with the transformation of the old test code to the more explicit form, so please stay tuned.

Cheers, Volker

d-maurer commented 5 years ago

Volker Jaenisch (PhD) wrote at 2019-8-20 16:45 -0700:

... What is/was the aim of sort_limit, exactly?

According to the Zope book, sort_limit is a hint to the catalog about how many datasets are expected. It may have acted as a shortcut to terminate the query once enough datasets were found. This optimization is correct if the result set is sorted on only one index (which, as @d-maurer stated, was the state of the Zope book: there was only one index to sort on).

Its main purpose is to reduce sorting time: If "sort_limit" is small, "nsmallest" is significantly faster than full sorting.

Currently we have a more complex case:

  1. There was a time when sort_limit could be None. There are still some now useless if-statements in the code checking for sort_limit is None.
  2. In that case, I suppose, it was expected that all sort optimization would be futile and a full sort over the result set would be needed.

The catalog has two optimization strategies for sorting:

1 using "sort_limit"; if "sort_limit" is not specified, full sorting is required

2 choose between sorting via the index (index small compared to the result set) or via the document values (index large compared to the result set). This is independent of "sort_limit".

  3. In general, if more than one index is used, ALL result datasets have to be taken into account for sorting.

I do not think so. In fact sorting over several indexes is almost identical to sorting over a single index: your sort values are just tuples rather than individual values.

The second optimization approach becomes a bit more complex, but it, too, can be adapted to several indexes (I do this in Products.AdvancedQuery).

  4. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1), (2) and (3).

I do not doubt that you have evidence that something currently breaks. But, in principle, it is correct to compute sort_limit as b_start + b_size (plus some small number to account for a slightly imprecise batch). Note that sort_limit specifies that you want at least that number of hits correctly sorted (and do not care about the sort order of further hits). Thus, if your batch shows hits n through m, then m is an appropriate value for sort_limit.
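The deduction itself is simple arithmetic (a sketch of the idea, not the _sort_limit_arguments implementation):

    # A batch showing hits b_start .. b_start+b_size-1 only needs the
    # first b_start+b_size hits correctly sorted; everything beyond the
    # window may stay unordered.
    b_start, b_size = 48, 8
    sort_limit = b_start + b_size   # 56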

I propose for discussion:

  • The meaning of sort_limit has to be defined anew and well documented.

If sort_limit is specified, then the first sort_limit elements of the result list must be the first elements of the search result sorted according to the sort specification; later elements can be delivered in any order.

This is used as a sort optimization. It allows the use of the nsmallest sort algorithm, which is considerably faster than full sorting for small n.
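For illustration (standard-library heapq, not the catalog's own code):

    import heapq

    data = list(range(1000000, 0, -1))
    # nsmallest runs in roughly O(len(data) * log(n)) versus
    # O(len(data) * log(len(data))) for a full sort -- a big win for small n.
    first_ten = heapq.nsmallest(10, data)  # the 10 smallest, correctly sorted
    # Everything beyond the first n may be delivered in any order.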

  • The default for sort_limit=None shall return the correct search/sort behavior for any combination of search or sort indexes.

Yes.

volkerjaenisch commented 5 years ago

@d-maurer Thank you for shedding some light on this dark piece of Zope and on the intentions behind the code.

Its main purpose is to reduce sorting time: If "sort_limit" is small, "nsmallest" is significantly faster than full sorting. The catalog has two optimization strategies for sorting:

1 using "sort_limit"; if "sort_limit" is not specified, full sorting is required

2 choose between sorting via the index (index small compared to the result set) or via the document values (index large compared to the result set). This is independent of "sort_limit".

OK. Currently 2 is broken, since it depends on sort_limit (in contradiction to your statement). The choice of the algorithm is made solely by looking at a (precalculated) limit:

Catalog.py: 644

        # Try to deduce the sort limit from batching arguments.
        b_start, b_size, limit, sort_report_name = self._sort_limit_arguments(
            query, sort_index, reverse, limit)

This is not merely a "try"! In _sort_limit_arguments, limit is always set to b_start + b_size; sort_limit=None is not respected.

Catalog.py:1000

        # Special first condition, as it changes post-processing.
        iterate_sort_index = (
            merge and limit is None and (
                rlen > (len(sort_index) * (rlen / 100 + 1))))

        # Choose one of the sort algorithms.
        if iterate_sort_index:
            sort_func = self._sort_iterate_index
        elif limit is None or (limit * 4 > rlen):
            sort_func = self._sort_iterate_resultset
        elif first_reverse:
            sort_func = self._sort_nbest
        else:
            sort_func = self._sort_nbest_reverse

The decision depends crucially on (sort_)limit.

  3. In general, if more than one index is used, ALL result datasets have to be taken into account for sorting.

I do not think so. In fact sorting over several indexes is almost identical to sorting over a single index: your sort values are just tuples rather than individual values.

Sorry, but "almost" is IMHO not good enough for a core piece of software.

The second optimization approach becomes a bit more complex but it, too, can be adapted to several indexes (I do in Products.AdvancedQuery).

Fine. Please help me to do this for ZCatalog.

  4. Currently sort_limit is calculated from the b_start and b_size parameters in any case. This breaks (1), (2) and (3).

I do not doubt that you have evidence that something currently breaks. But, in principle, it is correct to compute sort_limit as b_start + b_size (plus some small number to account for a slightly imprecise batch).

As stated before: IMHO this is in principle not correct. I can easily construct cases where this kind of "algorithm" is off by an arbitrary number of datasets. The choice of the "small number" you mention depends crucially on the nature of your data and is not a constant like Pi.

The question we have to ask ourselves is: do we want to rely on software that is 99% correct, or do we want software that is 100% reliable? Nothing against optimizations, but IMHO they have to be a user's choice to turn on.

Note that sort_limit specifies that you want at least that number of hits correctly sorted (and do not care about the sort order of further hits). Thus, if your batch shows hits n through m, then m is an appropriate value for sort_limit.

This optimization is absolutely correct IF sort_limit is specified, but absolutely wrong if sort_limit is None or not specified. IMHO, when sort_limit is not specified or None, a failsafe variant of the algorithm has to take place.

Ceterum censeo: the tests have to be improved and better documented. 100% test coverage says nothing if your test data is crap, and undocumented tests scare away users who try to help us find errors. Then the code has to be fixed to comply with the tests.

d-maurer commented 5 years ago

Volker Jaenisch (PhD) wrote at 2019-8-21 02:17 -0700:

@d-maurer Thank you for shedding some light on this dark piece of Zope and on the intentions behind the code.

Its main purpose is to reduce sorting time: If "sort_limit" is small, "nsmallest" is significantly faster than full sorting. The catalog has two optimization strategies for sorting:

1 using "sort_limit"; if "sort_limit" is not specified, full sorting is required

2 choose between sorting via the index (index small compared to the result set) or via the document values (index large compared to the result set). This is independent of "sort_limit".

OK. Currently 2 is broken, since it depends on sort_limit (in contradiction to your statement). The choice of the algorithm is made solely by looking at a (precalculated) limit:

Well, 2 can obviously make use of an additional limit after it has decided on the sorting approach. If implemented correctly, both approaches will give the same result. Thus, it is only a matter of efficiency, not correctness.

Catalog.py: 644

       # Try to deduce the sort limit from batching arguments.
       b_start, b_size, limit, sort_report_name = self._sort_limit_arguments(
           query, sort_index, reverse, limit)

This is not merely a "try"! In _sort_limit_arguments, limit is always set to b_start + b_size; sort_limit=None is not respected.

This might be justified (apart from forgetting special "batching" features, which mean that the batch size can vary slightly): if you use the catalog in a "batching context", then you are likely only interested in the "batching window" of the result list. From this, you can deduce a sort limit (even if it is not specified). If you are interested in a fully sorted result list, you likely do not have batching parameters.

Whether this argument is good depends highly on how the catalog determines the batching parameters. If it looks at explicitly provided arguments, this is okay; if it looks at the request (and thus at implicit arguments), it is doubtful, because a batching request might employ a secondary search not governed by the request parameters.

Catalog.py:1000

       # Special first condition, as it changes post-processing.
       iterate_sort_index = (
           merge and limit is None and (
               rlen > (len(sort_index) * (rlen / 100 + 1))))

       # Choose one of the sort algorithms.
       if iterate_sort_index:
           sort_func = self._sort_iterate_index
       elif limit is None or (limit * 4 > rlen):
           sort_func = self._sort_iterate_resultset
       elif first_reverse:
           sort_func = self._sort_nbest
       else:
           sort_func = self._sort_nbest_reverse

The decision depends crucially on (sort_)limit.

It may not fully exploit the optimization potential - but nothing bad should happen (as explained above).

  3. In general, if more than one index is used, ALL result datasets have to be taken into account for sorting.

Do you want to say that for sorting over several indexes "iterate_sort_index" becomes impossible? That would not be true (Products.AdvancedQuery does precisely this -- correctly, as far as I know).

Of course, "iterate_sort_index", too, uses the complete result -- just indirectly: instead of explicitly looking at the document values, it uses index information to deduce this value for many documents at the same time.
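A rough sketch of that strategy (hypothetical index structure, neither the ZCatalog nor the AdvancedQuery code): walk the index in key order and pick out the documents that belong to the result set; every hit is covered without ever touching an individual document value.

    def iterate_sort_index(index, result_ids):
        # 'index' is a hypothetical mapping: sort value -> set of doc ids.
        for value in sorted(index):
            # All documents under one key share the same sort value.
            for docid in sorted(index[value] & result_ids):
                yield docid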

I do not think so. In fact sorting over several indexes is almost identical to sorting over a single index: your sort values are just tuples rather than individual values.

Sorry, but "almost" is IMHO not good enough for a core piece of software.

I explained the "almost" in the following sentence: replace the single value by the value tuple; even a "core piece of software" can make use of this kind of similarity.

The second optimization approach becomes a bit more complex, but it, too, can be adapted to several indexes (I do this in Products.AdvancedQuery).

Fine. Please help me to do this for ZCatalog.

Look at Products.AdvancedQuery (you find a source distribution on PyPI).

Products.AdvancedQuery has no "sort_limit" (as it does sorting fully lazily), but it, too, uses the two approaches "iterate_index" and "iterate_document_values".

As stated before: IMHO this is in principle not correct.

I argued above that you are likely wrong.

I can easily construct cases where this kind of "algorithm" is off by an arbitrary number of datasets. The choice of the "small number" you mention depends crucially on the nature of your data and is not a constant like Pi.

If you perform a "batched search", then you should only be interested in the results in the "batch window" - and then the order of hits above this window is not relevant (which means you can set sort_limit to the first index above the window). Do not use batching if you are interested in results outside the "batch window".

The question we have to ask ourselves is: do we want to rely on software that is 99% correct, or do we want software that is 100% reliable? Nothing against optimizations, but IMHO they have to be a user's choice to turn on.

An optimization should still produce a correct result. If it does, then there is no need for users to turn it on (most users will not have sufficient knowledge to make a good decision about most optimizations).

IF the catalog determines the batching parameters implicitly (I fear it does), then I consider this a bug (as a batching request might perform secondary (not batched) searches).

... This optimization is absolutely correct IF sort_limit is specified, but absolutely wrong if sort_limit is None or not specified. IMHO, when sort_limit is not specified or None, a failsafe variant of the algorithm has to take place.

At least, I have no objection to dropping the implicit determination of a sort limit from implicit batching parameters.

volkerjaenisch commented 5 years ago

Added a pull request with better tests for the multiple-index cases: #82