wellcomecollection / catalogue-api

:crystal_ball: The API for searching the Wellcome Collection catalogue.
https://developers.wellcomecollection.org
MIT License
4 stars 0 forks source link

The self sub aggregation in a paired aggregation should have size=n, where n is the size of the list of filter terms #712

Closed paul-butcher closed 12 months ago

paul-butcher commented 1 year ago

Currently, the self sub aggregation is a copy of the main terms aggregation to which it corresponds, operating on a differently filtered set but with some changes:

Given that the purpose of this sub aggregation is to return the filter terms if they are absent from the otherwise filtered query, the bucket count should be constrained to the number of terms requested.

paul-butcher commented 1 year ago

This is probably blocked by https://github.com/wellcomecollection/catalogue-pipeline/issues/2436

paul-butcher commented 12 months ago

The problem here is that because we use labels in the filter, and it's possible for two distinct concepts to have the same label, we cannot guarantee that only that number of buckets is required.

That cannot be fixed until either:

paul-butcher commented 12 months ago

However, we could make an assumption that the labels are unique, and possibly force them to be if they are not

This may result in the document count being incorrect, because it will be choosing one of the versions and not the other, but

  1. it already is, because we have results like this: and whichever you choose will find 121 results
    • Wellcome Library (95)
    • Wellcome Library (26)
  2. We could consider grouping and summing them, then the number will be correct
  3. Even if we don't the experience will be improved. Currently, the website will display two identical filters being applied if you choose one that is duplicated in the filter list.

It may result in a filtered value being absent from the list if you filter by multiple values and multiple fields. This will happen in situations like this:

  1. Filter by subject A and contributor B and C.
  2. Neither B nor C are in the top contributors about subject A
  3. B is present in the catalogue with two different identifiers
  4. Both versions of B are sorted before C in the self aggregation
paul-butcher commented 12 months ago

Following a chat with everyone, the problem of non-unique labels is a front-end thing. It is correct for the API to return multiple homonymous buckets with different ids.

Making them unique by label in the drop-downs is a job for the website.

I thought of making the self aggregation return maximally n+2, to provide a buffer. However, given the reason I first thought of doing this (I suspect there is a performance hit from trying to find more results than can exist from an aggregation with min_doc_count: 0), it might be best to just return n.

paul-butcher commented 12 months ago

Having worked out what it is I want to do, I've now profiled it in the ES Dev Console, and I don't think it makes any difference whether the self aggregation is limited to n or 30. (tried contributor = Gardiner, James) on two different clusters (to avoid being confounded by caching) and the results were the same.

This is still logically the right thing to do, as it better communicates the desired result/purpose of the self subaggregation, but I'm putting it on the back burner, as there are other things that are likely to have a greater impact.

paul-butcher commented 12 months ago

I've made the change to the query and created a PR, but I don't think we should make this change now. https://github.com/wellcomecollection/catalogue-api/pull/713