The self sub aggregation in a paired aggregation should have size=n, where n is the size of the list of filter terms - Githubissues

wellcomecollection / catalogue-api

:crystal_ball: The API for searching the Wellcome Collection catalogue.

https://developers.wellcomecollection.org

MIT License

4 stars 0 forks source link

The self sub aggregation in a paired aggregation should have size=n, where n is the size of the list of filter terms #712

Closed paul-butcher closed 12 months ago

paul-butcher commented 1 year ago

Currently, the self sub aggregation is a copy of the main terms aggregation to which it corresponds, operating on a differently filtered set but with some changes:

It can return buckets of zero size
It only returns buckets that match the filter term(s)

Given that the purpose of this sub aggregation is to return the filter terms if they are absent from the otherwise filtered query, the bucket count should be constrained to the number of terms requested.

paul-butcher commented 1 year ago

~~This is probably blocked by https://github.com/wellcomecollection/catalogue-pipeline/issues/2436~~

paul-butcher commented 12 months ago

The problem here is that because we use labels in the filter, and it's possible for two distinct concepts to have the same label, we cannot guarantee that only that number of buckets is required.

That cannot be fixed until either:

We start using ids in the filters.
We implement the sameAs feature in the Concepts Pipeline and start stamping it back onto Catalogue items.

paul-butcher commented 12 months ago

However, we could make an assumption that the labels are unique, and possibly force them to be if they are not

This may result in the document count being incorrect, because it will be choosing one of the versions and not the other, but

it already is, because we have results like this: and whichever you choose will find 121 results
- Wellcome Library (95)
- Wellcome Library (26)
We could consider grouping and summing them, then the number will be correct
Even if we don't the experience will be improved. Currently, the website will display two identical filters being applied if you choose one that is duplicated in the filter list.

It may result in a filtered value being absent from the list if you filter by multiple values and multiple fields. This will happen in situations like this:

Filter by subject A and contributor B and C.
Neither B nor C are in the top contributors about subject A
B is present in the catalogue with two different identifiers
Both versions of B are sorted before C in the self aggregation

paul-butcher commented 12 months ago

Following a chat with everyone, the problem of non-unique labels is a front-end thing. It is correct for the API to return multiple homonymous buckets with different ids.

Making them unique by label in the drop-downs is a job for the website.

I thought of making the self aggregation return maximally n+2, to provide a buffer. However, given the reason I first thought of doing this (I suspect there is a performance hit from trying to find more results than can exist from an aggregation with min_doc_count: 0), it might be best to just return n.

paul-butcher commented 12 months ago

Having worked out what it is I want to do, I've now profiled it in the ES Dev Console, and I don't think it makes any difference whether the self aggregation is limited to n or 30. (tried contributor = Gardiner, James) on two different clusters (to avoid being confounded by caching) and the results were the same.

This is still logically the right thing to do, as it better communicates the desired result/purpose of the self subaggregation, but I'm putting it on the back burner, as there are other things that are likely to have a greater impact.

paul-butcher commented 12 months ago

I've made the change to the query and created a PR, but I don't think we should make this change now. https://github.com/wellcomecollection/catalogue-api/pull/713