wellcomecollection / rank

A CLI for measuring search relevance

Define a more reliable case-sensitivity test #103

Closed paul-butcher closed 5 months ago

paul-butcher commented 5 months ago

The Order test "Capitalised match appears before lower case match" is unreliable across indices.

I believe this is because there is so much content about AIDS, all of it ranked more or less equally for a query on that term, that there is no guarantee any given document containing the term will be ranked within the top 100.

Any change to mappings or analysis, or even just a different version of Elasticsearch, could easily shuffle the order of results so that one or both of the expected records drop out of the top 100.
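To illustrate the failure mode, here is a minimal sketch of an order check that separates "wrong order" from "expected record not in the window". It is not rank's actual implementation; `check_order`, `OrderOutcome` and the `window` parameter are hypothetical names, and the ranked ids are assumed to have been fetched already.

```python
from enum import Enum
from typing import Sequence


class OrderOutcome(Enum):
    PASS = "pass"
    WRONG_ORDER = "wrong order"
    MISSING = "missing from window"


def check_order(
    ranked_ids: Sequence[str],
    first_id: str,
    second_id: str,
    window: int = 100,  # assumed cut-off, matching the "top 100" above
) -> OrderOutcome:
    """Check that first_id is ranked before second_id within the window."""
    top = list(ranked_ids[:window])
    # If either record has dropped out of the window, the test cannot
    # say anything about relative order: report that separately.
    if first_id not in top or second_id not in top:
        return OrderOutcome.MISSING
    if top.index(first_id) < top.index(second_id):
        return OrderOutcome.PASS
    return OrderOutcome.WRONG_ORDER
```

With many near-equally ranked documents, a small reshuffle is enough to turn a `PASS` into `MISSING` without any change in case-sensitivity behaviour.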

Furthermore, because of the surfeit of AIDS content, I believe that this test is not really examining what it purports to. A search for "aids" rather than "AIDS" still seems to preferentially return AIDS records, probably because the case-insensitive match finds the term in more fields, outweighing the boost applied to the case-sensitive match.
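The suspected effect can be shown with toy arithmetic. This is only a sketch: real Elasticsearch scoring is not a simple sum, and `toy_score`, the field counts and the boost value are all made-up assumptions for illustration.

```python
def toy_score(
    insensitive_field_hits: int,
    sensitive_field_hits: int,
    case_boost: float = 2.0,  # assumed boost for a case-sensitive match
) -> float:
    """Toy additive score: 1 point per field matched case-insensitively,
    plus case_boost per field matched case-sensitively."""
    return insensitive_field_hits * 1.0 + sensitive_field_hits * case_boost


# Query: "aids" (lower case).
# A record about AIDS mentions the term in five fields, upper case only,
# so it matches case-insensitively five times and case-sensitively never.
aids_record = toy_score(insensitive_field_hits=5, sensitive_field_hits=0)

# A record about "hearing aids" mentions the term in one field, lower case,
# so it matches both case-insensitively and case-sensitively once.
hearing_aids_record = toy_score(insensitive_field_hits=1, sensitive_field_hits=1)

# The AIDS record still wins despite the case-sensitive boost.
print(aids_record, hearing_aids_record)  # 5.0 3.0
```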

paul-butcher commented 5 months ago

Slack

paul-butcher commented 5 months ago

Perhaps "aids diagnosis" vs. "AIDS diagnosis"?

I think this collocation would allow records for the two meanings (aids for diagnosing any ailment vs. how to diagnose AIDS) without significantly favouring one or the other. By contrast, in a contest between a judicial hearing about AIDS and hearing aids, it's likely that the hearing aids will win in a search for "aids hearing" regardless of capitalisation.
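One way the proposed pair of queries might be checked is symmetrically: each capitalisation should rank its own record first. This is a sketch under assumptions, not rank's test format; `ranks_before`, `case_sensitivity_holds` and the record ids are hypothetical, and the ranked ids per query are assumed to have been fetched already.

```python
from typing import Mapping, Sequence


def ranks_before(ranked_ids: Sequence[str], first_id: str, second_id: str) -> bool:
    """True if both ids are present and first_id is ranked above second_id."""
    ids = list(ranked_ids)
    return (
        first_id in ids
        and second_id in ids
        and ids.index(first_id) < ids.index(second_id)
    )


def case_sensitivity_holds(
    results_by_query: Mapping[str, Sequence[str]],
    upper_id: str,  # hypothetical record about diagnosing AIDS
    lower_id: str,  # hypothetical record about aids for diagnosis
) -> bool:
    """Each capitalisation of the query should favour its own record."""
    return ranks_before(
        results_by_query["AIDS diagnosis"], upper_id, lower_id
    ) and ranks_before(results_by_query["aids diagnosis"], lower_id, upper_id)
```

If the collocation really is neutral between the two meanings, a failure in either direction points at case handling rather than at the volume of AIDS content.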