Open lizgzil opened 5 years ago
Lately, I am leaning towards not using randomness, so that the results are reproducible. I know that this introduces a bias, but it sounds minimal.
This could be a nice investigation for 'ethics': does this decision, i.e. picking the first match when the similarity is the same, favor particular types of matches?
Note you can seed the random number generator to get deterministic results.
I think @hblanks probably has it here. Why don't we select randomly, but set the random seed to ensure that the random choice is reproducible?
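A minimal sketch of the seeded approach, assuming the search returns more than one hit and that each hit carries `_id` and `_score` as in a standard Elasticsearch response. The function name `pick_best_match` and the default seed are hypothetical, not taken from the reach codebase.

```python
import random


def pick_best_match(hits, seed=42):
    """Pick one hit among those sharing the top score, reproducibly."""
    if not hits:
        return None
    top_score = max(hit["_score"] for hit in hits)
    tied = [hit for hit in hits if hit["_score"] == top_score]
    # Sort by a stable key so the choice does not depend on input order.
    tied.sort(key=lambda hit: hit["_id"])
    # Local generator: seeding it does not touch the global random state.
    rng = random.Random(seed)
    return rng.choice(tied)
```

With a fixed seed the same set of tied hits always yields the same pick, but the pick can still change if a new document joins the tie.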
Also, I presume that if we retain enough decimal places on the match score, this is fairly unlikely to occur?
@lizgzil to review whether this is still an issue. If so, we can look at splitting it out and turning it into actionable work.
This issue is now in lines 80-90 of https://github.com/wellcometrust/reach/blob/master/reach/airflow/tasks/fuzzy_match_refs.py i.e.
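# size=1 asks Elasticsearch for the top-ranked hit only,
# so any ties on score are invisible at this point.
res = self.es.search(
    index=self.es_index,
    body=body,
    size=1
)
matches_count = res['hits']['total']['value']
if matches_count == 0:
    return
# The first hit is taken as the match, whatever broke the tie.
best_match = res['hits']['hits'][0]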
From the internet: "The top_hits aggregation uses the internal doc_id (in Lucene) as a tiebreak for documents with same sort values." (https://discuss.elastic.co/t/top-hits-query-with-same-score/107018)
So I think this means that, when several matches tie on score, the top returned result is basically random (though I'm not sure how random sorting by the internal doc_id is).
@jdu and @SamDepardieu do you think my interpretation above in bold is correct?
I'm going to suggest that we unassign this issue and let it float freely as a future thing to consider. As I say in the description, we don't even have an estimate for how often this happens, so perhaps it's a very uncommon problem anyway!
I don't think the sentence in italic translates to the sentence in bold. What it says is that the document ID is used as a last resort for sorting. The document ID is not random. Most likely, documents indexed earlier have lower document IDs. If that holds, then in tie-break situations we return the earliest-indexed document, since ties go to the lower document ID.
I don't think this is very important tbh though.
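If relying on the internal doc_id ever becomes a concern, one option is to make the tie-break explicit with a secondary sort key in the search body. The sketch below only builds the request body. The helper name `with_tiebreak` and the field name `doc_id` are assumptions, and the chosen field would need to be sortable in the actual index mapping.

```python
def with_tiebreak(body, tiebreak_field="doc_id"):
    """Return a copy of the search body with an explicit secondary sort.

    `tiebreak_field` is a hypothetical field name; it must exist in the
    index and be sortable (for example a keyword or numeric field).
    """
    new_body = dict(body)  # shallow copy, so the caller's dict is untouched
    new_body["sort"] = [
        {"_score": {"order": "desc"}},       # best score first
        {tiebreak_field: {"order": "asc"}},  # then a stable, known key
    ]
    return new_body
```

The returned dict could then be passed as `body` to the existing search call, so that equal scores are ordered by a field we control rather than by an internal ID.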
If there are multiple matches with the same similarity, the code picks the first one as the match. Should we pick randomly among them, or is it fine to always pick the first one in the list?
The case against picking randomly is reproducibility: if it's random, we get somewhat different results every time it is run, and this might lead people to distrust the results.
The case for randomness is that there may be a bias in picking the first publication from the list. For example, are the earlier entries in the list older publications?
We could use a random seed to shuffle the publications data, making the results both reproducible and random, but if the publications data changes (which it will, as more publications come out) this will no longer give reproducible results. Although perhaps the same can be said for picking the first in the list: will match publication X always be first in the list if more publications are added?
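One possible way around the changing-data problem is to derive each publication's tie-break rank from a hash of its own stable ID rather than from its position in a shuffled list. This is a sketch under assumptions: the function names, the `salt` value, and the idea that publications have stable string IDs are all hypothetical.

```python
import hashlib


def stable_rank(doc_id, salt="reach"):
    """Pseudo-random but fixed rank derived only from the document's own ID."""
    return hashlib.sha256(f"{salt}:{doc_id}".encode("utf-8")).hexdigest()


def pick_among_tied(tied_ids, salt="reach"):
    """Pick the tied ID with the smallest hash; input order does not matter."""
    return min(tied_ids, key=lambda doc_id: stable_rank(doc_id, salt))
```

Because each rank depends only on the ID, adding publications changes the winner only when a newly added publication both ties on similarity and happens to hash lower than the current winner.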
We don't currently have an estimate of how often this even happens.
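A rough way to get that estimate would be to request two hits per reference instead of one and count how often the top two scores are equal. The sketch below works on response dicts shaped like the Elasticsearch response used in the code above. The helper names are hypothetical, and it assumes the search was run with `size=2`.

```python
def top_score_is_tied(res):
    """True if the two best hits in a search response share the same score."""
    hits = res["hits"]["hits"]
    if len(hits) < 2:
        return False
    return hits[0]["_score"] == hits[1]["_score"]


def tie_rate(responses):
    """Fraction of responses whose top two hits are tied on score."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(top_score_is_tied(res) for res in responses) / len(responses)
```

Running this over a sample of references would show whether ties are common enough to justify any of the options discussed here.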