wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Consider fuzzy match randomness #165

Open lizgzil opened 5 years ago

lizgzil commented 5 years ago

If there are multiple matches with the same similarity the code picks the first one as the match. Should we pick this randomly or is it fine to always pick the first one in the list?

The case against picking randomly is reproducibility - if it's random we get slightly different results every time it is run, and this might lead people to distrust the results.

The case for randomness is if there is any bias in picking the first publication from the list, for example are the earlier publications in the list from older publications?

We could shuffle the publications data with a fixed random seed, making the results both random and reproducible, but if the publications data changes (which it will as more publications come out) the results will no longer be reproducible. Although perhaps the same can be said for picking the first in the list (will match publication X always be first in the list if more publications are added)?

We don't currently have an estimate of how often this even happens.
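To get that estimate, one rough approach is to count, across all references, how often more than one candidate shares the top similarity score. This is just a sketch, not the project's actual matching code; the function name and the list-of-scores input shape are hypothetical:

```python
def has_tied_best_match(scores):
    """Return True when more than one candidate shares the top similarity score.

    `scores` is a hypothetical list of similarity values for one reference's
    candidate publications.
    """
    if not scores:
        return False
    top = max(scores)
    return scores.count(top) > 1

# Counting ties across many references gives a rough rate estimate:
all_scores = [[0.9, 0.9, 0.5], [0.8, 0.7], [1.0]]
tie_rate = sum(has_tied_best_match(s) for s in all_scores) / len(all_scores)
```

Running something like this over a sample of real match results would tell us whether the problem is common enough to worry about.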

nsorros commented 5 years ago

Lately I am leaning towards avoiding randomness so that the results are reproducible. I know this introduces a bias, but it sounds minimal.

This could be a nice investigation for 'ethics': whether this decision, i.e. picking the first match when the similarity is the same, favours a particular type of match.

hblanks commented 5 years ago

Note you can seed the random number generator to get deterministic results.

ivyleavedtoadflax commented 4 years ago

I think @hblanks probably has it here. Why don't we select randomly, but set the random seed to ensure that the random choice is reproducible?
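As a sketch of what that could look like (the function name, parallel-list inputs, and default seed are all hypothetical, not code from this repo): seed a local RNG, collect the candidates tied for the best score, and pick one - the same inputs then always give the same choice.

```python
import random

def pick_match(candidates, scores, seed=42):
    """Pick reproducibly at random among candidates tied for the top score.

    `candidates` and `scores` are hypothetical parallel lists; using a
    locally seeded random.Random means repeated runs on the same input
    return the same choice, without touching the global RNG state.
    """
    top = max(scores)
    tied = [c for c, s in zip(candidates, scores) if s == top]
    rng = random.Random(seed)
    return rng.choice(tied)
```

Note this is only reproducible for a fixed candidate list - as lizgzil points out above, the choice can still change when new publications are added.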

ivyleavedtoadflax commented 4 years ago

Also I presume if we retain enough decimal places on the match, this is fairly unlikely to occur?

kristinenielsen commented 4 years ago

@lizgzil to review whether this is still an issue. If so, we can have a look at splitting it out and making it into actionable work.

lizgzil commented 4 years ago

This issue is now in lines 80-90 of https://github.com/wellcometrust/reach/blob/master/reach/airflow/tasks/fuzzy_match_refs.py i.e.

res = self.es.search(
    index=self.es_index,
    body=body,
    size=1,
)
matches_count = res['hits']['total']['value']
if matches_count == 0:
    return

best_match = res['hits']['hits'][0]

From the internet: "The top_hits aggregation uses the internal doc_id (in Lucene) as a tiebreak for documents with same sort values." (https://discuss.elastic.co/t/top-hits-query-with-same-score/107018)

So I think this means that when there are multiple documents with the same score, the top returned result is essentially random (though I'm not sure how random sorting by the internal doc_id really is).
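One way to at least see the ties would be to raise `size` above 1 in the search and compare `_score` values among the returned hits. A sketch, assuming the standard Elasticsearch response shape (the function name is made up, and the mocked response below stands in for a real `self.es.search(..., size=5)` call):

```python
def tied_top_hits(res):
    """Return all hits tied with the first hit's _score.

    `res` is an Elasticsearch search response dict; if the search was run
    with size > 1, a result of length > 1 means the "best" match was
    decided by an arbitrary tiebreak.
    """
    hits = res.get('hits', {}).get('hits', [])
    if not hits:
        return []
    top_score = hits[0]['_score']
    return [h for h in hits if h['_score'] == top_score]

# Mocked response standing in for a real search with size=5:
res = {'hits': {'hits': [
    {'_id': '1', '_score': 2.5},
    {'_id': '2', '_score': 2.5},
    {'_id': '3', '_score': 1.0},
]}}
tied = tied_top_hits(res)
```

Logging `len(tied)` per reference would also give us the frequency estimate mentioned in the issue description.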

@jdu and @SamDepardieu do you think my interpretation above in bold is correct?

I'm going to suggest that we unassign this issue to float freely as a future thing to consider. As I say in the description we don't even have an estimate for how often this happens, so perhaps it's a very uncommon problem anyway!

nsorros commented 4 years ago

I don't think the sentence in italics translates to the sentence in bold. What it says is that the document id is used as a last resort for sorting. The document id is not random: most likely earlier documents have lower document ids. If that is true, then in tie-break situations we return the most recent one.

I don't think this is very important tbh though.