wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Consider deduplicating repeated references from the same policy document #182

Open lizgzil opened 5 years ago

lizgzil commented 5 years ago

Our output can currently contain multiple citations of the same reference from a single policy document. This can happen when the policy document has multiple reference sections.

For example, in https://apps.who.int/iris/bitstream/handle/10665/250693/9789241549875-eng.pdf?sequence=11&isAllowed=y the reference "Global, regional, and national incidence, prevalence, and years lived with disability for 301 acute and chronic diseases and injuries in 188 countries, 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013" is repeated 6 times.

Do we want to deduplicate or keep them all in our output?
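To make the two options concrete, here is a minimal sketch of how the counts would differ. It assumes matches can be represented as (policy document id, publication id) pairs; the ids, the `matches` list, and both function names are hypothetical and are not taken from the Reach codebase.

```python
from collections import Counter

# Hypothetical match records: (policy document id, matched publication id).
# The ids below are made up for illustration only.
matches = [
    ("who-250693", "gbd-2013"),
    ("who-250693", "gbd-2013"),
    ("who-250693", "gbd-2013"),
    ("who-250693", "other-paper"),
    ("nice-0001", "gbd-2013"),
]


def repeats_per_document(matches):
    """Count how often each (document, publication) pair occurs."""
    return Counter(matches)


def citation_counts(matches, dedupe=True):
    """Count citations per publication.

    With dedupe=True each policy document contributes at most one
    citation per publication; with dedupe=False every repeat counts.
    """
    pairs = set(matches) if dedupe else matches
    return Counter(publication for _document, publication in pairs)


deduplicated = citation_counts(matches, dedupe=True)
kept_all = citation_counts(matches, dedupe=False)
```

Keeping the per-pair repeat count alongside the deduplicated total would let us report one citation per document without discarding the information that the reference appeared several times.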

nsorros commented 5 years ago

I think this requires further discussion. I am adding @dd207 and @aoifespenge as this is a product feature and it may also have consequences in how many citations a researcher gets.

It seems to me that each policy document should only be able to give one citation to a publication, but I know that @lizgzil feels otherwise.

aoifespenge commented 4 years ago

Had this as a note elsewhere. Quote from participant: "but what you shouldn't have is publication one being found in the same document listed twice. What I mean is that, let's say in NICE committee papers, there are often a lot of annexes, and those annexes have multiple sections, and each section has its own reference section. So when your Reach tool goes out, they're counting each of those as individual matches, so you're getting 10 matches for this one publication, but it's really one because it's one document."

dd207 commented 4 years ago

Agree that this is important and should be addressed. We don't want to mislead users about the citation count of research publications.

I've labelled the issue under the theme of 'trust' (users have confidence in the product and know what they can get from it), so that we can prioritise when to work on it.