wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License

Remove publications that have 0 citations from results #548

Closed jdu closed 4 years ago

jdu commented 4 years ago

We are removing publications that have 0 citations in policy from the citations results page.

This will help users navigate and increase their chances of finding relevant results.

Conversation/debate about this choice below for info only.


To support showing results in citations without hiding results that have no cited documents, we need to add a modifier to the citations ranking. The modifier alters the calculated rank of a citations result so that results with citations surface above non-cited results.

aoifespenge commented 4 years ago

On this (but as a side note), I saw you mentioned a calculated rank in your Slack message. Could you explain how the search results in Discover Citations are ranked? I'd like to understand how the order in which the search results are listed is decided @jdu

dd207 commented 4 years ago

I've had various conversations with @jdu and @aoifespenge so summarising in one place. When a publication has no citation results, we have some options for how we display this for users.

We could:

1) Remove the publication from the results page completely.
2) Acknowledge that we have found the publication, and show '0' citation matches in the results table.

There are pros and cons to both options.

1) Remove the publication from the results page completely

Pros: Clearer results only showing publications with citations.

Cons: Increased chance of users getting a 'no results' page and becoming confused or losing trust in the product.

2) Acknowledge that we have found the publication, and show '0' citation matches in the results table

Pros: We could make an assumption that showing 'feedback' will help users understand the workings of the product and increase trust.

Cons: If a search retrieves many publications it could confuse users to wade through results that do not have any citations. The product has been framed as a way to find citations so users could be confused as to why we are showing publications with no policy citations.


My preference is that we go with option 1 and remove publications without citations in policy documents from the results table.

After having a chat with @aoifespenge I agree that we need to frame this as an 'experiment' that we need to track to check if this is the right answer, and change if we've made the wrong decision.

I'm interested in other thoughts from the team, and if there are any other points we need to consider.

aoifespenge commented 4 years ago

I agree with you Dawn. I think it's best to keep things coherent. Super interesting choice to need to make, and I think it will be useful to be able to roll back on this if we find that it's not working for users. As you have said, let's be experimental. I've attempted to write out a 'hypothesis' for this. It's a bit wordy, so I could do with some help on this:

If we only include search results in Discover Citations that have a policy document citation, then users will have a more coherent understanding of how the search works. We know this will be true when users can comprehend that the search results indicate matches of citations in policy documents, and not whether the publication exists in the database or not.

Hypothesis structure is: If I do [action] then [change]. I will know this to be true when [measurable outcome].

jdu commented 4 years ago

I don't think that hiding results will help them understand more. In fact I think it will do the complete opposite. For instance if I go into the application and search for "Malaria and Hypertension" right now it will return ~6 publications, none of which have citations. I immediately understand that those items have no citations from visual feedback, but I know that there are publications which match those search terms.

However, if we hide those results, the user will see a "No results found" page, and the assumption will be that there are no publications matching those search terms, NOT that there are no publications matching those search terms WITH citations. Those are two very distinct mentalities with different outcomes in terms of how the user uses the system over the longer term. If the terms they are interested in don't return results, they will assume we don't have anything about those types of publications and will be less likely to return. If we show them we have those publications but we don't have any citations, they are more likely to return to "check up" to see if they've been cited later on, because they know we have publications matching their terms.

Even more specifically, what happens when a user searches for a specific publication by its PMCID, a publication they know is in EPMC, which we list as one of our sources, and because that publication has no citations, we don't show it? For a user, that could create confusion, and they would be less apt to trust the system and its results further down the line.

After a period of time, we may suddenly have a citation for a publication we didn't previously, but because the user originally searched and no publications were found matching their criteria, they are less likely to return to check the "status" of those publications to see if new citations have happened since their last search because they'll believe we just don't have that publication.

I think knowing that a publication has no citations is JUST as important as knowing that it does if you are trying to measure the impact or reach of a given publication.

Modified Ranking

The modifier I was talking about: we calculate a rank based on how well the result matches the search terms the user provides. The rank is a float value; as results become more distant from the search terms, that rank decreases down into the decimals. We can alter the rank with a modifier value to rank publications that have citations higher than publications which have none, so that when the user is looking at the search results, the publications which have citations appear before publications which don't, despite non-cited publications potentially having a better match against the search terms. Basically, we take the maximum rank amongst all ranked results and add that rank to all publications with citations to shift them above the uncited publications.
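A minimal sketch of the modifier described above, in Python. The `Result` class, its field names and the `boost_cited` function are hypothetical and not taken from the Reach codebase. It also assumes ranks are positive floats where higher means a better match.

```python
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    rank: float      # text-match rank (assumed positive, higher is better)
    citations: int   # number of policy-document citations


def boost_cited(results):
    """Order results so cited publications come before uncited ones."""
    if not results:
        return []
    # The modifier is the maximum rank amongst all ranked results.
    max_rank = max(r.rank for r in results)

    def adjusted(r):
        # Only cited publications receive the modifier.
        return r.rank + max_rank if r.citations > 0 else r.rank

    return sorted(results, key=adjusted, reverse=True)


results = [
    Result("A", 0.90, 0),
    Result("B", 0.40, 3),
    Result("C", 0.10, 1),
    Result("D", 0.75, 0),
]
ordered = [r.title for r in boost_cited(results)]
print(ordered)
```

Within the cited and uncited groups the original match order is preserved, so an uncited publication with a strong match (A here) still sorts below a cited publication with a weak match (C).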

dd207 commented 4 years ago

Hi, Just had a look at the product where it looks like we're including publications with 0 citations.

I searched for a general term 'typhoid' and received 80+ pages of results, with the first page being publications with 0 citations, going as far back as 1898.

It's going to be confusing for users to wade through the information to find what they need.

Let's stick with only showing the research publications that have been cited in policy.

[Screenshot 2020-06-09 at 15 09 17]

aoifespenge commented 4 years ago

[Screenshot 2020-06-09 at 15 18 43]

See content which describes to the user that if they don't get a search result it is because it doesn't have a citation

jdu commented 4 years ago

@aoifespenge That should be reworded, as what's stated there doesn't align with what we're describing as the functionality. It says that papers that aren't in EPMC won't show up even if the paper would have citations.

What it needs to say is "If a paper is not in EPMC or has no citations associated with it then it will not be reported by Reach." or something similar.

aoifespenge commented 4 years ago

A simple way to give the user better feedback as to why they aren't getting the search result is to add this to the 'no search results' message.

The modifier approach would lead to the overwhelming issue Dawn has pointed out, and it also changes the conceptual model of 'Discover citations' to say "we give you search results from EPMC as well as search results of matches".

So I suggest including wording such as: No search results: This may be because 1) the publication is not in the EPMC database. Check by searching for it on EPMC. 2) the publication is in the EPMC database but does not currently have any citations in policy documents.

Screenshot 2020-06-09 at 15 22 54

Finally, one thing we need to pay attention to: although our concept is that EPMC is the database against which our docs are matched (as outlined on 'About reach'), users didn't seem to comprehend this in testing. They sometimes read it. I just re-watched a recording from the previous round; the user read this content but was still confused. We need to watch this when it is 'in the wild', through such things as Hotjar recordings. And I think we need to do much more targeted usability testing.

aoifespenge commented 4 years ago

> @aoifespenge That should be reworded, as what's stated there doesn't align with what we're describing as the functionality, it says that papers that aren't in EPMC won't show up even if the paper would have citations.
>
> What it needs to say is "If a paper is not in EPMC or has no citations associated with it then it will not be reported by Reach." or something similar.

Yeah that makes sense. @dd207 ?

I think all this highlights that we have a complicated concept that we haven't come to a consensus on as a team. I think it's going to take time to smooth it out. But we'll get there.

Jeff and Dawn I have a clip from the last round - I'm going to share it on Slack with you, as this repo is not private. Worth a watch.

jdu commented 4 years ago

@aoifespenge I was about to suggest the same re the results page wording: reword the no results found message for citations to clarify why there aren't any results. That won't help in a situation where they look for a general term and some publications show up but others they were expecting don't, as there wouldn't be a no results view in that case.

I think you should lean towards thinking that a user isn't likely to read any of this content, and is more likely to jump straight to the citations page and try a search before much more than a quick scan of the first few items on the pages. So the search page itself needs to have hinting towards what the functionality is. So the above is definitely in line with that.

aoifespenge commented 4 years ago

Jeff, ya you're right, only some users are the diligent reader type. Even when they read the content, the concepts are so confusing they still don't get it. So it should be in the search results error message. I was highlighting this content because it indicates the conceptual model we have been trying to design for, so if we add this modifier in then it conflicts with that content, which will piss off the diligent reader types who do read that content.

jdu commented 4 years ago

Longer-term, maybe a feature where we show only cited publications, but at the end of the results we have a link/statement like Google does sometimes: "Some results were omitted due to lack of citations, click here to make them visible"?
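A rough sketch of how that could work, in Python. The function names, the dictionary shape of a result and the notice wording are all assumptions for illustration, not existing Reach code: show only cited publications, but keep a count of what was hidden so the page can offer to reveal them.

```python
def split_by_citation(results):
    """Return (cited results, number of uncited results hidden)."""
    cited = [r for r in results if r["citations"] > 0]
    omitted = len(results) - len(cited)
    return cited, omitted


def omitted_notice(omitted):
    """Build the end-of-results notice; empty if nothing was hidden."""
    if omitted == 0:
        return ""
    return (
        f"{omitted} result(s) were omitted due to lack of citations, "
        "click here to make them visible"
    )


results = [
    {"title": "A", "citations": 0},
    {"title": "B", "citations": 2},
    {"title": "C", "citations": 0},
]
cited, omitted = split_by_citation(results)
notice = omitted_notice(omitted)
print([r["title"] for r in cited], notice)
```

If every match is uncited, the user would still get the notice instead of a bare "No results found" page, which addresses the concern raised earlier in the thread.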

aoifespenge commented 4 years ago

Yeah I think this links to OKR 2 around 'what is our search concept'. Hold on to that idea!

dd207 commented 4 years ago

> What it needs to say is "If a paper is not in EPMC or has no citations associated with it then it will not be reported by Reach." or something similar.

Yep, fine with that. I'll add a ticket for it.

aoifespenge commented 4 years ago

@dd207 how about the part about changing what the search results error message says: https://github.com/wellcometrust/reach/issues/548#issuecomment-641337335

dd207 commented 4 years ago

I'm going to update the title and content of this issue to state 'removal of publications that have 0 citations'. @jdu can you add the technical requirements for this?

aoifespenge commented 4 years ago

Think it was great to hash this issue out here and we should do more of this :)