wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Update search to capture publications that are in PubMed but not Entrez #450

Closed paulalbert1 closed 3 years ago

paulalbert1 commented 3 years ago

Problem

Every so often a publication from 6-18 months ago will appear among the new suggestions for a person whose publications we track religiously (reviewing once a week or more).

An example of this is user = smkamins and PMID = 33365208. That publication just popped up in the pending publication list yesterday, but it hit PubMed in 2019. skaminsk has only 112 candidate publications. The exact same thing happened for rgcryst.

Why didn't we capture it way earlier?

Possible causes

The following are possible causes:

  1. "We don't look these individuals up frequently enough to surface this publication." → We review candidate pubs for skaminsk and rgcryst 1x/week or more.
  2. "Clustering is to blame." → Unlikely. Both rgcryst and smkamins have few low-scoring publications among their candidate articles.
  3. "These are low-scoring articles, and they just missed the cut-off." → No. These are often high-scoring results.
  4. "Something is wacky with the dates." → This seems most likely, as I explain below.

PMID = 33365208 is one of the roughly 5% of articles in PubMed that have a dateAddedToEntrez later than the dateAddedToPubmed.

                <PubMedPubDate PubStatus="entrez">
                    <Year>2020</Year>
                    <Month>12</Month>
                    <Day>28</Day>
                    <Hour>12</Hour>
                    <Minute>6</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="pubmed">
                    <Year>2019</Year>
                    <Month>1</Month>
                    <Day>1</Day>
                    <Hour>0</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="medline">
                    <Year>2019</Year>
                    <Month>1</Month>
                    <Day>1</Day>
                    <Hour>0</Hour>
                    <Minute>1</Minute>
                </PubMedPubDate>
            </History>

Most of the time, the discrepancies between PubStatus="pubmed" and PubStatus="entrez" are minor, off by a day or so. But among that 5%, there is an even smaller subset of articles where the gap is months or more, and this is one of them.
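To put a number on the gap for a given record, the two History dates can be compared directly. The following is a minimal Python sketch, not ReCiter code: `history_dates` and `entrez_lag_days` are hypothetical helper names, and the XML is a trimmed copy of the excerpt above with an opening `<History>` tag added so that it parses.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Trimmed copy of the History element quoted above for PMID 33365208.
# The opening <History> tag is added here so the fragment is well-formed.
HISTORY_XML = """
<History>
    <PubMedPubDate PubStatus="entrez">
        <Year>2020</Year><Month>12</Month><Day>28</Day>
        <Hour>12</Hour><Minute>6</Minute>
    </PubMedPubDate>
    <PubMedPubDate PubStatus="pubmed">
        <Year>2019</Year><Month>1</Month><Day>1</Day>
        <Hour>0</Hour><Minute>0</Minute>
    </PubMedPubDate>
</History>
"""

DATE_PARTS = ("Year", "Month", "Day")
TIME_PARTS = ("Hour", "Minute")


def history_dates(history_xml):
    """Map each PubStatus value to a datetime (missing Hour/Minute become 0)."""
    dates = {}
    for node in ET.fromstring(history_xml).iter("PubMedPubDate"):
        ymd = [int(node.findtext(tag)) for tag in DATE_PARTS]
        hm = [int(node.findtext(tag, default="0")) for tag in TIME_PARTS]
        dates[node.get("PubStatus")] = datetime(*ymd, *hm)
    return dates


def entrez_lag_days(history_xml):
    """Whole days by which the entrez date trails the pubmed date.

    A negative value means the entrez date is the earlier of the two.
    """
    dates = history_dates(history_xml)
    return (dates["entrez"] - dates["pubmed"]).days


if __name__ == "__main__":
    print(entrez_lag_days(HISTORY_XML))
```

Running a check like this over a batch of records would show how many fall into the "months or more" subset rather than the "day or so" majority.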

When we do a date search using incremental lookup, our practice is to use the [DP] tag. For example:

("2020/12/28"[DP] : "2020/12/31"[DP]) AND kaminsky s[au]

The DP tag keys off of the date associated with PubStatus="entrez". Obviously, if it's null, we will miss out on these publications. In contrast, the [edat] tag keys off the PubStatus="pubmed". (It's confusing that it starts with an "e"!)

This returns zero results:

("1950/01/01"[DP] : "2019/01/01"[DP]) AND kaminsky s[au] AND 10.1080/21678707.2019.1684258[doi]

This returns one result:

("1950/01/01"[edat] : "2019/01/01"[edat]) AND kaminsky s[au] AND 10.1080/21678707.2019.1684258[doi]
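One way to repeat this comparison outside the PubMed web interface is to ask the NCBI E-utilities `esearch` endpoint for the hit count of each term. This is only a rough sketch: the function name `esearch_count_url` is made up for illustration, and the code builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_count_url(term):
    """Return an esearch URL that requests only the hit count for a term."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


DOI = "10.1080/21678707.2019.1684258[doi]"

# The two terms from the comparison above; only the date tag differs.
DP_TERM = f'("1950/01/01"[DP] : "2019/01/01"[DP]) AND kaminsky s[au] AND {DOI}'
EDAT_TERM = f'("1950/01/01"[edat] : "2019/01/01"[edat]) AND kaminsky s[au] AND {DOI}'

if __name__ == "__main__":
    for term in (DP_TERM, EDAT_TERM):
        print(esearch_count_url(term))
```

Fetching each URL and reading the `<Count>` element of the response should give the two counts being compared.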

Possible fixes

We could do one or both of these...

  1. Confirm that when retrievalRefreshFlag = ALL_PUBLICATIONS is set, it does not depend on the "DP" field. I don't think we're doing this, but this is the only reason I can think of for why our monthly recon captures older pubs.
  2. Assuming it is not significantly slower, incremental lookups should do this...
    (("2019/01/01"[edat] : "2019/01/02"[edat]) OR ("2019/01/01"[dp] : "2019/01/02"[dp])) AND kaminsky s[au] 
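The combined query in option 2 could be assembled along these lines. This is a hedged Python sketch rather than ReCiter's actual query builder: `incremental_query` and its parameters are hypothetical, and the author string is assumed to be already formatted the way PubMed expects.

```python
from datetime import date


def incremental_query(author, start, end):
    """Build a PubMed term that matches the date window on [edat] OR [dp]."""
    lo = start.strftime("%Y/%m/%d")
    hi = end.strftime("%Y/%m/%d")
    # One date-range clause per tag, so a record is found if either date
    # falls inside the window.
    windows = " OR ".join(
        f'("{lo}"[{tag}] : "{hi}"[{tag}])' for tag in ("edat", "dp")
    )
    return f"({windows}) AND {author}[au]"


if __name__ == "__main__":
    print(incremental_query("kaminsky s", date(2019, 1, 1), date(2019, 1, 2)))
```

Keeping the tag list in one place makes it easy to add or drop a date field later if the combined search turns out to be too slow.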

Another possible fix, which I don't recommend, is to switch over to using the [EDAT] tag entirely. The reason is that there is a subset of articles where the opposite problem occurs.

For example, for 20228386, there is the following...

                <PubMedPubDate PubStatus="entrez">
                    <Year>2010</Year>
                    <Month>3</Month>
                    <Day>16</Day>
                    <Hour>6</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>
                <PubMedPubDate PubStatus="pubmed">
                    <Year>2010</Year>
                    <Month>3</Month>
                    <Day>17</Day>
                    <Hour>6</Hour>
                    <Minute>0</Minute>
                </PubMedPubDate>

That said, the difference in these cases tends to be a day or less.

paulalbert1 commented 3 years ago

I do believe this is fixed.