wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Create authorAffiliationScoringStrategy #47

Closed michaelbales1 closed 5 years ago

michaelbales1 commented 9 years ago

Overview

With this scoring strategy, we're trying to account for how the affiliations of all authors on an article affect the likelihood that a given targetAuthor wrote it.

To do this, we need to ask and answer several questions.

  1. Which sources are we using to make the match?

    • Scopus - does institutional disambiguation; provides affiliations as numeric codes (e.g., 60007997)
    • PubMed - affiliations are just strings
  2. Which affiliation(s) are we considering?

    • targetAuthor
    • non-targetAuthor
  3. What type of match is this?

    • explicitly defined for the individual, e.g., Dr. X got an undergraduate degree from Georgetown University, did her residency at Montefiore, etc.
    • explicitly defined for the institution, e.g., Weill Cornell faculty frequently co-author papers with individuals from Hospital for Special Surgery
    • match was not attempted because there was no available affiliation data
    • match was attempted but failed

About Scopus data

There are currently 276,666 institutional affiliation records in the Identity table, representing 3,861 unique institutions. These come from several sources, which use a controlled vocabulary.

We've looked up the Scopus Institution ID for the 1,786 institutions most often cited as a current or historical affiliation. Collectively, these account for 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus Institution ID could be. Note that a given institution, such as Weill Cornell, might have multiple institution IDs.

Values in application.properties

targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
targetAuthor-institutionalAffiliation-matchType-null-score: 0
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

nonTargetAuthor-institutionalAffiliation-weight: 0.5
nonTargetAuthor-institutionalAffiliation-maxScore: 3

homeInstitution-scopusInstitutionIDs: 60007997, 60019868, 60000247, 60072750, 60109878

homeInstitution-keywords: weill|cornell, weill|medicine, cornell|medicine, cornell|medical, weill|medical, weill|bugando, weill|graduate, cornell|presbyterian, weill|presbyterian, 10065|cornell, 10065|presbyterian, 10021|cornell, 10021|presbyterian, weill|qatar, cornell|qatar, @med.cornell.edu, @qatar-med.cornell.edu

institutionStopwords: of, the, for, and, to

collaboratingInstitutions-scopusInstitutionIDs: 60010570, , 60025849, 60012732, 60018043, 60008981, 60022875, 60019970, 60025879, 60009343, 60009656, 60072743, 60072746, 60104769, 60012981, 60000764, 60004670, 60014933, 60022377, 60005705, 60003158, 60027954, 60003711, 60103484, 60029961, 60031841, 60005208, 60002388, 60024099, 60030304, 60029652, 60026273, 60024541, 60023247, 60007555, 60017027, 60002896, 60011605, 60027565

collaboratingInstitutions-keywords: new|york|presbyterian, HSS, hospital|special|surgery, North|Shore|hospital, Long|Island|Jewish, memorial|sloan, sloan|kettering, hamad, mount|sinai, methodist|houston, National|Institute|Mental|Health, beth|israel, University|Pennsylvania|Medicine, Merck|Research, New|York|Medical|College, Medicine|Dentistry|New|Jersey, Montefiore, Lenox|Hill, Cold|Spring|Harbor, St|Luke|Roosevelt, New|York|University|Medicine, Langone, SUNY|Downstate, Albert|Einstein|Medicine, Yeshiva, UMDNJ, Icahn|Medicine, Mount|Sinai, columbia|medical, columbia|physicians

Desired output

Variables

targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType: noMatch

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-source: PubMed

nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-source: PubMed

TargetAuthor

Case 1: Target author has affiliation statements in Scopus and PubMed

targetAuthorAffiliation
    Scopus
        1 
            targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
            targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
            targetAuthor-institutionalAffiliation-source: Scopus
            targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
            targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
            targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
        2
            targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
            targetAuthor-institutionalAffiliation-source: Scopus
            targetAuthor-institutionalAffiliation-identity: "Hospital for Special Surgery"
            targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  
            targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  
            targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
        3
            targetAuthor-institutionalAffiliation-matchType: noMatch
            targetAuthor-institutionalAffiliation-source: Scopus
            targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
            targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
            targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2
        etc...
    PubMed
            targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065" 

Notes:

Case 2: Target author has affiliation statements in Scopus only

targetAuthorAffiliation
    Scopus
        1 
            targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
            targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
            targetAuthor-institutionalAffiliation-source: Scopus
            targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
            targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
            targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
        2
            targetAuthor-institutionalAffiliation-matchType: noMatch
            targetAuthor-institutionalAffiliation-source: Scopus
            targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
            targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
            targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

Case 3: Target author has affiliation statements only in PubMed

targetAuthorAffiliation
    PubMed
        targetAuthor-institutionalAffiliation-source: PubMed
        targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA."
        targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
        targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
        targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2

Non-target author

Case 4: Non-target author(s) have one or more affiliation statements in Scopus

nonTargetAuthorAffiliation
    Scopus
        nonTargetAuthor-institutionalAffiliation-source: Scopus
        nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
        nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
        nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
        nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
        nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1
        nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

Notes:

Case 5: Non-target author(s) have an affiliation statement in PubMed but not Scopus

We don't consider this case.

Pseudocode

Evaluate targetAuthor

Decide which source to use for scoring.

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true? If yes, go to 2. If no, go to 3.
2. Does article have a Scopus affiliation for targetAuthor? If yes, go to "Evaluate Scopus affiliation." If no, go to 3.
3. Does candidate article have a PubMed affiliation for targetAuthor? If yes, go to "Evaluate PubMed affiliation." If no, go to 4.
4. Return the following:
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType-null-score: 0
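
One possible reading of the decision steps above, as a minimal Python sketch. The function name and arguments are hypothetical, and the fall-through to PubMed when Scopus is disabled or has no affiliation is an assumption based on the stated preference for Scopus.

```python
from typing import List, Optional


def choose_source(use_scopus: bool,
                  scopus_affiliations: List[str],
                  pubmed_affiliation: Optional[str]) -> Optional[str]:
    """Return "Scopus", "PubMed", or None when there is nothing to match on."""
    # Steps 1 and 2: Scopus must be enabled and the article must carry
    # at least one Scopus affiliation for the target author.
    if use_scopus and scopus_affiliations:
        return "Scopus"
    # Step 3: otherwise fall back to the PubMed affiliation string, if any.
    if pubmed_affiliation:
        return "PubMed"
    # Step 4: no affiliation data; the caller reports matchType null, score 0.
    return None
```

A caller would treat a None result as the null match type rather than as a failed match, so the article is not penalized for missing data.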

Evaluate Scopus affiliation

1. Get list of institutions (these are strings) from identity.Institution for target person. Also, get Scopus institution IDs from homeInstitution-scopusInstitutionIDs from application.properties.
2. Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor.
3. Use values from identity.Institution to look up Scopus institutional identifiers in the InstitutionAfid table. For example, "Weill Graduate School of Medical Sciences of Cornell University" returns:
  "afids": [
    "60007997",
    "60019868",
    "60000247",
    "60072750",
    "60026978",
    "60025849",
    "105533257"
    ]
4. Attempt match between article and identity.

If there's a positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-source: Scopus

For EACH positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Graduate School of Medical Sciences"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2  /* value stored in application.properties */

If match, go to 7. If no match, go to 5.

5. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-scopusInstitutionIDs (stored in application.properties). Look for overlap between the two.

If there's any one positive match between the article and the collaborating institutions, output the following for all matches:

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1  /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If no match, go to 6.

6. There's no match. Output:
targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Sick Children"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Test case: meb7002 and 22667600

Go to 7.

7. If PubMed affiliation exists, output that (but don't score it):
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065" 
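
The Scopus steps above could be sketched roughly as follows. This is an illustration, not the project's implementation: the function name is hypothetical, the collaborating-institution set is only a small subset of the configured list, the scores are the ones listed under "Values in application.properties", and combining the identity-derived IDs with the home institution IDs in a single set is an assumption about how step 1 is meant to work.

```python
from typing import Iterable, Set, Tuple

# Values copied from the application.properties section of this issue.
HOME_IDS = {"60007997", "60019868", "60000247", "60072750", "60109878"}
COLLABORATING_IDS = {"60010570", "60025849", "60008981", "60022377", "60005705"}  # subset only
SCORES = {
    "positiveMatch-individual": 3.0,
    "positiveMatch-institution": 1.5,
    "noMatch": -2.0,
}


def evaluate_scopus(article_afids: Iterable[str],
                    identity_afids: Iterable[str]) -> Tuple[str, float, Set[str]]:
    """Return (matchType, score, matching Scopus IDs) for the target author."""
    article = set(article_afids)
    # Step 4: match against IDs known for the individual (plus home institution).
    individual = article & (set(identity_afids) | HOME_IDS)
    if individual:
        return "positiveMatch-individual", SCORES["positiveMatch-individual"], individual
    # Step 5: match against collaborating institutions. One score is returned
    # no matter how many IDs overlap, which caps the contribution.
    institution = article & COLLABORATING_IDS
    if institution:
        return "positiveMatch-institution", SCORES["positiveMatch-institution"], institution
    # Step 6: an affiliation was present but nothing matched.
    return "noMatch", SCORES["noMatch"], set()
```

Returning the matching IDs alongside the score keeps enough information to emit one output record per match, as the steps describe.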

Evaluate PubMed affiliation

1. Get list of institutions (these are strings) from identity.institutions for person under consideration.
2. Get article.affiliation for targetAuthor.
3. Preprocess.

Get list of stopwords from institutionStopwords field in application.properties.

Remove stopwords, commas, and dashes from article.affiliation and identity.institutions.

Ignore any words inside parentheses. These are typically countries and are not included in affiliation statements.

4. Attempt match between article.affiliation and identity.institutions. The logic here is that every keyword from an identity.institutions entry appears somewhere in article.affiliation, in any order.

Here's how we do this match. Grab each affiliation and see if all the keywords are represented in a single affiliation. For example, suppose an author has a known affiliation in identity.institutions of "Weill Cornell Medical College". And, suppose the article affiliation is "Department of Pharmacology, Medical College of Weill Cornell." This would be a match because all the words in the identity affiliation are represented in the article affiliation.
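
A minimal sketch of the preprocessing and keyword match just described, using the example from the paragraph above. The function names are hypothetical. Lowercasing and trimming a trailing period are assumptions added so that "Cornell." compares equal to "Cornell"; the steps above only mention stopwords, commas, dashes, and parentheses.

```python
import re

# From institutionStopwords in application.properties.
STOPWORDS = {"of", "the", "for", "and", "to"}


def keywords(text: str) -> set:
    """Reduce an affiliation string to a set of comparable keywords."""
    text = re.sub(r"\([^)]*\)", " ", text)  # ignore words inside parentheses
    text = re.sub(r"[,\-]", " ", text)      # commas and dashes become spaces
    words = {w.strip(".").lower() for w in text.split()}  # assumption: case/period insensitive
    return words - STOPWORDS - {""}


def matches_identity(identity_institution: str, article_affiliation: str) -> bool:
    """True when every identity keyword appears in the article affiliation."""
    wanted = keywords(identity_institution)
    return bool(wanted) and wanted <= keywords(article_affiliation)
```

Because the comparison is a subset test on sets, word order in the article affiliation does not matter, which is what makes "Medical College of Weill Cornell" match "Weill Cornell Medical College".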

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2  /* value stored in application.properties */

Maximum of one match.

If there's no match, go to 5.

5. Attempt match against homeInstitution-keywords.

Get homeInstitution-keywords from application.properties.

Look for cases where a group of homeInstitution keywords is present in the affiliation string in any order. Here's how we do this. Take any group of terms from homeInstitution-keywords, e.g., "weill|cornell". In order for this to be a match, all terms in the group must be present, in any order and in any case.
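
As an illustration of the keyword-group rule, here is a short sketch that parses a few of the homeInstitution-keywords entries and tests an affiliation string against them. The function names are hypothetical, and plain substring containment (rather than whole-word matching) is an assumption, chosen so that entries like "@med.cornell.edu" also work.

```python
from typing import List

# A few entries from homeInstitution-keywords in application.properties.
HOME_KEYWORDS = "weill|cornell, weill|medicine, 10065|cornell, @med.cornell.edu"


def parse_groups(raw: str) -> List[List[str]]:
    """Split 'a|b, c|d' into [['a', 'b'], ['c', 'd']], lowercased."""
    return [[term.strip().lower() for term in group.split("|")]
            for group in raw.split(",") if group.strip()]


def matches_any_group(affiliation: str, groups: List[List[str]]) -> bool:
    """True when all terms of at least one group occur in the affiliation."""
    haystack = affiliation.lower()
    return any(all(term in haystack for term in group) for group in groups)
```

With these entries, "Cornell University, Ithaca, NY" alone would not match, because no group has all of its terms present; requiring every term in a group is what keeps the main Cornell campus from counting as the home institution.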

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
homeInstitution-Label: Weill Cornell Medicine / NewYork-Presbyterian Hospital
 /* value stored in application.properties */

Maximum of one match.

If there's no match, go to 6.

6. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-keywords (stored in application.properties). Look for overlap between the two.

If there's any one positive match between the article and the collaborating institutions, output the following for all matches:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1  /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Special Surgery, New York, NY 10021"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If there's no match, go to 7.

7. There's no match. Output:
targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Sick Children, Quebec City, Quebec, Canada YRV MX1"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Evaluate nonTargetAuthor

Decide which source to use

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?
2. Does article have any Scopus affiliation for nonTargetAuthor?
3. Does candidate article have any PubMed affiliation for nonTargetAuthor?
4. Return the following:
nonTargetAuthor-institutionalAffiliation-matchType: null
nonTargetAuthor-institutionalAffiliation-matchType-null-score: 0

Evaluate Scopus affiliation

1. Preprocessing

A. Create scopusIDsNonTargetAuthor-Article.

B. Create scopusIDsNonTargetAuthor-Identity-KnownInstitutions.

C. Create scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions

2. Determine overlap.

Compute the following:

3. Compute overall score.

Get nonTargetAuthor-institutionalAffiliation-weight (the weight applied to collaborating institutions) and nonTargetAuthor-institutionalAffiliation-maxScore from application.properties.

score = nonTargetAuthor-institutionalAffiliation-maxScore
        * (countScopusIDsNonTargetAuthor-Article-KnownInstitution
           + countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution * nonTargetAuthor-institutionalAffiliation-weight)
        / countScopusIDNonTargetAuthor-Affiliations
4. Output values
nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

/* Here we're outputting Scopus institution labels, identifiers, and counts for all matching institutions. */
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1
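
The overall non-target score from step 3 can be written as a small function. This is a sketch with hypothetical names. The defaults come from the weight (0.5) and maxScore (3) listed in application.properties, and returning 0 when there are no affiliations at all is an assumption made to mirror the null score.

```python
def non_target_score(known_count: int,
                     collaborating_count: int,
                     total_affiliations: int,
                     weight: float = 0.5,
                     max_score: float = 3.0) -> float:
    """Weighted share of non-target affiliations that are known or collaborating."""
    if total_affiliations == 0:
        return 0.0  # assumption: nothing to evaluate, so no evidence either way
    weighted_matches = known_count + collaborating_count * weight
    return max_score * weighted_matches / total_affiliations
```

For instance, with 5 known-institution affiliations, 4 collaborating-institution affiliations, and 10 affiliations in total (numbers chosen only for illustration), the result is 3 * (5 + 4 * 0.5) / 10 = 2.1. The score reaches maxScore only when every non-target affiliation is a known institution.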

Evaluate PubMed affiliation

At this time, we're not evaluating PubMed affiliation for nonTargetAuthors.

jl987-Jie commented 7 years ago

Added data from the provided file to MLab's MongoDB server.

paulalbert1 commented 6 years ago

@sarbajitdutta - There's a bug for ses9022 and 16614246: the institutional affiliation in Scopus is null, so the score should be 0 rather than -3.

    "pmid": 16614246,
        "affiliationEvidence": {
          "scopusTargetAuthorAffiliation": [
            {
              "targetAuthorInstitutionalAffiliationSource": "SCOPUS",
              "targetAuthorInstitutionalAffiliationIdentity": null,
              "targetAuthorInstitutionalAffiliationArticleScopusLabel": null,
              "targetAuthorInstitutionalAffiliationArticleScopusAffiliationId": 0,
              "targetAuthorInstitutionalAffiliationMatchType": "NO_MATCH",
              "targetAuthorInstitutionalAffiliationMatchTypeScore": -3
            }
          ],

paulalbert1 commented 6 years ago

Also, we should match against all affiliations. We're currently matching only against the first. Finally, we should incorporate home institution from application.properties.
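
A rough sketch of the behavior requested in these two comments, with hypothetical names and illustrative scores: missing affiliations (null, empty, or an ID of 0) yield the null score, and every remaining affiliation is checked rather than only the first.

```python
from typing import Iterable, Set


def score_all_affiliations(article_afids: Iterable,
                           known_afids: Set[str],
                           match_score: float = 3.0,
                           no_match_score: float = -2.0,
                           null_score: float = 0.0) -> float:
    """Score a target author's Scopus affiliations on one article."""
    # Treat None, empty strings, and 0 as "no affiliation data".
    present = [afid for afid in article_afids if afid not in (None, "", "0", 0)]
    if not present:
        return null_score  # match not attempted, so no penalty
    # Check every affiliation, not just the first one.
    if any(afid in known_afids for afid in present):
        return match_score
    return no_match_score
```

Here known_afids would be the union of the IDs looked up for the person and the homeInstitution-scopusInstitutionIDs from application.properties.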

paulalbert1 commented 5 years ago

I think this is fixed.