Closed michaelbales1 closed 5 years ago
Added data from the provided file to MLab's MongoDB server.
@sarbajitdutta - A bug: for ses9022 and 16614246, the institutional affiliation in Scopus is null, so the score should be 0 rather than -3.
"pmid": 16614246,
"affiliationEvidence": {
  "scopusTargetAuthorAffiliation": [
    {
      "targetAuthorInstitutionalAffiliationSource": "SCOPUS",
      "targetAuthorInstitutionalAffiliationIdentity": null,
      "targetAuthorInstitutionalAffiliationArticleScopusLabel": null,
      "targetAuthorInstitutionalAffiliationArticleScopusAffiliationId": 0,
      "targetAuthorInstitutionalAffiliationMatchType": "NO_MATCH",
      "targetAuthorInstitutionalAffiliationMatchTypeScore": -3
    }
  ],
Also, we should match against all affiliations; we're currently matching only the first. Finally, we should incorporate the home institution from application.properties.
I think this is fixed.
Overview
With this scoring strategy, we're trying to account for the extent to which the affiliations of all authors affect the likelihood that a given targetAuthor authored an article.
To do this, we need to ask and answer several questions.
Which sources are we using to make the match?
Which affiliation(s) are we considering?
What type of match is this?
About Scopus data
There are currently 276,666 institution records in the Identity table, representing 3,861 unique institutions. These come from several sources, which use a controlled vocabulary.
We've looked up the Scopus Institution ID for the 1,786 institutions that are most often cited as being a current or historical affiliation. This collectively represents 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus Institution ID could be. Note that a given institution such as Weill Cornell might have multiple institution IDs.
Values in application.properties
Desired output
Variables
TargetAuthor
Case 1: Target author has affiliation statements in Scopus and PubMed
Notes:
Case 2: Target author has affiliation statements in Scopus only
Case 3: Target author has affiliation statements only in PubMed
Non-target author
Case 4: Non-target author(s) have one or more affiliation statements in Scopus
Notes:
Case 5: Non-target author(s) have an affiliation statement in PubMed but not Scopus
We don't consider this case.
Pseudocode
Evaluate targetAuthor
Decide which source to use for scoring.
We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.
1. As set in application.properties, is use.scopus.articles=true?
2. Does article have a Scopus affiliation for targetAuthor?
3. Does candidate article have a PubMed affiliation for targetAuthor?
4. Return the following:
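The decision in steps 1-3 can be sketched as below. This is a minimal illustration, not the project's implementation: the function name and the returned labels are assumptions (only "SCOPUS" appears in the sample record above), and the values the step actually returns are not listed in this issue.

```python
from typing import Optional


def choose_affiliation_source(use_scopus_articles: bool,
                              has_scopus_affiliation: bool,
                              has_pubmed_affiliation: bool) -> Optional[str]:
    """Pick the source used to score the targetAuthor's affiliation.

    Hypothetical helper: prefers Scopus when it is enabled and present,
    otherwise falls back to PubMed, otherwise reports no usable source.
    """
    # Steps 1 and 2: Scopus must be enabled in application.properties
    # (use.scopus.articles=true) and the article must carry a Scopus affiliation.
    if use_scopus_articles and has_scopus_affiliation:
        return "SCOPUS"
    # Step 3: fall back to the PubMed affiliation statement.
    if has_pubmed_affiliation:
        return "PUBMED"  # assumed label, not taken from the issue
    return None
```

For example, with Scopus disabled the function falls back to PubMed even if a Scopus affiliation exists.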
Evaluate Scopus affiliation
1. Get list of institutions (these are strings) from identity.Institution for target person. Also, get Scopus institution IDs from homeInstitution-scopusInstitutionIDs in application.properties.
2. Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor.
3. Use values from identity.Institution to look up Scopus institutional identifiers in the InstitutionAfid table. For example, Weill Graduate School of Medical Sciences of Cornell University returns:
4. Attempt match between article and identity.
If there's a positive match between article and identity, output the following:
For EACH positive match between article and identity, output the following:
If match, go to 7. If no match, go to 5.
5. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-scopusInstitutionIDs (stored in application.properties). Look for overlap between the two.
If there's any one positive match between article and identity, output the following for all matches:
While there can be multiple matches, the maximum score returned for this type of match should be 1.
If no match, go to 6.
6. There's no match. Output:
Test case: meb7002 and 22667600
Go to 7.
7. If PubMed affiliation exists, output that (but don't score it):
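The cascade in steps 4-6 can be sketched as follows. Several things here are assumptions made for illustration: the function name, the result keys, every match-type label except "NO_MATCH", treating an ID of 0 as a missing affiliation (as in the sample record at the top of this issue), and pooling home-institution IDs into the identity set. The score for a direct match is not stated in this issue, so it is a required parameter; how multiple direct matches combine is also unspecified, so the sketch reports a single score.

```python
from typing import Dict, Set


def score_scopus_affiliation(article_ids: Set[int],
                             identity_ids: Set[int],
                             collaborating_ids: Set[int],
                             positive_score: float,
                             collaborating_cap: float = 1.0,
                             no_match_score: float = -3.0) -> Dict:
    """Hypothetical sketch of the Scopus affiliation cascade for a targetAuthor.

    identity_ids is assumed to already include the home-institution IDs.
    """
    # Drop placeholder IDs (0 or None), which indicate a null affiliation.
    present = {afid for afid in article_ids if afid}
    if not present:
        # A null Scopus affiliation should score 0 rather than -3.
        return {"matchType": "NULL_MATCH", "matchedIds": [], "score": 0.0}

    # Step 4: direct overlap between article IDs and identity IDs.
    direct = sorted(present & identity_ids)
    if direct:
        return {"matchType": "POSITIVE_MATCH_INDIVIDUAL",
                "matchedIds": direct, "score": positive_score}

    # Step 5: overlap with collaborating institutions; all matches are
    # reported, but the score for this match type never exceeds the cap of 1.
    collaborating = sorted(present & collaborating_ids)
    if collaborating:
        return {"matchType": "POSITIVE_MATCH_INSTITUTION",
                "matchedIds": collaborating, "score": collaborating_cap}

    # Step 6: nothing matched.
    return {"matchType": "NO_MATCH", "matchedIds": [], "score": no_match_score}
```

Returning the matched IDs alongside one score keeps the cap on collaborating-institution matches explicit while still listing every match.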
Evaluate PubMed affiliation
1. Get list of institutions (these are strings) from identity.institutions for person under consideration.
2. Get article.affiliation for targetAuthor.
3. Preprocess.
Get list of stopwords from the institution-Stopwords field in application.properties.
Remove stopwords, commas, and dashes from article.affiliation and identity.institutions.
Ignore any words inside parentheses. These are typically countries and are not included in affiliation statements.
4. Attempt match from article.affiliation and identity.institutions. The logic here is that the keywords from an identity.institutions entry all appear somewhere in article.affiliation.
Here's how we do this match. Grab each affiliation and see if all the keywords are represented in a single affiliation. For example, suppose an author has a known affiliation in identity.institutions of "Weill Cornell Medical College". And, suppose the article affiliation is "Department of Pharmacology, Medical College of Weill Cornell." This would be a match because all the words in the identity affiliation are represented in the article affiliation.
If there's a match, output the following:
Maximum of one match.
If there's no match, go to 5.
5. Attempt match against homeInstitution-keywords.
Get homeInstitution-keywords from application.properties.
Look for cases where the homeInstitution keywords are present in the affiliation string in any order. Here's how we do this. Take any group of terms from homeInstitution, e.g., "weill|cornell". In order for this to be a match, both terms must be present in any order, with any case.
If there's a match, output the following:
Maximum of one match.
If there's no match, go to 6.
6. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-keywords (stored in application.properties). Look for overlap between the two.
If there's any one positive match between article and identity, output the following for all matches:
While there can be multiple matches, the maximum score returned for this type of match should be 1.
If there's no match, go to 7.
7. There's no match. Output:
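The preprocessing in step 3 and the keyword matches in steps 4 and 5 can be sketched as below. The function names are hypothetical. Lowercasing in step 4, stripping trailing periods, and testing keyword groups by substring containment are assumptions added so the worked example above matches; the issue does not specify them.

```python
import re
from typing import Iterable, List, Set


def keywords(text: str, stopwords: Iterable[str]) -> Set[str]:
    """Reduce an affiliation string to its set of significant words."""
    text = re.sub(r"\([^)]*\)", " ", text)  # ignore words inside parentheses
    text = re.sub(r"[,\-]", " ", text)      # commas and dashes become spaces
    stop = {s.lower() for s in stopwords}
    words = {w.strip(".").lower() for w in text.split()}
    return {w for w in words if w and w not in stop}


def matches_identity(article_affiliation: str,
                     identity_institutions: Iterable[str],
                     stopwords: Iterable[str]) -> bool:
    """Step 4: true if every keyword of some known institution is present."""
    stop = list(stopwords)
    article_words = keywords(article_affiliation, stop)
    for institution in identity_institutions:
        wanted = keywords(institution, stop)
        if wanted and wanted <= article_words:
            return True
    return False


def matches_keyword_groups(article_affiliation: str, groups: List[str]) -> bool:
    """Step 5: true if all terms of a group such as "weill|cornell" appear."""
    lowered = article_affiliation.lower()
    for group in groups:
        terms = [t for t in group.lower().split("|") if t]
        if terms and all(term in lowered for term in terms):
            return True
    return False
```

With the example from step 4, `matches_identity("Department of Pharmacology, Medical College of Weill Cornell.", ["Weill Cornell Medical College"], ["of"])` is true because the four identity keywords all occur in the article affiliation, regardless of order.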
Evaluate nonTargetAuthor
Decide which source to use
We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.
1. As set in application.properties, is use.scopus.articles=true?
2. Does article have any Scopus affiliation for nonTargetAuthor?
3. Does candidate article have any PubMed affiliation for nonTargetAuthor?
4. Return the following:
Evaluate Scopus affiliation
1. Preprocessing
A. Create scopusIDsNonTargetAuthor-Article.
B. Create scopusIDsNonTargetAuthor-Identity-KnownInstitutions.
C. Create scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions.
2. Determine overlap.
Compute the following:
countScopusIDNonTargetAuthor-Affiliations - non-unique count of all Scopus affiliation IDs for all authors
countScopusIDsNonTargetAuthor-Article-KnownInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-KnownInstitutions
countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions
countScopusIDsNonTargetAuthor-Article-NoMatch - count of cases in which none of the above are true
3. Compute overall score.
Get nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight and nonTargetAuthor-institutionalAffiliation-maxScore from application.properties.
4. Output values
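The counts in step 2 can be sketched as below. The function name and dictionary keys are hypothetical shorthand for the count variables named above. The sketch stops at the counts: the formula that combines them with the weight and maxScore properties is not given in this issue, so it is not reproduced here.

```python
from typing import Dict, Iterable, Set


def count_non_target_overlap(article_ids: Iterable[int],
                             known_ids: Set[int],
                             collaborating_ids: Set[int]) -> Dict[str, int]:
    """Tally how nonTargetAuthor Scopus affiliation IDs overlap with identity data.

    article_ids is deliberately non-unique: one entry per affiliation ID
    occurrence across all non-target authors.
    """
    counts = {"affiliations": 0, "knownInstitution": 0,
              "collaboratingInstitution": 0, "noMatch": 0}
    for afid in article_ids:
        counts["affiliations"] += 1
        in_known = afid in known_ids
        in_collaborating = afid in collaborating_ids
        if in_known:
            counts["knownInstitution"] += 1
        if in_collaborating:
            counts["collaboratingInstitution"] += 1
        # "None of the above" means the ID is in neither identity set.
        if not (in_known or in_collaborating):
            counts["noMatch"] += 1
    return counts
```

The two membership checks are independent, following the definitions above literally, so an ID present in both sets increments both counts.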
Evaluate PubMed affiliation
At this time, we're not evaluating PubMed affiliation for nonTargetAuthors.