wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Add first name likelihood scoring strategy #510

Open paulalbert1 opened 1 year ago

paulalbert1 commented 1 year ago

Background

For PMID = 34739873, Jeetayu Biswas (jeb9333) has a nameMatchFirstScore of 1.852.

For PMID = 23834756, John Moore (jpm2003) has a nameMatchFirstScore of 1.852.

In a sample set of ~4 million names in PubMed, of which 288,953 start with J:

These matches are not scored optimally. It is far less likely that a name will match on Jeetayu than on John, so the Jeetayu match should receive a higher score.

Accounting for these differences against the 250k records in WCM's dataset can improve overall accuracy, relatively speaking, by 6-7 percent, mainly by cutting down on false positives.

Low values indicate that a name is common, and high values indicate that a name is uncommon. Note that this approach accounts for likelihood within a given first letter. Q is a less common first initial, and as a result "Qi" would have a relatively higher penalty against it than, say, "John" when compared against all J's.

Requirements

New DynamoDB table

Create a new table for DynamoDB called "firstNameFrequency." Here is the file as JSON.

The file should live at ReCiter/src/main/resources/files/firstNameFrequency.json

To improve performance, the firstName value should be indexed in DynamoDB.

Create new values in application.properties

strategy.first.name.likelihood=true

strategy.nameMatchFirstLikelihoodScore.maximumScore=0.14
strategy.nameMatchFirstLikelihoodScore.weight=0.82
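As a rough illustration of how the three settings above might be read, here is a minimal sketch using plain `java.util.Properties`. The class name `LikelihoodSettings` and the fallback defaults are assumptions made for this example only; ReCiter may well bind these values differently (for instance through Spring configuration).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the three proposed settings; not part of ReCiter. */
final class LikelihoodSettings {
    final boolean enabled;
    final double maximumScore;
    final double weight;

    LikelihoodSettings(Properties props) {
        // The default values here are assumptions for the sketch, not values from the issue.
        enabled = Boolean.parseBoolean(
                props.getProperty("strategy.first.name.likelihood", "false"));
        maximumScore = Double.parseDouble(
                props.getProperty("strategy.nameMatchFirstLikelihoodScore.maximumScore", "0"));
        weight = Double.parseDouble(
                props.getProperty("strategy.nameMatchFirstLikelihoodScore.weight", "1"));
    }

    /** Parses properties-file text, e.g. the contents of application.properties. */
    static LikelihoodSettings parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new LikelihoodSettings(props);
    }
}
```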

Create new strategy in code

Following existing design patterns, create a new scoring strategy in the code, first.name.likelihood. This is somewhat similar to the Gender Strategy in that it looks up values from DynamoDB and has to account for the possibility of multiple values.

Here's how it should work:

To optimize performance, we should only be looking up a single name once each time Feature Generator is run.
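One way to meet the "look up each name only once per run" goal is a small per-run cache in front of the table lookup. The sketch below is an assumption about how that could be done, not existing ReCiter code: `FirstNameLookupCache` and the `backingLookup` function (standing in for the DynamoDB read) are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-run cache; create one instance per Feature Generator run. */
final class FirstNameLookupCache {
    private final Map<String, Double> cache = new HashMap<>();
    private final Function<String, Double> backingLookup; // stand-in for the DynamoDB read
    private int backingCalls = 0;

    FirstNameLookupCache(Function<String, Double> backingLookup) {
        this.backingLookup = backingLookup;
    }

    /** Returns the stored value for the name, or null if the table has no entry. */
    Double get(String firstName) {
        // containsKey is used instead of computeIfAbsent so that "not found"
        // (null) results are cached as well and never trigger a second lookup.
        if (!cache.containsKey(firstName)) {
            backingCalls++;
            cache.put(firstName, backingLookup.apply(firstName));
        }
        return cache.get(firstName);
    }

    /** Number of times the backing store was actually queried. */
    int backingCalls() {
        return backingCalls;
    }
}
```

Caching misses matters here because names absent from the table are expected (they fall back to the maximum score), so they would otherwise be re-queried on every article.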

Output in Feature Generator API

Here's how this should look in the Feature Generator API output. See the last line.

        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "lastName": "Cole"
          },
          "articleAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "lastName": "Cole"
          },
          "nameScoreTotal": 3.31,
          "nameMatchFirstType": "full-exact",
          "nameMatchFirstScore": 1.852,
          "nameMatchMiddleType": "identityNull-MatchNotAttempted",
          "nameMatchMiddleScore": 0.794,
          "nameMatchLastType": "full-exact",
          "nameMatchLastScore": 0.664,
          "nameMatchModifierScore": 0,
          "nameMatchFirstLikelihoodScore": -0.058
        },

Test cases

In each case, we are multiplying by strategy.nameMatchFirstLikelihoodScore.weight.

| personID | pmid | name | logic |
| --- | --- | --- | --- |
| bas4003 | 34973498 | Barzan | This firstName is missing from JSON so use strategy.nameMatchFirstLikelihoodScore.maximumScore |
| kpxu | 14700639 | Kangpu | This firstName is missing from JSON so use strategy.nameMatchFirstLikelihoodScore.maximumScore |
| sky2001 | 27890427 | Sae hee | Break up into "Sae" and "hee". Look up individually. Average result. |
| muh2006 | 35713518 | Mu ji | Break up into "Mu" and "ji". Look up individually. Average result. |
| stf3001 | 2331227 | Steven g | Look up "Steven" only. |
| bab2013 | 8069273 | A. bartley | Look up "bartley" only. |
| als4033 | 36114352 | Alia mahmoud hassan | Break up into "Alia", "mahmoud", and "hassan". Look up individually. Average result. |
| din9007 | 33631875 | Dilfuza | Look up "Dilfuza" |
| aha4006 | 32206638 | Alanoud | Look up "Alanoud" |
| ceg9018 | 12127811 | Cecily | Look up "Cecily" |
| dis4002 | 32576946 | Dimitry | Look up "Dimitry" |
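The lookup rules in the test cases above could be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation: the class and method names are hypothetical, the frequency table is modeled as an in-memory `Map` keyed by lowercase name (the real key casing is not specified here), single-letter parts are treated as initials and skipped, and a part missing from the table falls back to the maximum score even inside a multi-part name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the test-case rules; not ReCiter's actual code. */
final class FirstNameLikelihoodSketch {

    /** Splits on whitespace and drops initials such as "g" or "A.". */
    static List<String> lookupTokens(String firstName) {
        List<String> tokens = new ArrayList<>();
        for (String part : firstName.trim().split("\\s+")) {
            String letters = part.replace(".", "");
            if (letters.length() > 1) {
                // Lowercasing is an assumption about how table keys are stored.
                tokens.add(letters.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Averages the per-token values, then multiplies by the configured weight. */
    static double score(String firstName, Map<String, Double> frequencyTable,
                        double maximumScore, double weight) {
        List<String> tokens = lookupTokens(firstName);
        if (tokens.isEmpty()) {
            // Assumption: nothing to look up is handled like a missing name.
            return maximumScore * weight;
        }
        double sum = 0.0;
        for (String token : tokens) {
            // Missing names use the maximum score, as in the Barzan and Kangpu cases.
            sum += frequencyTable.getOrDefault(token, maximumScore);
        }
        return (sum / tokens.size()) * weight;
    }
}
```

Under these assumptions and the proposed settings (maximumScore 0.14, weight 0.82), a name absent from the table, such as "Barzan", would score 0.14 × 0.82 = 0.1148.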