Add first name likelihood scoring strategy

Background

For PMID = 34739873, Jeetayu Biswas (jeb9333) has a nameMatchFirstScore of 1.852.

For PMID = 23834756, John Moore (jpm2003) has a nameMatchFirstScore of 1.852.

In a sample set of ~4 million names in PubMed, of which 288,953 start with J:

1 name is Jeetayu. This is at the 0.8952 percentile.
10,981 names are John. This is at the 99.9519 percentile.

These matches are not scored optimally. A match on Jeetayu is far less likely to occur by chance than a match on John and should therefore receive a higher score.

Accounting for these differences against the 250k records in WCM's dataset can improve overall accuracy by a relative 6-7 percent, mainly by reducing false positives.

Low values indicate that a name is common, and high values indicate that a name is uncommon. Note that likelihood is computed within a given first initial. Q is a less common first initial, so "Qi", compared against all Q names, would carry a relatively higher penalty than, say, John compared against all J names.

Requirements

New DynamoDB table

Create a new DynamoDB table called "firstNameFrequency". Here is the file as JSON.

The file should live at ReCiter/src/main/resources/files/firstNameFrequency.json

To improve performance, the firstName value should be indexed in DynamoDB.

Create new values in application.properties

strategy.first.name.likelihood=true

strategy.nameMatchFirstLikelihoodScore.maximumScore=0.14
strategy.nameMatchFirstLikelihoodScore.weight=0.82
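
As a rough illustration of how these three values might be read, here is a minimal sketch using plain java.util.Properties. The class name LikelihoodConfig and the fallback defaults are assumptions made for this example. ReCiter's actual configuration mechanism may differ and should be followed instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the three new settings; not an existing ReCiter class.
final class LikelihoodConfig {
    final boolean enabled;
    final double maximumScore;
    final double weight;

    LikelihoodConfig(Properties props) {
        // Assumption: the strategy is off when the flag is absent, and the
        // numeric defaults mirror the values proposed in this issue.
        enabled = Boolean.parseBoolean(
                props.getProperty("strategy.first.name.likelihood", "false"));
        maximumScore = Double.parseDouble(
                props.getProperty("strategy.nameMatchFirstLikelihoodScore.maximumScore", "0.14"));
        weight = Double.parseDouble(
                props.getProperty("strategy.nameMatchFirstLikelihoodScore.weight", "0.82"));
    }

    // Parse properties-file text; convenient for trying the sketch without a file.
    static LikelihoodConfig parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return new LikelihoodConfig(props);
    }

    public static void main(String[] args) {
        LikelihoodConfig config = parse(
                "strategy.first.name.likelihood=true\n"
              + "strategy.nameMatchFirstLikelihoodScore.maximumScore=0.14\n"
              + "strategy.nameMatchFirstLikelihoodScore.weight=0.82\n");
        System.out.println(config.enabled + " " + config.maximumScore + " " + config.weight);
    }
}
```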

Create new strategy in code

Following existing design patterns, create a new scoring strategy in the code, first.name.likelihood. This is somewhat similar to the Gender Strategy in that it looks up values from DynamoDB and has to account for the possibility of multiple values.

Here's how it should work:

Remove periods from institutionalAuthorNameFirstName
Find all substrings, as delimited by a space, in institutionalAuthorNameFirstName.
Exclude any substrings that are one character
Look up each remaining substring in firstNameFrequency. If there are several substrings, look them up individually and average the results.
If a lookup returns no result, use the value in strategy.nameMatchFirstLikelihoodScore.maximumScore for that substring.
Multiply whatever you retrieve by strategy.nameMatchFirstLikelihoodScore.weight.
The result is nameMatchFirstLikelihoodScore.
Include this when computing nameScoreTotal.
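
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the intended implementation. The class FirstNameLikelihoodSketch and its methods are hypothetical, an in-memory Map stands in for the DynamoDB table, keys are assumed to be lowercase, and multiple substrings are averaged as described in the test cases in this issue. A name with no usable substring (for example a lone initial) is assumed to contribute 0, which the issue does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; class and method names are not from the ReCiter codebase.
final class FirstNameLikelihoodSketch {

    // Remove periods, split on whitespace, and drop one-character substrings.
    static List<String> tokenize(String firstName) {
        List<String> tokens = new ArrayList<>();
        if (firstName == null) {
            return tokens;
        }
        for (String part : firstName.replace(".", "").trim().split("\\s+")) {
            if (part.length() > 1) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // frequency stands in for the firstNameFrequency table (name -> value).
    static double score(String firstName, Map<String, Double> frequency,
                        double maximumScore, double weight) {
        List<String> tokens = tokenize(firstName);
        if (tokens.isEmpty()) {
            return 0.0; // Assumption: nothing to look up means no adjustment.
        }
        double sum = 0.0;
        for (String token : tokens) {
            // Assumption: table keys are stored in lowercase.
            Double value = frequency.get(token.toLowerCase(Locale.ROOT));
            // A missing name falls back to the configured maximum score.
            sum += (value != null) ? value : maximumScore;
        }
        // Average across substrings, then apply the weight.
        return (sum / tokens.size()) * weight;
    }

    public static void main(String[] args) {
        // The table value below is made up purely for demonstration.
        Map<String, Double> table = Map.of("steven", -0.05);
        System.out.println(tokenize("A. bartley"));
        System.out.println(score("Steven g", table, 0.14, 0.82));
        System.out.println(score("Barzan", table, 0.14, 0.82));
    }
}
```

The resulting value would then be added into nameScoreTotal alongside the existing name match scores.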

To optimize performance, each distinct name should be looked up only once per Feature Generator run.
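
One way to meet this requirement is a small per-run cache in front of the table lookup. The sketch below is only an assumption about how that could look. CachedFrequencyLookup is a hypothetical class, and the backend function stands in for the real DynamoDB query. Misses are cached as well, so a name absent from the table is also queried only once. The sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalDouble;
import java.util.function.Function;

// Hypothetical per-run cache; create one instance per Feature Generator run.
final class CachedFrequencyLookup {
    private final Function<String, OptionalDouble> backend; // stand-in for the DynamoDB query
    private final Map<String, OptionalDouble> cache = new HashMap<>();
    private int backendCalls = 0;

    CachedFrequencyLookup(Function<String, OptionalDouble> backend) {
        this.backend = backend;
    }

    // Returns the cached result if present; otherwise queries the backend once.
    OptionalDouble lookup(String name) {
        return cache.computeIfAbsent(name, key -> {
            backendCalls++;
            return backend.apply(key);
        });
    }

    // Number of times the backend was actually queried.
    int backendCalls() {
        return backendCalls;
    }

    public static void main(String[] args) {
        // The value below is made up purely for demonstration.
        CachedFrequencyLookup lookup = new CachedFrequencyLookup(
                name -> name.equals("john") ? OptionalDouble.of(-0.1) : OptionalDouble.empty());
        lookup.lookup("john");
        lookup.lookup("john");
        lookup.lookup("barzan");
        System.out.println("backend calls: " + lookup.backendCalls());
    }
}
```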

Output in Feature Generator API

Here's how this should look in the Feature Generator API output. See the last line.

        "authorNameEvidence": {
          "institutionalAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "lastName": "Cole"
          },
          "articleAuthorName": {
            "firstName": "Curtis",
            "firstInitial": "C",
            "lastName": "Cole"
          },
          "nameScoreTotal": 3.31,
          "nameMatchFirstType": "full-exact",
          "nameMatchFirstScore": 1.852,
          "nameMatchMiddleType": "identityNull-MatchNotAttempted",
          "nameMatchMiddleScore": 0.794,
          "nameMatchLastType": "full-exact",
          "nameMatchLastScore": 0.664,
          "nameMatchModifierScore": 0,
          "nameMatchFirstLikelihoodScore": -0.058
        },

Test cases

In each case, we are multiplying by strategy.nameMatchFirstLikelihoodScore.weight.

personID	pmid	name	logic
bas4003	34973498	Barzan	This firstName is missing from JSON so use strategy.nameMatchFirstLikelihoodScore.maximumScore
kpxu	14700639	Kangpu	This firstName is missing from JSON so use strategy.nameMatchFirstLikelihoodScore.maximumScore
sky2001	27890427	Sae hee	Break up into "Sae" and "hee". Look up individually. Average result.
muh2006	35713518	Mu ji	Break up into "Mu" and "ji". Look up individually. Average result.
stf3001	2331227	Steven g	Look up "Steven" only.
bab2013	8069273	A. bartley	Look up "bartley" only.
als4033	36114352	Alia mahmoud hassan	Break up into "Alia", "mahmoud", and "hassan". Look up individually. Average result.
din9007	33631875	Dilfuza	Look up "Dilfuza"
aha4006	32206638	Alanoud	Look up "Alanoud"
ceg9018	12127811	Cecily	Look up "Cecily"
dis4002	32576946	Dimitry	Look up "Dimitry"
