wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

First name scoring does not properly match in cases where nameMatchFirstType should be "full-conflictingAllButInitials" #474

Closed: paulalbert1 closed this issue 3 years ago

paulalbert1 commented 3 years ago

Description

First name scoring has been off since the most recent commit to the master branch.

In general, there are far too many cases where nameMatchFirstType = full-exact.

The attached file lists several thousand cases where nameMatchFirstType should be full-conflictingAllButInitials. Note that it includes a handful of false positives, which are hard to weed out programmatically in MySQL. Examples: cases that need a Levenshtein distance comparison, "Gabriel Glenn" vs. "G Glenn", etc.

first name match errors-2021-08-20.csv
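One way the false positives might be flagged outside MySQL is sketched below. This is only an illustration, not ReCiter's scoring logic: the function names, the token-by-token initial comparison, and the `max_distance=1` threshold are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def initials_compatible(name_a: str, name_b: str) -> bool:
    """True when the names agree token by token, treating a one-letter
    token as an initial of the longer token (hypothetical rule)."""
    tokens_a = name_a.lower().split()
    tokens_b = name_b.lower().split()
    if not tokens_a or len(tokens_a) != len(tokens_b):
        return False
    for ta, tb in zip(tokens_a, tokens_b):
        if ta == tb:
            continue
        short, long_ = sorted((ta, tb), key=len)
        if len(short) == 1 and long_.startswith(short):
            continue
        return False
    return True


def likely_false_positive(name_a: str, name_b: str, max_distance: int = 1) -> bool:
    """Flag a pair that probably should not count as conflicting.
    max_distance is an assumed threshold, not a ReCiter setting."""
    return (initials_compatible(name_a, name_b)
            or levenshtein(name_a.lower(), name_b.lower()) <= max_distance)
```

For instance, `likely_false_positive("Gabriel Glenn", "G Glenn")` flags the pair through the initials rule, while a near-identical spelling such as "Jon" vs. "John" (a made-up pair) is flagged through the edit distance.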

Example

July

This was correct:

select pmid, nameMatchFirstScore, nameMatchFirstType, articleAuthorNameFirstName
from personArticle20210719
where personIdentifier = 'ajg9004'
and articleAuthorNameFirstName = 'Amita'

pmid      nameMatchFirstScore  nameMatchFirstType              articleAuthorNameFirstName
24658103  -2.646               full-conflictingAllButInitials  Amita

August

This is incorrect:

select pmid, nameMatchFirstScore, nameMatchFirstType, articleAuthorNameFirstName
from personArticle
where personIdentifier = 'ajg9004'
and articleAuthorNameFirstName = 'Amita'

pmid      nameMatchFirstScore  nameMatchFirstType  articleAuthorNameFirstName
24658103  0                    full-exact          Amita
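The July and August results above can also be compared in one pass by joining the snapshot table to the current one. The sketch below is a rough illustration: it uses an in-memory SQLite database as a stand-in for MySQL so that it runs on its own, assumes that (personIdentifier, pmid) identifies a row in both tables, and loads only the single row quoted in this issue. The helper names `build_demo_db` and `find_regressions` are made up for the example.

```python
import sqlite3

# Rows that were full-conflictingAllButInitials in the July snapshot
# but are full-exact in the current table.
REGRESSION_QUERY = """
SELECT j.pmid,
       j.nameMatchFirstType AS july_type,
       a.nameMatchFirstType AS august_type
FROM personArticle20210719 AS j
JOIN personArticle AS a
  ON a.personIdentifier = j.personIdentifier
 AND a.pmid = j.pmid
WHERE j.nameMatchFirstType = 'full-conflictingAllButInitials'
  AND a.nameMatchFirstType = 'full-exact'
"""


def build_demo_db() -> sqlite3.Connection:
    """Create both tables in memory and load the row shown in this issue."""
    conn = sqlite3.connect(":memory:")
    for table in ("personArticle20210719", "personArticle"):
        conn.execute(
            f"CREATE TABLE {table} ("
            "personIdentifier TEXT, pmid INTEGER, nameMatchFirstScore REAL, "
            "nameMatchFirstType TEXT, articleAuthorNameFirstName TEXT)"
        )
    conn.execute(
        "INSERT INTO personArticle20210719 VALUES (?, ?, ?, ?, ?)",
        ("ajg9004", 24658103, -2.646, "full-conflictingAllButInitials", "Amita"),
    )
    conn.execute(
        "INSERT INTO personArticle VALUES (?, ?, ?, ?, ?)",
        ("ajg9004", 24658103, 0, "full-exact", "Amita"),
    )
    return conn


def find_regressions(conn: sqlite3.Connection) -> list:
    """Return (pmid, july_type, august_type) for every changed row."""
    return conn.execute(REGRESSION_QUERY).fetchall()


regressions = find_regressions(build_demo_db())
print(regressions)
```

The same SELECT should work against the real MySQL tables if the join-key assumption holds, which would avoid checking one personIdentifier at a time.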