wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Downweight cases where org unit doesn't match #523

Open paulalbert1 opened 10 months ago

paulalbert1 commented 10 months ago

Background

There are a number of cases where a person has org units in their profile and the org unit in the article's affiliation doesn't even come close to matching any of them. To this point, we've ignored such cases. But maybe we can use this data to cut down on false positives.

An example is personIdentifier = sue2002 and PMID = 36630615. Psychiatry (sue2002's org unit) is very different from Cell and Developmental Biology (the department in the article's affiliation).

For our data set, I estimate this will improve accuracy by 0.5%, by reducing the number of false positives. But given our use of organizational synonyms, the only way to tell for certain would be to run this for everyone.

Requirements

This Java file outputs, among other things, a value called strategy.orgUnitScoringStrategy.organizationalUnitDepartmentMatchingScore. This is for a positive departmental match. I want to update the code so it also outputs an organizationalUnitDepartmentNegativeMatchingScore when all of the following hold:

  1. identity.getOrganizationalUnits() != null
  2. articleAffiliation != null
  3. The articleAffiliation string contains "Department of ", "Division of ", etc., but the unit it names fails to match any of the identity's organizational units.
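
The three conditions above could be sketched roughly as follows. This is a minimal, untested-against-ReCiter illustration, not the project's actual code: the class name, the negativeScore method, the unit-extracting regex, and the plain case-insensitive substring comparison are all assumptions, and a real implementation would also need to consult the organizational synonyms. The -0.4 constant is a placeholder for the configurable organizationalUnitDepartmentNegativeMatchingScore.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeOrgUnitMatchSketch {

    // Placeholder value (assumption); stands in for organizationalUnitDepartmentNegativeMatchingScore.
    static final double NEGATIVE_MATCHING_SCORE = -0.4;

    // Captures the unit name following "Department of " or "Division of ",
    // up to the next comma or period. The label list is illustrative only.
    static final Pattern UNIT_LABEL =
            Pattern.compile("\\b(?:Department|Division) of ([^,.]+)", Pattern.CASE_INSENSITIVE);

    // Returns the negative score only when all three conditions hold; otherwise 0.
    static double negativeScore(List<String> identityOrgUnits, String articleAffiliation) {
        // Conditions 1 and 2: both sides must be present.
        // Treating an empty list like null is an assumption of this sketch.
        if (identityOrgUnits == null || identityOrgUnits.isEmpty() || articleAffiliation == null) {
            return 0.0;
        }
        Matcher m = UNIT_LABEL.matcher(articleAffiliation);
        boolean labelFound = false;
        while (m.find()) {
            labelFound = true;
            String articleUnit = m.group(1).trim().toLowerCase(Locale.ROOT);
            for (String identityUnit : identityOrgUnits) {
                if (identityUnit != null
                        && identityUnit.toLowerCase(Locale.ROOT).contains(articleUnit)) {
                    return 0.0; // a positive match exists, so no downweight
                }
            }
        }
        // Condition 3: a unit label was present but nothing matched.
        return labelFound ? NEGATIVE_MATCHING_SCORE : 0.0;
    }

    public static void main(String[] args) {
        List<String> units = Arrays.asList("Payne Whitney (Psychiatry)");
        String affiliation =
                "Department of Cell and Developmental Biology, University College London, London, UK.";
        System.out.println(negativeScore(units, affiliation));
    }
}
```

An affiliation with no "Department of " or "Division of " label gets no downweight in this sketch, since there is nothing to compare against.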

See this PR. It hasn't been "tested" and it probably doesn't "work," but I think it's on the right track.

Here's how each downweight value affects the number of true and false positives and negatives. This is from a set of ~200,000 articles.

Downweight  Error count  False negative  False positive  True negative  True positive
0           7657         3779            3878            11094          26427
0.1         7560         3976            3584            11388          26230
0.2         7442         4193            3249            11723          26013
0.3         7279         4445            2834            12138          25761
0.4         7303         4675            2628            12344          25531
0.5         7374         5051            2323            12649          25155
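
Each error count above is just false negatives plus false positives, so the figures can be checked and the lowest-error downweight picked out mechanically. Here is a small sketch using only the numbers reported in this issue; the class and method names are illustrative.

```java
class DownweightErrorSketch {

    static final double[] DOWNWEIGHTS = {0.0, 0.1, 0.2, 0.3, 0.4, 0.5};

    // Rows copied from the results above:
    // {false negatives, false positives, true negatives, true positives}
    static final int[][] COUNTS = {
        {3779, 3878, 11094, 26427},
        {3976, 3584, 11388, 26230},
        {4193, 3249, 11723, 26013},
        {4445, 2834, 12138, 25761},
        {4675, 2628, 12344, 25531},
        {5051, 2323, 12649, 25155},
    };

    // Error count = false negatives + false positives.
    static int errorCount(int[] row) {
        return row[0] + row[1];
    }

    // Accuracy = correct decisions / all decisions.
    static double accuracy(int[] row) {
        int total = row[0] + row[1] + row[2] + row[3];
        return (double) (row[2] + row[3]) / total;
    }

    // Index of the downweight with the fewest errors.
    static int bestIndex() {
        int best = 0;
        for (int i = 1; i < COUNTS.length; i++) {
            if (errorCount(COUNTS[i]) < errorCount(COUNTS[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (int i = 0; i < COUNTS.length; i++) {
            System.out.printf("downweight=%.1f errors=%d accuracy=%.4f%n",
                    DOWNWEIGHTS[i], errorCount(COUNTS[i]), accuracy(COUNTS[i]));
        }
        System.out.printf("lowest error count at downweight=%.1f%n", DOWNWEIGHTS[bestIndex()]);
    }
}
```

On these figures the error count is lowest at a downweight of 0.3 (7279 errors), with 0.4 close behind (7303). False positives keep falling as the downweight grows, but beyond 0.3 the false negatives rise faster than the false positives drop.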

Test case

The combination of personIdentifier = sue2002 and PMID = 36630615 should return this...

        "organizationalUnitEvidence": [
          {
            "identityOrganizationalUnit": "Payne Whitney (Psychiatry)",
            "articleAffiliation": "Department of Cell and Developmental Biology, University College London, London, UK.",
            "organizationalUnitType": "DEPARTMENT",
            "organizationalUnitMatchingScore": -0.4,
            "organizationalUnitModifierScore": 0
          }
        ],