wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

Cross reference funding statement in PubMed against institutional records of grant funding to improve recall #89

Closed paulalbert1 closed 7 years ago

paulalbert1 commented 9 years ago

Monica L. Guzman (mlg2007) wrote this paper: 25557492. A good piece of evidence is her (or her co-authors') statement in the PubMed record of where funding came from.

If you look under grant support for the PubMed record, you see these declarations:

Take each five- or six-digit code and parse it like so:

Now check whether these IDs appear in rc_identity_grant.sponsorAwardId, cross-referencing against the CWID mlg2007.

The last ID is listed in the sponsorAwardId field as 5 R21 CA158728-02, which matches against 158275. Ignore the characters before and after the six-digit codes. In the above example, ignore "5 R21 CA" and "-02".
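The cross-reference described above could be sketched roughly as follows. This is only an illustration, not ReCiter's actual code: the class name `GrantCrossReference`, both method names, and the in-memory map standing in for the `rc_identity_grant` table are all hypothetical. It reduces both the PubMed grant strings and the `sponsorAwardId` values to their five- or six-digit codes and checks for any overlap for a given CWID.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the ReCiter code base.
class GrantCrossReference {

    // A run of exactly five or six digits that is not part of a longer number.
    private static final Pattern CODE = Pattern.compile("(?<!\\d)\\d{5,6}(?!\\d)");

    // Returns every five- or six-digit code found in a grant string,
    // ignoring the characters before and after it.
    static Set<String> numericCodes(String grantString) {
        Set<String> codes = new LinkedHashSet<>();
        Matcher m = CODE.matcher(grantString);
        while (m.find()) {
            codes.add(m.group());
        }
        return codes;
    }

    // awardsByCwid is an assumed stand-in for rows of
    // rc_identity_grant (CWID -> sponsorAwardId values).
    static boolean hasMatchingGrant(String cwid,
                                    List<String> pubmedGrants,
                                    Map<String, List<String>> awardsByCwid) {
        Set<String> institutional = new LinkedHashSet<>();
        for (String awardId : awardsByCwid.getOrDefault(cwid, List.of())) {
            institutional.addAll(numericCodes(awardId));
        }
        for (String grant : pubmedGrants) {
            for (String code : numericCodes(grant)) {
                if (institutional.contains(code)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With this sketch, `numericCodes("5 R21 CA158728-02")` keeps only the six-digit run and drops "5 R21 CA" and "-02".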

Related to #49

paulalbert1 commented 9 years ago

The logic should be "[letter][letter][5- or 6-digit number]", and then if anything that is not a number appears, such as a space, dash, or parenthesis, that's the end of the string. Don't worry about the edge cases. These data can be messy.
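One way to express that rule is a regular expression. The sketch below is hypothetical (the class `GrantCodeExtractor` and its `extract` method do not come from the ReCiter source). It also assumes a single optional space may sit between the two letters and the digits, since declarations such as "CA 140409" appear in PubMed records; that allowance is not part of the rule as stated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the ReCiter code base.
class GrantCodeExtractor {

    // Two letters, an optional space (an assumption), then five or six
    // digits. The lookahead ends the code at the first non-digit and
    // rejects runs of seven or more digits.
    private static final Pattern GRANT_CODE =
            Pattern.compile("([A-Za-z]{2})\\s?(\\d{5,6})(?!\\d)");

    // Returns codes such as "CA158728", with any inner space removed
    // and the letters upper-cased.
    static List<String> extract(String grantString) {
        List<String> codes = new ArrayList<>();
        Matcher m = GRANT_CODE.matcher(grantString);
        while (m.find()) {
            codes.add(m.group(1).toUpperCase() + m.group(2));
        }
        return codes;
    }
}
```

Messy input that does not fit the pattern is skipped rather than repaired, in keeping with the advice not to worry about edge cases.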

bmudhavathu commented 9 years ago

Please provide more information on this. Do we need to parse this information before inserting it into the DB table, or do we need to extract it while parsing the XML files from the PubMed database?

bmudhavathu commented 9 years ago

Hi Jie/Paul, what needs to be done after the cross-reference check? We can get the grant/fund IDs from PubMed and verify them against the DB table. After extracting and matching the grant IDs, is there any further step? Or do we need to update the existing sponsorAwardId values with these IDs? As it stands, whatever we extract is temporary, because we are not doing anything with it afterward. Please advise.

bmudhavathu commented 9 years ago

Hi Jin,

Where should this code be implemented? Please provide pseudocode.

michaelbales1 commented 9 years ago

Hi Balu,

If there is a match, fundingStatementScore should be assigned a value of 1. This score is referenced in the following locations in the code:

/src/main/java/reciter/utils/writer/AnalysisCSVWriter.java
/src/main/java/reciter/erroranalysis/AnalysisObject.java
/src/main/java/reciter/erroranalysis/AnalysisTranslator.java

Is this helpful? Please let us know if you need any additional details.
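As a rough illustration of the scoring rule, the sketch below returns 1 when any grant code from the PubMed record also appears among the person's institutional grant codes. The class and method names, the `double` return type, and the value 0 for "no match" are assumptions; the thread only states that a match yields 1.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical helper; not part of the ReCiter code base.
class FundingStatementScorer {

    // Returns 1.0 if the two sets share at least one grant code,
    // otherwise 0.0 (the no-match value is an assumption).
    static double fundingStatementScore(Set<String> pubmedGrantCodes,
                                        Set<String> institutionalGrantCodes) {
        boolean match = !Collections.disjoint(pubmedGrantCodes, institutionalGrantCodes);
        return match ? 1.0 : 0.0;
    }
}
```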

bmudhavathu commented 9 years ago

Yes, thanks Michael. That helps me complete the code-level implementation.

bmudhavathu commented 9 years ago

Hi Jie Lin / Michael,

If you look under grant support for the PubMed record, you see these declarations:
• 1 DP2 OD007399-01/OD/NIH HHS/United States
• CA 140409/CA/NCI NIH HHS/United States

As stated in the first comment above about the PubMed record: do we need to extract this information when reading the XML file in AbstractXMLFetcher.java, or is there another way to get these strings from an existing class? Please advise.

bmudhavathu commented 9 years ago

Hi Jie Lin / Michael / Paul,

I think I can use the PubMedXMLFetcher.java class to get the funding statement string for the PubMed record above by CWID. I hope I am heading in the right direction based on my analysis; please advise if not.

michaelbales1 commented 9 years ago

Jie reviewed the code and it looks good.

michaelbales1 commented 9 years ago

Hanumantha has integrated the code into ReCiterClusterer; he will update the code so that it will write a score to the CSV file or to the database.

paulalbert1 commented 7 years ago

Issue closed. We are currently using the grant IDs from Oracle (since they are the same as the sponsor award ID), and I remember you mentioned that since we get the IDs from Oracle, no parsing is now required.