Update ingest_sqlite.py to use ADS document meta-data fields

wkiri commented 3 years ago

The new parser-indexer-py populates several new fields with document meta-data using the ADS service. These fields start with "ads:". We want to update ingest_sqlite.py to use these fields, when available. Otherwise, fall back to the "grobid:" fields:

https://github.com/wkiri/MTE/blob/c7fd59782ee25ce96ca0280da8c54a90dcec8798/src/ingest_sqlite.py#L38-L41

wkiri commented 3 years ago

Note: ADS provides authors as a list, so we'll want to format that into a (comma-separated) string.

wkiri commented 2 years ago

Steven also wants to handle the new format for "rel" field when there are no relations (empty list instead of missing field).

stevenlujpl commented 2 years ago

@wkiri I have updated the ingest_sqlite.py script to use ADS fields if available. Please see the code changes in this commit (https://github.com/wkiri/MTE/commit/7bff89f5400e023a0399dc850097d65ba3b98771). I didn't have to change anything for the rel field, as the code already handles that case.

Please see below for the summary of the changes I made:

Title field: if ads:title exists, we will use the title returned by ADS database. Otherwise, we will use the one extracted from grobid. If grobid title field does not exist, an empty string will be inserted to the DB.
Authors field: the logic is similar to that of the title field. The authors returned from the ADS database are stored in a list. Instead of converting the list into a comma-separated string as you suggested above, I converted it into an "and"-separated string. We shouldn't use commas because author names sometimes contain commas (e.g., in LPSC abstracts, author names are often in the format "last name, first initial").
Primary author field: we directly use the ads:primary_author field, as there is no grobid field for primary author. If the ads:primary_author field doesn't exist, the existing logic for extracting the primary author from the authors field is used unchanged.
Affiliation field: if the ads:affiliation field exists and doesn't contain a hyphen (the ADS database uses hyphens as placeholders for empty affiliation fields), we use the ads:affiliation field. Otherwise, we use the grobid field. The affiliations returned from the ADS database are also a list, which I convert into an "and"-separated string.
Venue field: we directly use ads:pub_venue field as there is no grobid field for venue. If ads:pub_venue doesn't exist, we will insert an empty string to the DB.
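
For readers who want the fallback rules above in one place, here is a minimal sketch. It is not the code from the commit. The keys ads:title, ads:primary_author, ads:affiliation and ads:pub_venue come from this thread; the ADS authors key and all grobid key names below are hypothetical placeholders, and the primary-author fallback is a stand-in for the script's existing logic. The hyphen check is one interpretation: any affiliation entry that is exactly "-" counts as a placeholder.

```python
# Hypothetical key names (not confirmed by the thread).
ADS_AUTHORS = "ads:author"
GROBID_TITLE = "grobid:title"
GROBID_AUTHORS = "grobid:authors"
GROBID_AFFILIATIONS = "grobid:affiliations"
SEP = " and "


def as_string(value):
    """Join a list with ' and '; pass a string through; map None to ''."""
    if isinstance(value, (list, tuple)):
        return SEP.join(str(v) for v in value)
    return value or ""


def is_placeholder(affiliations):
    """True if the ADS affiliations are missing or contain '-' placeholders."""
    if not affiliations:
        return True
    if isinstance(affiliations, str):
        affiliations = [affiliations]
    return any(str(a).strip() == "-" for a in affiliations)


def select_fields(doc):
    """Prefer ADS metadata and fall back to grobid (or '') per field."""
    title = doc.get("ads:title") or doc.get(GROBID_TITLE) or ""
    authors = as_string(doc.get(ADS_AUTHORS) or doc.get(GROBID_AUTHORS))
    # Stand-in fallback only: take the first entry of the author string.
    primary = doc.get("ads:primary_author") or authors.split(SEP)[0]
    ads_aff = doc.get("ads:affiliation")
    if is_placeholder(ads_aff):
        affiliations = as_string(doc.get(GROBID_AFFILIATIONS))
    else:
        affiliations = as_string(ads_aff)
    venue = doc.get("ads:pub_venue") or ""
    return {
        "title": title,
        "authors": authors,
        "primary_author": primary,
        "affiliations": affiliations,
        "venue": venue,
    }
```

With this sketch, a document whose ADS affiliations are all "-" keeps its grobid affiliations, while a document with no metadata at all yields empty strings for every field.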

stevenlujpl commented 2 years ago

@wkiri Please feel free to close this issue after you test the ingest_sqlite.py script. Thanks.

wkiri commented 2 years ago

@stevenlujpl I have tested the new ingest_sqlite.py on the PHX database (total: 36 documents). Please see these text files to compare old/new output:

/proj/mte/sqlite/phx-d-<fieldname>-1.2.5 (used hand-edited /proj/mte/results/phx-jsre-edited-titles-authors-v2.jsonl as the input JSON file)
/proj/mte/sqlite/phx-d-<fieldname>-1.2.6-2 (using raw output from parser, /proj/mte/jsre-v2.jsonl as the input JSON file)

To reproduce or re-run (have to also run update_sqlite.py so it prunes the docs to only the relevant ones):

$ ./ingest_sqlite.py $JSON_FILE -d $DB_FILE -m $MISSION -v lpsc > ingest-DB-$MISSION-$VERSION.log
$ ./update_sqlite.py -r $REVIEWER /proj/mte/results/$ANN_DIR $DB_FILE $MISSION -ro > update-DB-$MISSION-$VERSION.log
$ for f in title authors affiliations primary_author venue doc_url ; do sqlite3 $DB_FILE "select doc_id,$f from documents;" > phx-d-$f-1.2.6-2 ; done
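
The old/new comparison of these per-field dump files could also be scripted. The sketch below is an illustration, not part of the repository: it assumes each line has the form doc_id|value (the sqlite3 default list output) and that values contain no embedded newlines. The function names are made up for this example.

```python
def load_dump(lines):
    """Parse 'doc_id|value' lines into a dict keyed by doc_id."""
    rows = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first '|' only, so values may contain '|'.
        doc_id, _, value = line.partition("|")
        rows[doc_id] = value
    return rows


def diff_dumps(old, new):
    """Return (doc_id, old_value, new_value) for each doc whose value differs."""
    changes = []
    for doc_id in sorted(set(old) | set(new)):
        if old.get(doc_id) != new.get(doc_id):
            changes.append((doc_id, old.get(doc_id), new.get(doc_id)))
    return changes
```

Usage would be along the lines of `diff_dumps(load_dump(open(old_path)), load_dump(open(new_path)))`, where a doc_id missing from one file shows up with None on that side.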

The goal here was to see if the new ingest_sqlite.py could automatically do some of the work we previously had to do by hand to generate output. Where there are discrepancies, it is worth first checking if this is something that can be fixed in ingest_sqlite.py automatically. But if not, it is ok to do a post-ingestion hand-editing pass with my previously developed editing script. It was very time-consuming when I had to do it for ALL docs. It is not bad if I just do it for a handful of docs.

Title field: These are now in all-caps, which is perhaps okay if that is how they appear in ADS, or we could call something like title() on them to improve readability. Most of them look good. I am capturing the ones that need further editing here in case it inspires any additional updates to the ADS parsing code (I do not know if the errors are due to failing to find the content in ADS and falling back to grobid, or some other explanation - perhaps the ingest .log file could be enriched to capture this info):
- 2009_2047|Multi-Spectral Imaging of the Phoenix Landing Site: Characteristics of Surface and Subsurface Ice, Rocks, and Soils. D
- 2010_2738|COMPARISON OF SOME PHOENIX AND GUSEV SOIL TYPES: INFERENCES ON POSSIBLE ORIGIN AND GLOBAL DISTRIBUTION. W
- 2011_2351|(Ca,Mg)-carbonate a nd Mg-carbonate a t t he P hoenix L anding Site: Evaluation of t he Phoenix L ander's Thermal Evolved G as A nalyzer (TEGA) Data Using Laboratory Simulations
- 2012_2260|STABILITY OF SHALLOW BURIED ICE ON MARS. M
- 2014_2043|PRELIMINARY IDENTIFICATION OF MINERALS IN SILT-AND SAND-SIZE GRAINS ON MARS FROM PHOENIX OM IMAGES USING THREE-CHANNEL COLOR
  - (missing final word "Photometry")
Authors field: These do not look quite right - they are not and-separated as described above.
Primary author field: 4 are missing, 3 are wrong, 3 might be improvements
- Missing:
  - 2009_1067|Arvidson
  - 2009_1329|Smith
  - 2009_1940|Sizemore
  - 2010_1481|Hecht
- Wrong:
  - 2009_1667|Markiewicz -> 2009_1667|Lander
  - 2009_2097|Shaw -> 2009_2097|Amy Shaw
  - 2014_2043|Velbel -> 2014_2043|Photometry
- Different but maybe better? Needs checking.
  - 2009_2196|Lauer Jr -> 2009_2196|Lauer
  - 2011_1516|Velbel -> 2011_1516|Vel-Bel
  - 2012_2276|Archer Jr -> 2012_2276|Archer
Affiliations field: I don't see any ADS-inspired updates. All content is the grobid affiliations.
Venue field: Matches previous output exactly. Looks great!
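
On the title() idea for all-caps titles: plain str.title() also lowercases acronyms (TEGA becomes Tega) and capitalizes after apostrophes. A rough sketch of a gentler variant follows. The function name and the acronym set are illustrative assumptions, and it leaves mixed-case titles untouched.

```python
def readable_title(title, acronyms=frozenset({"OM", "TEGA"})):
    """Capitalize each word of an all-caps title, keeping known acronyms."""
    if not title.isupper():
        return title  # already mixed case; leave as reported
    out = []
    for word in title.split():
        core = word.strip("()[]:;,.")
        if core in acronyms:
            out.append(word)
            continue
        # Keep leading brackets so the first letter still gets capitalized.
        body = word.lstrip("([")
        prefix = word[: len(word) - len(body)]
        out.append(prefix + body.capitalize())
    return " ".join(out)
```

This does not apply headline rules for small words such as "of" or "on", so results would still need a quick visual check.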

wkiri commented 2 years ago

Actually, I am realizing that the JSON file I used pre-dates the updates to lpsc_parser.py which is the step that actually calls ADS. I will need to re-run the parser for the PHX docs to fully test this. Please wait until I have a chance to do so, and I will share an update then.

wkiri commented 2 years ago

@stevenlujpl I generated a new JSONL file for PHX using lpsc_parser.py: /proj/mte/results/phx-jsre-v2-ads.jsonl

Please see these text files to compare old/new output:

/proj/mte/sqlite/phx-d-<fieldname>-1.2.5 (used hand-edited /proj/mte/results/phx-jsre-edited-titles-authors-v2.jsonl as the input JSON file)
/proj/mte/sqlite/phx-d-<fieldname>-1.2.6 (using raw output from parser + ADS, /proj/mte/results/phx-jsre-v2-ads.jsonl as the input JSON file)

title - Looks good in general. The only problem I see is for this document:
- 2009_2196|Thermal and Evolved Gas Analysis of Magnesium Perchlorate: Implications for Perchlorates in Soils at the Mars Phoenix Landing Site Perchlorates in Soils at the Mars Phoenix Landing Site. - extra words (and period) included. Check if this is how ADS reports it.
authors - Has initials after each last name instead of before it, but seems ok. The only difference I see is for this document:
- 2009_2366|Sykulska, H. M. and Pike, W. T. and Vijendran, S. - missing "Phoenix Microscope Team" (check if this is how ADS reports it; if so we can just leave it)
affiliations - I don't see any ADS-inspired updates. All content is the grobid affiliations. Can you investigate why it might not be using ADS for this field?
primary_author - Matches previous content, except:
- Now has first initials as well. If we keep this, we should update our schema description.
- Typo for "Sutter": 2017_2201|Suttter, B. (Is typo in original ADS?)
- Authors with "Jr" do not have the "Jr" included. (Maybe this is ok.)
venue - Matches previous output exactly. Looks great!

stevenlujpl commented 2 years ago

@wkiri

title - Looks good in general. The only problem I see is for this document:

2009_2196|Thermal and Evolved Gas Analysis of Magnesium Perchlorate: Implications for Perchlorates in Soils at the Mars Phoenix Landing Site Perchlorates in Soils at the Mars Phoenix Landing Site. - extra words (and period) included. Check if this is how ADS reports it.

Yes, this is how ADS reports it. The person who entered this paper in the ADS database apparently made a mistake for the title. Please see the following screenshot.

Screen Shot 2021-08-27 at 10 47 46 AM

authors - Has initials after each last name instead of before it, but seems ok. The only difference I see is for this document:

2009_2366|Sykulska, H. M. and Pike, W. T. and Vijendran, S. - missing "Phoenix Microscope Team" (check if this is how ADS reports it; if so we can just leave it)

Yes, this is how ADS reports it. The "Phoenix Microscope Team" was omitted from the author list. Please see the following screenshot.

Screen Shot 2021-08-27 at 10 50 15 AM

affiliations - I don't see any ADS-inspired updates. All content is the grobid affiliations. Can you investigate why it might not be using ADS for this field?

The ADS affiliation fields are all placeholder values (e.g. ['-', '-', '-']) in the /proj/mte/results/phx-jsre-v2-ads.jsonl file. These ADS affiliation fields are ignored by the ingest_sqlite.py script. It seems ADS only has placeholder values for LPSC abstracts. I checked the jsonl file (/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/journal.jsonl) for the 37 journal papers, and ADS has valid affiliation records for 34 of them.

primary_author - Matches previous content, except:

Now has first initials as well. If we keep this, we should update our schema description.

The format of the primary_author field varies. I've seen the following three formats. For journal papers, it seems the first name is often entered instead of the first initial (probably because author names are formatted differently in LPSC abstracts and journal papers).

Last name, first initial middle initial
Last name, first name middle initial
Last name, middle initial first name
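
All three formats start with the last name followed by a comma, so if the bare last name is ever needed again (as in the earlier schema), it could be recovered with something like the sketch below. The function name is hypothetical, and values without a comma are returned unchanged.

```python
def last_name(primary_author):
    """Return the text before the first comma, trimmed of whitespace."""
    return primary_author.split(",", 1)[0].strip()
```

Note that this passes through whatever ADS reports, including typos or missing suffixes in the ADS record.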

Typo for "Sutter": 2017_2201|Suttter, B. (Is typo in original ADS?)

Yes, this is a typo in ADS. Please see the screenshot below. Screen Shot 2021-08-27 at 11 30 27 AM

Authors with "Jr" do not have the "Jr" included. (Maybe this is ok.)

It seems "Jr" was ignored by ADS.

Screen Shot 2021-08-27 at 11 33 21 AM

stevenlujpl commented 2 years ago

@wkiri I updated our sqlite DB schema (https://github-fn.jpl.nasa.gov/wkiri/mte/wiki/MTE-SQLite-Database-Schema), changing the description of primary_author from "last name of first author" to "primary author name".

wkiri commented 2 years ago

@stevenlujpl Thank you for checking each of these items. I think since ADS information is propagating correctly, we can plan to resolve any remaining edits by hand (which will be minimal effort). This is an excellent advance in capability!

Thanks also for updating the schema (for primary_author) on the wiki page. I have also updated the description in our bundle template for primary_author fields in *_documents.txml. If this looks good to you, feel free to close this issue.

wkiri / MTE

Update ingest_sqlite.py to use ADS document meta-data fields #8