wkiri closed this issue 3 years ago
Note: ADS provides authors as a list, so we'll want to format that into a (comma-separated) string.
Steven also wants to handle the new format for "rel" field when there are no relations (empty list instead of missing field).
@wkiri I have updated the ingest_sqlite.py script to use ADS fields if available. Please see the code changes in this commit (https://github.com/wkiri/MTE/commit/7bff89f5400e023a0399dc850097d65ba3b98771). I didn't have to do anything for the rel field, as the code already handles that case.
Please see below for the summary of the changes I made:
Title field: if ads:title exists, we will use the title returned by the ADS database. Otherwise, we will use the one extracted by grobid. If the grobid title field does not exist either, an empty string will be inserted into the DB.
Authors field: the logic is similar to that of the title field. The authors returned from the ADS database are stored in a list. Instead of converting the list into a comma-separated string as you suggested above, I converted it into an "and"-separated string. We shouldn't use commas because author names sometimes contain commas (e.g., in LPSC abstracts, author names are often in the format "last name, first initial").
Primary author field: we directly use the ads:primary_author field, as there is no grobid field for primary author. If the ads:primary_author field doesn't exist, the existing logic for extracting the primary author from the authors field is used unchanged.
Affiliation field: if the ads:affiliation field exists and doesn't contain a hyphen (the ADS database uses hyphens as placeholders for empty affiliation fields), we will use the ads:affiliation field. Otherwise, we use the grobid field. The affiliations returned from the ADS database are also a list, and I am converting the list into an "and"-separated string.
Venue field: we directly use the ads:pub_venue field, as there is no grobid field for venue. If ads:pub_venue doesn't exist, we will insert an empty string into the DB.
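The fallback rules summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the record is assumed to be a flat dict, the keys "ads:author", "grobid:title", "grobid:authors", and "grobid:affiliations" are hypothetical placeholders (only ads:title, ads:affiliation, and ads:pub_venue are named in this thread), and the placeholder test assumes that "contains a hyphen" means any list entry is a bare "-". The primary author field is left out because its existing fallback logic is not shown here.

```python
def select_fields(meta):
    """Pick ADS values when available, otherwise fall back to grobid values.

    `meta` is assumed to be a flat dict of metadata for one document.
    All "grobid:*" keys and "ads:author" are hypothetical names.
    """
    # Title: ADS first, then grobid, then an empty string.
    title = meta.get("ads:title") or meta.get("grobid:title") or ""

    # Authors: ADS returns a list; join with " and " rather than commas,
    # because names such as "Pike, W. T." already contain commas.
    authors = meta.get("ads:author") or meta.get("grobid:authors") or ""
    if isinstance(authors, (list, tuple)):
        authors = " and ".join(authors)

    # Affiliations: ADS value is assumed to be a list of strings.
    # A bare "-" entry is treated as the ADS placeholder for "empty".
    ads_affil = meta.get("ads:affiliation")
    if ads_affil and not any(a.strip() == "-" for a in ads_affil):
        affiliations = " and ".join(ads_affil)
    else:
        affiliations = meta.get("grobid:affiliations", "")

    # Venue: there is no grobid counterpart, so default to an empty string.
    venue = meta.get("ads:pub_venue", "")

    return {
        "title": title,
        "authors": authors,
        "affiliations": affiliations,
        "venue": venue,
    }
```

Comparing each entry to "-" exactly, rather than searching for a hyphen anywhere in the string, avoids rejecting real affiliations that happen to contain a hyphen.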
@wkiri Please feel free to close this issue after you test the ingest_sqlite.py script. Thanks.
@stevenlujpl I have tested the new ingest_sqlite.py on the PHX database (total: 36 documents). Please see these text files to compare old/new output:
/proj/mte/sqlite/phx-d-<fieldname>-1.2.5 (used hand-edited /proj/mte/results/phx-jsre-edited-titles-authors-v2.jsonl as the input JSON file)
/proj/mte/sqlite/phx-d-<fieldname>-1.2.6-2 (using raw output from parser, /proj/mte/jsre-v2.jsonl as the input JSON file)
To reproduce or re-run (you also have to run update_sqlite.py so it prunes the docs to only the relevant ones):
$ ./ingest_sqlite.py $JSON_FILE -d $DB_FILE -m $MISSION -v lpsc > ingest-DB-$MISSION-$VERSION.log
$ ./update_sqlite.py -r $REVIEWER /proj/mte/results/$ANN_DIR $DB_FILE $MISSION -ro > update-DB-$MISSION-$VERSION.log
$ for f in title authors affiliations primary_author venue doc_url ; do sqlite3 $DB_FILE "select doc_id,$f from documents;" > phx-d-$f-1.2.6-2 ; done
The goal here was to see if the new ingest_sqlite.py could automatically do some of the work we previously had to do by hand to generate output. Where there are discrepancies, it is worth first checking if this is something that can be fixed in ingest_sqlite.py automatically. But if not, it is ok to do a post-ingestion hand-editing pass with my previously developed editing script. It was very time-consuming when I had to do it for ALL docs. It is not bad if I just do it for a handful of docs.
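To find the handful of documents that still need hand-editing, the old and new dump files can be compared by doc_id. The sketch below assumes each dump line has the form doc_id|value (the default sqlite3 output of the select shown above) and that values contain no embedded newlines; the function names are made up for illustration.

```python
def parse_dump(lines):
    """Map doc_id -> value from lines of the form 'doc_id|value'."""
    rows = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first '|' only, so values may contain '|'.
        doc_id, _, value = line.partition("|")
        rows[doc_id] = value
    return rows


def diff_dumps(old_lines, new_lines):
    """Return {doc_id: (old_value, new_value)} for entries that differ.

    A doc_id present in only one dump is reported with None on the other side.
    """
    old, new = parse_dump(old_lines), parse_dump(new_lines)
    return {
        doc_id: (old.get(doc_id), new.get(doc_id))
        for doc_id in sorted(set(old) | set(new))
        if old.get(doc_id) != new.get(doc_id)
    }


# Example use with two dump files (paths follow the naming used above):
# with open("phx-d-title-1.2.5") as old, open("phx-d-title-1.2.6-2") as new:
#     for doc_id, (before, after) in diff_dumps(old, new).items():
#         print(doc_id, before, after, sep="\n  ")
```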
Title field: These are now in all-caps, which is perhaps okay if that is how they appear in ADS, or we could call something like title() on them to improve readability. Most of them look good. I am capturing the ones that need further editing here in case it inspires any additional updates to the ADS parsing code (I do not know if the errors are due to failing to find the content in ADS and falling back to grobid, or some other explanation - perhaps the ingest .log file could be enriched to capture this info):
Authors field: These do not look quite right - they are not and-separated as described above.
Primary author field: 4 are missing, 3 are wrong, 3 might be improvements
Affiliations field: I don't see any ADS-inspired updates. All content is the grobid affiliations.
Venue field: Matches previous output exactly. Looks great!
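For the all-caps titles mentioned above, one option is to apply title() only when a title is entirely upper-case, so that titles that already have mixed case are left alone. This is a small sketch with a made-up function name, not something ingest_sqlite.py is known to do.

```python
def soften_caps(title):
    """Title-case a string only if all of its cased characters are upper-case."""
    # str.isupper() is False for empty strings and for mixed-case text,
    # so those are returned unchanged.
    return title.title() if title.isupper() else title
```

Note that str.title() is a blunt tool: it lowercases acronyms ("TEGA" becomes "Tega"), capitalizes small words such as "of" and "and", and treats apostrophes as word boundaries ("MARS'S" becomes "Mars'S"), so some hand-editing may still be needed.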
Actually, I am realizing that the JSON file I used pre-dates the updates to lpsc_parser.py, which is the step that actually calls ADS. I will need to re-run the parser for the PHX docs to fully test this. Please wait until I have a chance to do so, and I will share an update then.
@stevenlujpl I generated a new JSONL file for PHX using lpsc_parser.py:
/proj/mte/results/phx-jsre-v2-ads.jsonl
Please see these text files to compare old/new output:
/proj/mte/sqlite/phx-d-<fieldname>-1.2.5 (used hand-edited /proj/mte/results/phx-jsre-edited-titles-authors-v2.jsonl as the input JSON file)
/proj/mte/sqlite/phx-d-<fieldname>-1.2.6 (using raw output from parser + ADS, /proj/mte/results/phx-jsre-v2-ads.jsonl as the input JSON file)
title - Looks good in general. The only problem I see is for this document:
2009_2196|Thermal and Evolved Gas Analysis of Magnesium Perchlorate: Implications for Perchlorates in Soils at the Mars Phoenix Landing Site Perchlorates in Soils at the Mars Phoenix Landing Site.
- extra words (and period) included. Check if this is how ADS reports it.
authors - Has initials after each last name instead of before it, but seems ok. The only difference I see is for this document:
2009_2366|Sykulska, H. M. and Pike, W. T. and Vijendran, S.
- missing "Phoenix Microscope Team" (check if this is how ADS reports it; if so we can just leave it)
affiliations - I don't see any ADS-inspired updates. All content is the grobid affiliations. Can you investigate why it might not be using ADS for this field?
primary_author - Matches previous content, except:
2017_2201|Suttter, B.
(Is typo in original ADS?)
venue - Matches previous output exactly. Looks great!
@wkiri
title - Looks good in general. The only problem I see is for this document:
2009_2196|Thermal and Evolved Gas Analysis of Magnesium Perchlorate: Implications for Perchlorates in Soils at the Mars Phoenix Landing Site Perchlorates in Soils at the Mars Phoenix Landing Site. - extra words (and period) included. Check if this is how ADS reports it.
Yes, this is how ADS reports it. The person who entered this paper in the ADS database apparently made a mistake in the title. Please see the following screenshot.
authors - Has initials after each last name instead of before it, but seems ok. The only difference I see is for this document:
2009_2366|Sykulska, H. M. and Pike, W. T. and Vijendran, S. - missing "Phoenix Microscope Team" (check if this is how ADS reports it; if so we can just leave it)
Yes, this is how ADS reports it. The "Phoenix Microscope Team" was omitted from the author list. Please see the following screenshot.
affiliations - I don't see any ADS-inspired updates. All content is the grobid affiliations. Can you investigate why it might not be using ADS for this field?
The ADS affiliation fields are all placeholder values (e.g. ['-', '-', '-']) in the /proj/mte/results/phx-jsre-v2-ads.jsonl file. These ADS affiliation fields are ignored by the ingest_sqlite.py script. It seems ADS only has placeholder values for LPSC abstracts. I checked the jsonl file (/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/journal.jsonl) for the 37 journal papers, and ADS has valid affiliation records for 34 of them.
primary_author - Matches previous content, except:
Now has first initials as well. If we keep this, we should update our schema description.
The format of the primary_author field varies. I've seen the following three formats. It seems that for journal papers, they often entered the first name instead of the first initial (this is probably because author names are formatted differently in LPSC abstracts and journal papers).
Typo for "Sutter": 2017_2201|Suttter, B. (Is typo in original ADS?)
Yes, this is a typo in ADS. Please see the screenshot below.
Authors with "Jr" do not have the "Jr" included. (Maybe this is ok.)
It seems "Jr" was ignored by ADS.
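A check like the affiliation count reported above (34 of 37 journal papers with valid ADS affiliations) could be scripted along these lines. The record layout is an assumption: the sketch looks for ads:affiliation either at the top level of each JSON line or under a hypothetical "metadata" key, and it treats a list as valid only if it is non-empty and has no bare "-" entries.

```python
import json


def has_valid_affiliation(affils):
    """True if the list is non-empty and contains no bare '-' placeholders."""
    return bool(affils) and not any(str(a).strip() == "-" for a in affils)


def count_valid_affiliations(lines, key="ads:affiliation"):
    """Return (valid, total) over an iterable of JSONL lines."""
    valid = total = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: metadata may be nested under "metadata" or be top-level.
        meta = record.get("metadata", record)
        total += 1
        if has_valid_affiliation(meta.get(key)):
            valid += 1
    return valid, total


# Example use (path as given above):
# with open("journal.jsonl") as fh:
#     print("%d of %d records have ADS affiliations" % count_valid_affiliations(fh))
```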
@wkiri I updated our sqlite DB schema (https://github-fn.jpl.nasa.gov/wkiri/mte/wiki/MTE-SQLite-Database-Schema) for the primary_author field from "last name of first author" to "primary author name".
@stevenlujpl Thank you for checking each of these items. I think since ADS information is propagating correctly, we can plan to resolve any remaining edits by hand (which will be minimal effort). This is an excellent advance in capability!
Thanks also for updating the schema (for primary_author) on the wiki page. I have also updated the description in our bundle template for primary_author fields in *_documents.txml. If this looks good to you, feel free to close this issue.
The new parser-indexer-py populates several new fields with document metadata using the ADS service. These fields start with "ads:". We want to update ingest_sqlite.py to use these fields, when available. Otherwise, fall back to the "grobid:" fields:
https://github.com/wkiri/MTE/blob/c7fd59782ee25ce96ca0280da8c54a90dcec8798/src/ingest_sqlite.py#L38-L41