wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Process journal papers and add content to MTE #45

Open · wkiri opened this issue 2 years ago

wkiri commented 2 years ago

The first step is to try parsing the journal documents @stevenlujpl already downloaded. For some documents, we may need to process them multiple times, once for each mission whose targets are mentioned (see issue #22).

wkiri commented 2 years ago

Note: the MTE schema currently has an abstract column in the documents table. Journal papers do not have an abstract number, so we should decide how to handle this.

We will also need to decide how to generate a doc_id, which is currently year_abstract. Perhaps it should be year_venue_paperid, where venue would be a short form like lpsc or jgr, and paperid would be the same as abstract for LPSC publications and something like volume-number-page for journal papers. Then this paperid could take the place of the abstract column in documents (we'd want to rename the column as needed).
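
To illustrate the proposal (just a sketch; the helper and its arguments are placeholders, not actual MTE code):

```python
# Sketch of the proposed year_venue_paperid scheme (placeholder code).
def make_doc_id(year, venue, paperid):
    # LPSC example:    make_doc_id(2015, 'lpsc', '1234')        -> '2015_lpsc_1234'
    # Journal example: make_doc_id(2003, 'jgr', '108-E12-8092') -> '2003_jgr_108-E12-8092'
    return f'{year}_{venue}_{paperid}'
```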

stevenlujpl commented 2 years ago

I categorized the 18 papers that passed the initial filtering process by mission to ensure that we have at least one paper for each of the MERA, MERB, MPF, and PHX missions. Some papers appear in more than one mission list. For example, 2003JE002125.pdf discusses the MER landing sites, so it is listed under both MERA and MERB.

MERA:

MERB:

MPF:

PHX:

wkiri commented 2 years ago

Thanks, Steven! It looks like there are 14 unique papers here. Are the other 4 that passed the filter worth including?

stevenlujpl commented 2 years ago

@wkiri The other 4 papers are MSL papers. Sorry that I forgot to mention them.

MSL:

wkiri commented 2 years ago

@stevenlujpl Great, thanks for the clarification!

stevenlujpl commented 2 years ago

@wkiri I tested the changes I made to the MTE codebase, and they seem to be working fine. Please see the commits above for details about the code changes. The changes are currently checked into the issue45-journal branch. A summary of the changes is below:

  1. doc_id field: The doc_id field is now used to store filenames (without extension) for journal papers. It could instead store document indices; if you think document indices make more sense than filenames, please let me know and I will update the code.
  2. abstract field: The abstract field is ignored for journal papers. With the changes I made, the abstract field will be an empty string in the DB, and it will not be included as a column in the exported CSV file of the PDS4 bundle.
  3. doc_url and year fields: The doc_url and year fields are currently handled in the same way as the abstract field.
  4. I added a new CLI argument venue to the bundle generation script. The venue argument is used to distinguish LPSC documents from other documents. If the input documents are from LPSC, the documents.csv of the PDS4 bundle will have 8 columns (i.e., all 8 fields from the documents table of the DB are exported to the CSV file). If the input documents are from a venue other than LPSC, then documents.csv will only have 5 columns (the abstract, doc_url, and year fields are skipped); see the sketch after this list.
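
To make item 4 concrete, here is a rough sketch of the column-selection logic (the column names here are illustrative assumptions, not necessarily the actual schema):

```python
# Rough sketch of the venue-based column selection described in item 4.
# Column names are illustrative; the actual documents table may differ.
ALL_COLUMNS = ['doc_id', 'abstract', 'title', 'authors', 'primary_author',
               'affiliations', 'doc_url', 'year']            # 8 fields
LPSC_ONLY = {'abstract', 'doc_url', 'year'}

def csv_columns(venue):
    if venue == 'lpsc':
        return ALL_COLUMNS                                    # all 8 columns
    return [c for c in ALL_COLUMNS if c not in LPSC_ONLY]     # 5 columns
```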

I tested this approach using the 18 journal papers that passed the initial filtering process. The MPF jsonl, DB, and PDS4 bundle files can be found at the following locations in my /home dir. The bundle validate tool reported 0 errors.

/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.jsonl
/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.db
/home/youlu/MTE/working_dir/process_journals/mpf/mars_target_encyclopedia

This approach requires only a few minor changes (as shown in the commits above) because it doesn't require any changes to the DB schema. The drawback is that the abstract field of the documents table doesn't really apply to journal papers. I think this may be okay because I consider the DB files to be intermediate products; the PDS4 bundles, as the final delivered products, don't have the abstract field in the CSV file. Please let me know what you think.

The current MTE website code won't work with the DB files generated from journal papers, primarily because of the lack of the doc_url field in the documents table. I will update the website code if we are planning to use it for journal DBs. Please let me know. Thanks.

wkiri commented 2 years ago

@stevenlujpl This is great progress! Thank you!

Can you place the generated .jsonl files under /proj/mte/? That way I can generate brat review pages for them.

The changes seem fine in general. I have two questions:

  1. I think that it would be good to retain the year field. Is this possible?
  2. Can we populate doc_url with the journal paper DOI (which can be formatted as a URL if it isn't already)? I think this is included in the ADS results. While not available for LPSC, it should be available for journal papers.

It seems I overlooked that "abstract" is not included in the final .csv files that are delivered. I should correct this in the schema diagram and in the README.

I agree that the sqlite DB is an intermediate product so it's ok for it to have more information even if not used later. As you note, however, the website does use the DB directly. It makes sense to prioritize getting the journal paper content into PDS4 bundles first, and if time remains, then update the website (but it's not on the critical path for the time remaining).

stevenlujpl commented 2 years ago

  1. I think that it would be good to retain the year field. Is this possible?

It should be possible to retain the year field for documents indexed in the ADS database.

  2. Can we populate doc_url with the journal paper DOI (which can be formatted as a URL if it isn't already)? I think this is included in the ADS results. While not available for LPSC, it should be available for journal papers.

From the ADS website search results, it seems the DOI fields are already formatted as URLs. I will double-check the format of the DOIs returned by querying the ADS database directly. These are great suggestions. I will work on them now.

I copied the .jsonl files to the following locations in /proj/mte/. Please let me know if you run into any problems generating the brat review pages.

/proj/mte/data/steven_working_dir/process_journals/mera/mera_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/merb/merb_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/mpf/mpf_init_filter.jsonl
/proj/mte/data/steven_working_dir/process_journals/phx/phx_init_filter.jsonl

stevenlujpl commented 2 years ago

@wkiri I couldn't test the update_sqlite.py step (updating the DB with human-reviewed brat annotations) yet, but I don't foresee any problems because the DB schema hasn't changed.

stevenlujpl commented 2 years ago

@wkiri Do you know how to form a URL from a DOI? The DOIs returned by the ADS database aren't formatted as URLs. For example, the DOI returned for "Analysis of MOLA data for the Mars Exploration Rover landing sites" is 10.1029/2003JE002125. How do I convert this DOI into a URL?

stevenlujpl commented 2 years ago

I just googled, and it seems we can use the pattern https://doi.org/xxxxx (where xxxxx is the DOI) to convert a DOI to a URL.
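
In code, the conversion is a one-liner (a minimal sketch; the doi.org resolver prefix is standard, but the helper name is mine):

```python
def doi_to_url(doi):
    """Convert a bare DOI into a resolvable URL via the doi.org resolver."""
    if doi.startswith(('http://', 'https://')):
        return doi  # already formatted as a URL
    return 'https://doi.org/' + doi

# doi_to_url('10.1029/2003JE002125') -> 'https://doi.org/10.1029/2003JE002125'
```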

stevenlujpl commented 2 years ago

@wkiri I've added the year and doc_url back to the DB and the exported documents.csv files. Please see the following files as examples. Please let me know if you find any problems. Thanks.

/home/youlu/MTE/working_dir/process_journals/mpf/mpf_init_filter.db
/home/youlu/MTE/working_dir/process_journals/mpf/mars_target_encyclopedia/data_mpf/documents.csv
wkiri commented 2 years ago

I created a brat site for the MPF JSONL file here: https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals/

As noted above, probably the only document to review for MPF would be this one: https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals/2016JE005079 Unfortunately, I do not see any named MPF targets (although targets from several other missions are mentioned). I will check again to see whether the 8 MPF papers Matt previously shared were in the set of 35 that were checked, or if they should be added.

wkiri commented 2 years ago

The PHX output is available here: https://ml.jpl.nasa.gov/mte/brat/#/phx/journals There are 6 relevant documents. These three have named targets and are ready for review:

stevenlujpl commented 2 years ago

@wkiri Do we need to add a few more journal papers for MPF?

wkiri commented 2 years ago

Yes, we should try to add the 4 JGR + 1 Science papers that are referenced in https://github.com/wkiri/MTE/tree/master/ref/MPF#readme

I had two of them handy (bell-mpf-00.pdf, golombek-mpf-99.pdf) and put them in the JGR directory. I also added golombek-mpf-00.pdf, greeley-mpf-00.pdf, landis-mpf-00.pdf, and morris-mpf-00.pdf in case they have useful content. Could you process these with your MPF run? You could generate a separate .jsonl file for this batch; no need to run them all together with the earlier docs. See: /proj/mte/data/corpus-journals/pdf/jgr-planets/

wkiri commented 2 years ago

The MER-A output is available here: https://ml.jpl.nasa.gov/mte/brat/#/mer-a/journals There are 4 relevant documents. These three have named targets and are ready for review:

wkiri commented 2 years ago

The MER-B output is available here: https://ml.jpl.nasa.gov/mte/brat/#/mer-b/journals There are 6 relevant documents. All six have named targets, but the ones in 2003JE002125 are spurious, so focus on these five for review:

wkiri commented 2 years ago

To make review easier, I have pruned the documents for each mission in the "journals" directory under brat to only include the documents to be reviewed.

stevenlujpl commented 2 years ago

@wkiri I've processed the 6 MPF documents you added. 5 documents were successfully processed, and one (golombek-mpf-99.pdf) failed due to a jSRE out-of-memory problem.

I've copied the jsonl file to the following location:

/proj/mte/results/journals/mpf_2nd.jsonl

I also copied the jsonl files from the initial MERA, MERB, MPF, and PHX runs to /proj/mte/results/journals/.

wkiri commented 2 years ago

@stevenlujpl Thank you, that was fast! I'll look at these tomorrow.

wkiri commented 2 years ago

@stevenlujpl These look great! They are now available at https://ml.jpl.nasa.gov/mte/brat/#/mpf/journals

stevenlujpl commented 2 years ago

@wkiri Great! Thanks for sharing the brat URL. There are targets and relations, which look promising. Please let me know if you need help reviewing them (even after this week).

stevenlujpl commented 2 years ago

@wkiri I have updated the MTE parser and bundle generation scripts based on what we discussed on Monday. Please see the following steps for generating a PDS4 bundle with both LPSC and journal papers:

  1. Run lpsc_parser.py with LPSC papers to generate lpsc.jsonl
  2. Run paper_parser.py with journal papers to generate journal.jsonl
  3. Manually concatenate lpsc.jsonl and journal.jsonl (i.e., cat lpsc.jsonl > combined.jsonl and then cat journal.jsonl >> combined.jsonl)
  4. Run ingest_sqlite.py with combined.jsonl to generate combined.db with both LPSC and journal papers. Please note that the venue CLI argument has been removed as we discussed. Now, ingest_sqlite.py relies on the parser list field (i.e., the rec['metadata']['mte_parser'] field) to determine whether the document being processed is from LPSC or another venue; see the sketch after this list.
  5. Run generate_pds4_bundle.py with the combined.db DB file to generate the MTE PDS4 bundle.
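
For step 4, the dispatch on the parser list field looks roughly like this (a sketch only; the exact parser name recorded in the field is an assumption here):

```python
import json

def is_lpsc(rec):
    # The pipeline records which parser produced each record; the name
    # checked here ('lpsc_parser') is assumed for illustration.
    return 'lpsc_parser' in rec['metadata'].get('mte_parser', [])

with open('combined.jsonl') as f:
    for line in f:
        rec = json.loads(line)
        if is_lpsc(rec):
            pass  # LPSC doc: keep the abstract, doc_url, and year fields
        else:
            pass  # journal doc: doc_url is derived from the DOI instead
```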

I tested the scripts with 5 LPSC papers and 1 journal paper, and verified the results manually and with the PDS4 validate tool. I didn't find any problems. I am attaching the jsonl, DB, and bundle files in the following .zip file. Please take a look and let me know if you find any problems. Thanks.

Archive.zip

wkiri commented 2 years ago

@stevenlujpl This sounds great!!! Thanks for pulling it all together.

I haven't looked at the .zip file yet but will try to do so tomorrow.

For the full process, I believe there will be 2 steps between 4 and 5 in which we run update_sqlite.py twice (once using the manually reviewed LPSC docs, once using the manually reviewed journal docs, only because they are in different directories... I guess we could put them in one directory if that makes this easier). Can you take a look and see whether you think we need any changes to update_sqlite.py? You could test it with the .ann files I generated for e.g. MPF, which are not yet reviewed but are in the correct format. However, if you are out of time for this task this week, our "test" can take place when we actually merge the two sets of annotations and find out if it works :)

wkiri commented 2 years ago

The per-mission LPSC .jsonl files are:

See /proj/mte/results/README.txt for details on each file. Note that the MER-A file was generated after we identified the 397 documents with at least one Target. We didn't have that list yet for MER-B, so its file contains the entire set of 1635 candidate documents. However, many should be omitted at the remove-orphans step of update_sqlite.py. (This is the step that I think could be problematic if run twice.)

stevenlujpl commented 2 years ago

@wkiri I've added the script to insert mte_parser fields into an existing jsonl file. I also processed the per-mission LPSC .jsonl files to insert mte_parser fields. The updated per-mission LPSC .jsonl files are at the following locations:

Please take a look and let me know if you find any problems. Thanks.
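
For reference, the core of that insertion script is roughly the following (a sketch; the actual script may differ in its details, and the parser name passed in is whatever parser produced the file):

```python
import json

def add_mte_parser(in_path, out_path, parser_name):
    """Tag each record in a .jsonl file with the parser that produced it."""
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            rec = json.loads(line)
            rec.setdefault('metadata', {})['mte_parser'] = [parser_name]
            fout.write(json.dumps(rec) + '\n')

# e.g., add_mte_parser('mera_init_filter.jsonl',
#                      'mera_init_filter_tagged.jsonl', 'lpsc_parser')
```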