wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Generate MTE PDS4 bundle with MER-B content #40

Closed by wkiri 10 months ago

wkiri commented 2 years ago

The MER-B JSON file is complete and available at:
/proj/mte/results/mer-b-jsre-v2-ads-gaz.jsonl

The list of the 1635 documents included is in:
/proj/mte/results/pdfpaths-mer-b.list

wkiri commented 2 years ago

The automatically generated annotations are browsable at: https://ml.jpl.nasa.gov/mte/brat/#/mer-b/all-jsre-v2

There are a lot of spurious Targets annotated, and 1454/1635 documents have at least one Target present. As with MER-A, it seems worth pruning the target list to "salient" Target names (excluding Top, Bottom, Greeley, Mariner, Venera, Base, landing, stripes, lost, tracks, etc.) and re-running.
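As a rough illustration of the pruning idea, here is a minimal sketch that drops non-salient names from a candidate target list. The exclusion set is seeded only with the names mentioned above, and `prune_targets` and the case-insensitive matching are assumptions for illustration, not the procedure actually used to build the salient list.

```python
# Hypothetical exclusion list, seeded with the names mentioned in this comment.
NON_SALIENT = {
    "Top", "Bottom", "Greeley", "Mariner", "Venera",
    "Base", "landing", "stripes", "lost", "tracks",
}


def prune_targets(targets, excluded=NON_SALIENT):
    """Return the targets whose names are not excluded (case-insensitive)."""
    excluded_lower = {name.lower() for name in excluded}
    return [t for t in targets if t.lower() not in excluded_lower]


candidates = ["Berry_Bowl", "Top", "Tracks", "Venera"]
print(prune_targets(candidates))  # ['Berry_Bowl']
```

Matching is case-insensitive here so that "Tracks" and "tracks" are treated alike; whether that is appropriate for real target names is a judgment call.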

wkiri commented 2 years ago

I've trained a new NER model using the salient MER-B targets. It is available at:
/proj/mte/trained_models/ner_MERB-property-salient.ser.gz

I am re-parsing these documents, which will take another 3.5 hours :)

wkiri commented 2 years ago

Note that there is one document (2006_2401.pdf) that generates an Out of Memory error from jSRE while being processed. It does generate Contains relations, so I assume it fails on the HasProperty model. It generates a LOT of Contains relations (439) in this document, so this could contribute to the memory issue. For now I will just work with what we get and add any missing HasProperty relations by hand.

wkiri commented 2 years ago

The new output (using salient targets only) is now available at https://ml.jpl.nasa.gov/mte/brat/#/mer-b/all-jsre-v2

This greatly reduced the number of spurious Target annotations. We now have 1166/1635 documents with at least one Target. I moved the previous results to https://ml.jpl.nasa.gov/mte/brat/#/mer-b/all-jsre-v2-alltargets if anyone wants to browse them.

I am now performing a quick triage to remove the remaining spurious targets and reduce the review effort ahead.

wkiri commented 2 years ago

Triage is complete for docs from 2004 and 2005. This reduces the document set for those two years from 241 to 28 documents, which is good news. I will continue triaging the remaining years, but on Monday I will start assigning review docs to the team to get things going for the years that are ready.

wkiri commented 2 years ago

Triage is complete for 2004-2006, leaving 60 documents in that span that need review. I've asked Matt, Leslie, and Raymond to each review 20 documents.

wkiri commented 2 years ago

Reviews are complete for 2004-2006 (n=60), and I've assigned 19 docs from 2007 to Raymond and 21 docs from 2008 and 2009 to Matt.

wkiri commented 2 years ago

Reviews are complete for 2004-2009 (except 2007) (n=81), and I've assigned 18 docs from 2010 to Matt.

wkiri commented 2 years ago

Please note: to address the above question about versioning, PDS made us go to version 2.0 when we added mer2, so we should likely advance the bundle to version 3.0 with mer1. (Individual files will advance versions only as needed.)

wkiri commented 2 years ago

Reviews are complete for 2004-2009 (n=100), Matt is working on 2010 (n=18), and I've assigned 2011-2012 (n=12+15) to Leslie.

wkiri commented 2 years ago

Triage is complete for 2013, yielding 16 more documents (total of 161 for Opportunity so far; 100 are reviewed).

wkiri commented 2 years ago

Triage is complete for 2014, 2015, 2016, and 2017, yielding 59 more documents (total of 220 for Opportunity so far; 100 are reviewed). I've assigned 2013 (n=16) to Raymond.

wkiri commented 2 years ago

Triage is complete for 2018, with 15 more documents (total of 235 for Opportunity; 118 are reviewed).

wkiri commented 2 years ago

Please note: Content under /var (where our brat .ann reviewed files are stored) is not backed up. Therefore, the reviewed results are rsync'd manually to

These locations are backed up. A lot of time goes into reviewing the documents and we don't want to lose that work :)

wkiri commented 2 years ago

Triage is complete for 2019 and 2020, which completes this set! This added 21 more documents, for a total of 256 for Opportunity. Raymond has 2014-2015, Matt has 2016-2017, and I will review 2018-2020.

wkiri commented 2 years ago

Reviews are complete for 2004-2013 and 2016-2020, with just 2014-2015 to go. Getting close!

wkiri commented 2 years ago

Reviews are complete, and I am now going over them to achieve consistency. This is taking some time :) Hopefully finished soon!

wkiri commented 2 years ago

Consistency review is complete for 2004-2006 and 2018-2020. Eleven years left... :)

wkiri commented 2 years ago

Consistency review is now complete for 2004-2008 and 2018-2020.

wkiri commented 2 years ago

Consistency review is now complete for 2004-2013 and 2018-2020.

wkiri commented 1 year ago

Consistency review is complete for 2004-2020! Next I will update the aliases table and proceed to create a MER-B SQLite DB.

wkiri commented 1 year ago

I generated a new JSON file with contents only for MER-B documents with at least one relevant Target.
Files: /proj/mte/results/pdfpaths-mer-b-withtarget.list
JSON: /proj/mte/results/mer-b-jsre-v2-ads-gaz-withtarget.jsonl

The MER-B SQLite database is at /proj/mte/sqlite/mte_mer1_all_v3.0.0.db

I have some remaining quality control checks to do - nearly done!
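For readers unfamiliar with this kind of filtering, the "at least one relevant Target" step can be sketched as below. This is only an illustration: the `targets` field name and the sample records are hypothetical, and the real MTE JSON schema may well differ.

```python
import json


def filter_with_targets(lines, target_key="targets"):
    """Yield parsed JSONL records whose `target_key` field is non-empty.

    `target_key` is a hypothetical field name, not the confirmed MTE schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(target_key):
            yield record


# Made-up sample records, one JSON object per line.
sample = [
    '{"id": "doc1", "targets": ["Berry_Bowl"]}',
    '{"id": "doc2", "targets": []}',
    '{"id": "doc3"}',
]
kept = [r["id"] for r in filter_with_targets(sample)]
print(kept)  # ['doc1']
```

The same function accepts an open file handle in place of `sample`, since it only iterates over lines.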

wkiri commented 1 year ago

I have generated the MTE bundle v3.0 that includes MER-B (Opportunity, mer1) content as well as some minor updates to the MER-A (Spirit, mer2) annotations. The bundle is at /proj/mte/pds-deliveries/bundle_v3.0.0/ and passes validation using v2.1.4 of the validate tool.

Notes:

wkiri commented 1 year ago

I downloaded the latest validate tool (version 2.3.0) and the bundle also passes with that version. I will send this bundle to Scott at the PDS.

wkiri commented 1 year ago

Feedback from Scott VanBommel:

The only comment we have pertains to alias listings. Please provide a comprehensive list of all aliases for each target. For example, Berry_Bowl has multiple aliases but is not listed in the alias table.

Berry_Bowl,_Empty
Berry_Bowl
Berry_Bowl_Empty
Berry_Bowl_Full
Berrybowl
Berrybowl_Empty
Berrybowl_Full
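One way to surface candidate aliases like these automatically is to group target names that differ only in case, underscores, or punctuation. The sketch below is an assumption about how such a check could look, not the method used for the MTE alias table; `normalize` and `group_variants` are illustrative names.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase the name and drop underscores, commas, and whitespace."""
    return re.sub(r"[_,\s]+", "", name.lower())


def group_variants(names):
    """Group names sharing a normalized form; keep groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: sorted(members) for key, members in groups.items()
            if len(members) > 1}


observed = [
    "Berry_Bowl,_Empty", "Berry_Bowl", "Berry_Bowl_Empty", "Berry_Bowl_Full",
    "Berrybowl", "Berrybowl_Empty", "Berrybowl_Full",
]
for key, members in group_variants(observed).items():
    print(key, members)
```

This only catches spelling variants. Deciding whether, say, Berry_Bowl_Empty and Berry_Bowl_Full belong to the same target as Berry_Bowl still needs a human reviewer.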

wkiri commented 1 year ago

I went through the MER-B target list and added the aliases that were missing. I generated a new SQLite database at /proj/mte/sqlite/mte_mer1_all_v3.1.0.db; it is linked to /proj/mte/sqlite/mte_mer1_all.db and is therefore searchable at https://ml.jpl.nasa.gov/mte/mer1/

I checked the bundle (now v3.1) with validate and delivered it to Scott. The bundle files are in /proj/mte/pds-deliveries/bundle_v3.1.0/

wkiri commented 10 months ago

The new bundle was posted at the PDS Geosciences node on Sept. 14, 2022.