wkiri closed this issue 10 months ago
The MER-B JSON file is complete and available at:
/proj/mte/results/mer-b-jsre-v2-ads-gaz.jsonl
The list of 1635 documents included is in
/proj/mte/results/pdfpaths-mer-b.list
The automatically generated annotations are browsable at
https://ml.jpl.nasa.gov/mte/brat/#/mer-b/all-jsre-v2
There are a lot of spurious Targets annotated: 1454/1635 documents have at least one Target present. As with MER-A, it seems worth pruning the target list to "salient" Target names (excluding Top, Bottom, Greeley, Mariner, Venera, Base, landing, stripes, lost, tracks, etc.) and re-running.
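As a minimal sketch of the pruning idea (not the actual MTE code), the list of target names could be filtered against a stoplist before retraining. The function name `prune_targets` and the exact-match comparison are assumptions made for illustration; the stoplist contains only the names mentioned above.

```python
# Hypothetical stoplist: only the non-salient names called out in this thread.
NON_SALIENT = {
    "Top", "Bottom", "Greeley", "Mariner", "Venera",
    "Base", "landing", "stripes", "lost", "tracks",
}


def prune_targets(targets, stoplist=NON_SALIENT):
    """Return the target names not in the stoplist, preserving input order.

    Matching is exact (case-sensitive); that is an assumption of this sketch.
    """
    return [name for name in targets if name not in stoplist]


# Berry_Bowl is a MER-B target name that appears later in this thread.
salient = prune_targets(["Berry_Bowl", "Top", "tracks", "Greeley"])
print(salient)
```

Exact matching keeps the sketch conservative: a capitalized variant such as "Tracks" would survive the filter and need a separate decision.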
I've trained a new NER model using the salient MER-B targets. It is available at
/proj/mte/trained_models/ner_MERB-property-salient.ser.gz
I am re-parsing these documents, which will take another 3.5 hours :)
Note that there is one document (2006_2401.pdf) that generates an Out of Memory error from jSRE while being processed. It does generate Contains relations, so I assume it fails on the HasProperty model. It generates a LOT of Contains relations (439) in this document, so this could contribute to the memory issue. For now I will just work with what we get and add any missing HasProperty relations by hand.
The new output (using salient targets only) is now available at
https://ml.jpl.nasa.gov/mte/brat/#/mer-b/all-jsre-v2
This greatly reduced the number of spurious Target annotations. We now have 1166/1635 documents with at least one Target. I moved the previous results to
https://ml.jpl.nasa.gov/mte/brat/#/mer-b/all-jsre-v2-alltargets
if anyone wants to browse them.
I am now performing a quick triage to remove the remaining spurious Targets and reduce the review effort ahead.
Triage is complete for docs from 2004 and 2005. This reduces the document set for those two years from 241 to 28 documents, which is good news. I will continue triaging the remaining years, but on Monday I will start assigning review docs to the team to get things going for the years that are ready.
Triage is complete for 2004-2006, leaving 60 documents in that span that need review. I've asked Matt, Leslie, and Raymond to each review 20 documents.
Reviews are complete for 2004-2006 (n=60), and I've assigned 19 docs from 2007 to Raymond and 21 docs from 2008 and 2009 to Matt.
Reviews are complete for 2004-2009 (except 2007) (n=81), and I've assigned 18 docs from 2010 to Matt.
Please note: to address the above question about versioning, PDS made us go to version 2.0 when we added mer2, so we should likely advance the bundle to version 3.0 with mer1. (Individual files will advance versions only as needed.)
Reviews are complete for 2004-2009 (n=100), Matt is working on 2010 (n=18), and I've assigned 2011-2012 (n=12+15) to Leslie.
Triage is complete for 2013, yielding 16 more documents (total of 161 for Opportunity so far; 100 are reviewed).
Triage is complete for 2014, 2015, 2016, and 2017, yielding 59 more documents (total of 220 for Opportunity so far; 100 are reviewed). I've assigned 2013 (n=16) to Raymond.
Triage is complete for 2018, with 15 more documents (total of 235 for Opportunity; 118 are reviewed).
Please note: Content under /var (where our brat .ann reviewed files are stored) is not backed up. Therefore, the reviewed results are rsync'd manually to
/proj/mte/results/<mission>-reviewed+properties-v2/
/proj/mte/results/<mission>-journals/
These locations are backed up. A lot of time goes into reviewing the documents and we don't want to lose that work :)
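The manual copy could be wrapped in a small helper along these lines. This is only a sketch: the source directory under /var is not given in this thread, so `/var/path/to/brat/data/mer-b` is a made-up placeholder, and using `mer-b` for `<mission>` is also an assumption. The helper only builds the rsync command; it does not run it.

```python
import shlex


def build_rsync_command(source_dir, mission, dest_root="/proj/mte/results"):
    """Build (but do not run) an rsync command for the reviewed .ann files.

    The trailing slash on the source tells rsync to copy the directory's
    contents rather than the directory itself.
    """
    src = source_dir.rstrip("/") + "/"
    dest = f"{dest_root}/{mission}-reviewed+properties-v2/"
    # -a: archive mode (recursive, preserves times and permissions); -v: verbose
    return ["rsync", "-av", src, dest]


# Hypothetical source path and mission name, for illustration only.
cmd = build_rsync_command("/var/path/to/brat/data/mer-b", "mer-b")
print(shlex.join(cmd))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would perform the copy; printing it first makes it easy to confirm the paths before anything is written.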
Triage is complete for 2019 and 2020, which completes this set! This added 21 more documents, for a total of 256 for Opportunity. Raymond has 2014-2015, Matt has 2016-2017, and I will review 2018-2020.
Reviews are complete for 2004-2013 and 2016-2020, with just 2014-2015 to go. Getting close!
Reviews are complete, and I am now going over them to achieve consistency. This is taking some time :) Hopefully finished soon!
Consistency review is complete for 2004-2006 and 2018-2020. Eleven years left... :)
Consistency review is now complete for 2004-2008 and 2018-2020.
Consistency review is now complete for 2004-2013 and 2018-2020.
Consistency review is complete for 2004-2020! Next I will update the aliases table and proceed to create a MER-B SQLite DB.
I generated a new JSON file with contents only for MER-B documents with at least one relevant Target.
Files: /proj/mte/results/pdfpaths-mer-b-withtarget.list
JSON: /proj/mte/results/mer-b-jsre-v2-ads-gaz-withtarget.jsonl
The MER-B SQLite database is at
/proj/mte/sqlite/mte_mer1_all_v3.0.0.db
I have some remaining quality control checks to do - nearly done!
I have generated the MTE bundle v3.0 that includes MER-B (Opportunity, mer1) content as well as some minor updates to the MER-A (Spirit, mer2) annotations. The bundle is at
/proj/mte/pds-deliveries/bundle_v3.0.0/
and passes validation using v2.1.4 of the validate tool.
Notes:
The changes relative to the previous bundle can be viewed with
diff -r bundle_v2.0.0/ bundle_v3.0.0/ | less
I updated generate_pds4_bundle.py to require that the collection name is present in each LID_VID added to a given inventory file. This was needed since mer1 and mer2 are subdirectories in the same data_mer directory, and without this check the script was walking all subdirs and putting files from both missions into the mer1 file (which is generated second).
I downloaded the latest validate tool (version 2.3.0) and the bundle also passes with that version. I will send this bundle to Scott at the PDS.
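The inventory check described in the notes can be illustrated as follows. This is not the code in generate_pds4_bundle.py; it is a sketch that relies on the standard PDS4 LID layout (`urn:nasa:pds:<bundle>:<collection>:<product>`, with the version after `::`). It compares the collection field exactly, which is stricter than a substring test. The bundle, collection, and product names in the example are hypothetical.

```python
def collection_of(lid_vid):
    """Return the collection field of a PDS4 LID_VID, or None if absent."""
    lid = lid_vid.split("::", 1)[0]   # drop the version (VID) part
    parts = lid.split(":")
    # parts: ['urn', 'nasa', 'pds', bundle, collection, product]
    return parts[4] if len(parts) >= 5 else None


def filter_for_inventory(lid_vids, collection):
    """Keep only the LID_VIDs that belong to the given collection."""
    return [lv for lv in lid_vids if collection_of(lv) == collection]


# Hypothetical identifiers, for illustration only.
all_lids = [
    "urn:nasa:pds:example_bundle:data_mer1:doc_a::1.0",
    "urn:nasa:pds:example_bundle:data_mer2:doc_b::1.1",
]
mer1_only = filter_for_inventory(all_lids, "data_mer1")
print(mer1_only)
```

With a filter like this, walking every subdirectory is harmless, because products from the other mission are dropped before they reach the inventory file.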
Feedback from Scott VanBommel:
The only comment we have pertains to alias listings. Please provide a comprehensive list of all aliases for each target. For example, Berry_Bowl has multiple aliases but is not listed in the alias table.
Berry_Bowl,_Empty Berry_Bowl Berry_Bowl_Empty Berry_Bowl_Full Berrybowl Berrybowl_Empty Berrybowl_Full
I went through the MER-B target list and added aliases that were missing. I generated a new sqlite database in:
/proj/mte/sqlite/mte_mer1_all_v3.1.0.db
and it is linked to
/proj/mte/sqlite/mte_mer1_all.db
and therefore searchable at
https://ml.jpl.nasa.gov/mte/mer1/
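A check for target names missing from the alias table could look something like the sketch below. The table and column names (`targets.target_name`, `aliases.alias`, `aliases.canonical_name`) are hypothetical and are not taken from the real MTE schema; the example runs against an in-memory database with a few of the Berry_Bowl names from Scott's feedback.

```python
import sqlite3


def targets_missing_aliases(conn):
    """List target names that appear in neither column of the alias table."""
    rows = conn.execute(
        """
        SELECT target_name FROM targets
        WHERE target_name NOT IN (SELECT alias FROM aliases)
          AND target_name NOT IN (SELECT canonical_name FROM aliases)
        ORDER BY target_name
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE targets (target_name TEXT PRIMARY KEY);
    CREATE TABLE aliases (alias TEXT NOT NULL, canonical_name TEXT NOT NULL);
    INSERT INTO targets VALUES ('Berry_Bowl'), ('Berrybowl'), ('Berry_Bowl_Empty');
    INSERT INTO aliases VALUES ('Berrybowl', 'Berry_Bowl');
    """
)
missing = targets_missing_aliases(conn)
print(missing)
```

Running a query of this shape before each delivery would flag any target name that still has no alias entry.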
I checked the bundle (now v3.1) with validate and delivered it to Scott. The bundle files are in
/proj/mte/pds-deliveries/bundle_v3.1.0/
The new bundle was posted at the PDS Geosciences node on Sept. 14, 2022.