wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0
5 stars 0 forks source link

Address PDS feedback on MER-A bundle #46

Closed wkiri closed 2 years ago

wkiri commented 2 years ago

Reviewer 1:

I reviewed the readme.txt file in the document collection and looked at the data files in the MER collection.

5) Components.csv:


Reviewer 2:

The data are presented from a ML / data science viewpoint and not from a planetary scientist (or general science user) viewpoint. The structure makes perfect sense for a programmer, but a scientist has to build a system to link from table to table to get any answers. An average science user who wants to find abstracts that reference a given target has to relink all of the tables using a variety of IDs along the way. Attention has to be paid to the aliases table to find any name variations like misspellings and abbreviations. It’s not that the structure is terrible, it’s just not friendly to the common user.

I think the targets.csv table should have columns something like (target_id, target_name, mte_target_id, met_target_name, mission) and then include both the target_id and mte_target_id in the other tables. The target_id and _name refer to the canonical name, so the set of columns could be (target_id_canonical, target_name_canonical, target_id_mte, target_name_mte, mission).

Perhaps the MTE should include a “first order” table that quickly and easily gets users 90% of the way to a reference:

Plenty of repetition, but this sort of table is easily scanned by human eyes and can be loaded into Excel (or wherever) for column-based sorting. For a user looking at the archive volume as it is, the filename “target.csv” is glowing red-hot and carries the suggestion that the file is a list of official targets. After opening the file, the user sees lots of names with typos and extraneous-looking appendages.


Reviewer 3:

bundle_mars_target_encyclopedia.xml:

collection_mpf.xml, collection_phx.xml, and collection_document.xml:

data_mer/collection_mer2.xml:

document/collection_document.xml:

document/readme.xml:

MTE-schema.jpg and MTE-schema.xml

All data_mpf and data_phx XML labels:

data_mpf/has_property.xml, data_mpf/mentions.xml, data_mpf/targets.xml, data_phx/contains.xml, data_phx/documents.xml, data_phx/has_property.xml, data_phx/mentions.xml, data_phx/targets.xml.

data/mer/has_property.xml, data_mpf/has_property.xml, and data_phx/has_property.xml

data_mpf/properties.xml, data_mpf/sentences.xml, data_phx/sentences.xml

wkiri commented 2 years ago

Changes to address this feedback will be committed to branch issue42-stein since they are related to the same delivery.

wkiri commented 2 years ago

@stevenlujpl As part of these updates, I would like to remove targettab (dictionary mapping MPF and PHX aliases to canonical names) entirely from name_utils.py. We are not using this dictionary any more. However, I checked and the unary_parser.py uses it, as part of old_canonical_target_name(). Do you think it is safe to remove this dependence, and have the unary parser use canonical_name() as the rest of the code now does?

https://github.com/wkiri/MTE/blob/b5e71c84ce7e46145488fef873efce2c0caf2b84/src/unary_parser.py#L277-L295

wkiri commented 2 years ago

All feedback has been addressed, and a complete response plus the updated bundle have been sent to Scott VanBommel.

wkiri commented 2 years ago

Final feedback from peer reviewers:

wkiri commented 2 years ago

Also, I noticed that "Calcium" was listed as both an "Element" (correct) and a "Mineral" (incorrect) in our MER-A (mer2) database. It turns out that this is not due to an incorrect annotation, but instead an NER error . "Ca" is listed as a "Mineral" NER in the source .json file (/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl) when it is part of "Ca-sulfates" in 2006_1472. This is corrected in the annotations to be of type "Element", but the remove-orphans step in update_sqlite.py does not remove "Calcium" from the components table because "Calcium" still appears in the contains table (due to the element appearing in at least one valid contains relation), and the components table is not refreshed based on the annotations (probably so that we can run update_sqlite.py several times to progressively add/update if desired?).

A possible solution would be for the remove-orphans step to regenerate components and properties at the end of processing so that they accurately reflect content in the documents at that point. However, it is worth more thought to determine if this is the best solution.

wkiri commented 2 years ago

The first item in the final requests was addressed by an update to readme.txt.

The second item isn't a request but instead a critique. I definitely see the merit of this comment. One idea might be to add a "see also" table to explicitly connect related terms and save the individual user from that effort. I think that is better than overwriting the original document content in the MTE database. A "see also" table would identify potential connections, without requiring that every occurrence of "ol" is assumed to mean "olivine". We should preserve this in a "future improvements" wishlist, to be investigated if we're able to obtain more resources.

I have created a wishlist here; feel free to add ideas as they arise: https://github.com/wkiri/MTE/wiki/MTE-Wishlist