Address PDS feedback on MER-A bundle

wkiri commented 2 years ago

Reviewer 1:

I reviewed the readme.txt file in the document collection and looked at the data files in the MER collection.

[x] 1) I understand that this dataset is structured like a relational database. However, I feel that a typical user would find it hard to use without some sort of user interface to connect the separate data files.
[x] 2) The documentation should add a note that MER2 corresponds to the Spirit rover that is also sometimes referred to as MERA.
[x] 3) The documentation should make it clear that the resulting target list is based on what has been published in LPSC abstracts and is not necessarily a complete list of targets defined and observed by the mission.
[x] 4) It seems to me that the structure of the documents product would make it hard to expand it to include journal publications. For example, it includes a conference name, which is not applicable to journals and an abstract number that may not be applicable to non-LPI conferences. Maybe the product should be renamed to abstracts or lpsc_abstracts.

5) Components.csv:

[x] - Some listed components are not elements or minerals, for example Hawaiite and Hyaloclastite.
[x] - Abbreviations should be expanded. For example, Feot to FeO-total, px to pyroxene.
[x] - Single and plural entries of the same thing should be combined. The reason for having both pyroxene and pyroxenes, for example, is not clear. If it is important, please explain. Another example is Npox and Np-ox.
[x] - Feot and Feotot are listed as minerals. The abstract could be referring to the total iron oxide content in the composition of a sample and not necessarily to an iron-oxide mineral.
[x] - It is not clear why the second component of chemical compounds is shown in lower case and not in the standard chemical notation for elements. This also applies to the contains product.

Reviewer 2:

The data are presented from an ML / data science viewpoint and not from the viewpoint of a planetary scientist (or general science user). The structure makes perfect sense for a programmer, but a scientist has to build a system to link from table to table to get any answers. An average science user who wants to find abstracts that reference a given target has to relink all of the tables using a variety of IDs along the way. Attention has to be paid to the aliases table to find any name variations like misspellings and abbreviations. It’s not that the structure is terrible, it’s just not friendly to the common user.

I think the targets.csv table should have columns something like (target_id, target_name, mte_target_id, mte_target_name, mission) and then include both the target_id and mte_target_id in the other tables. The target_id and _name refer to the canonical name, so the set of columns could be (target_id_canonical, target_name_canonical, target_id_mte, target_name_mte, mission).

Perhaps the MTE should include a “first order” table that quickly and easily gets users 90% of the way to a reference:

ID columns (the target names and IDs)
Reference type (literature mention, element/mineral/property)
Reference value (for element/mineral/property: “oxygen”, “carbonate”, “lag_deposit”, etc.)
Document reference (id, title, author, year, URL)

Plenty of repetition, but this sort of table is easily scanned by human eyes and can be loaded into Excel (or wherever) for column-based sorting. For a user looking at the archive volume as it is, the filename “targets.csv” is glowing red-hot and carries the suggestion that the file is a list of official targets. After opening the file, the user sees lots of names with typos and extraneous-looking appendages.

[x] In short, my primary recommendation is that the science community probably is better served if the MTE product has two distinct parts: a user-friendly table (described above) and the formal computer science portion that is not the first thing a normal user would see.
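
A minimal sketch of how such a "first order" table could be assembled from the relational tables, assuming they are loaded into SQLite. The table and column names below (`targets`, `documents`, `mentions`, `contains`, `doc_id`, `component_name`, and so on) are simplified assumptions for illustration and may not match the actual MTE schema.

```python
import sqlite3

# Hypothetical, simplified schema (an assumption, not the actual MTE schema):
#   targets(target_id, target_name)
#   documents(doc_id, title, year)
#   mentions(target_id, doc_id)
#   contains(target_id, component_name, doc_id)
FIRST_ORDER_SQL = """
SELECT t.target_id        AS target_id,
       t.target_name      AS target_name,
       'literature mention' AS ref_type,
       NULL               AS ref_value,
       d.doc_id           AS doc_id,
       d.title            AS title,
       d.year             AS year
FROM mentions m
JOIN targets t   ON t.target_id = m.target_id
JOIN documents d ON d.doc_id = m.doc_id
UNION ALL
SELECT t.target_id, t.target_name, 'element/mineral', c.component_name,
       d.doc_id, d.title, d.year
FROM "contains" c
JOIN targets t   ON t.target_id = c.target_id
JOIN documents d ON d.doc_id = c.doc_id
ORDER BY target_name, doc_id, ref_type
"""


def first_order_rows(conn):
    """Return one flat row per (target, reference) pair, ready to dump to CSV."""
    return conn.execute(FIRST_ORDER_SQL).fetchall()
```

The rows repeat target and document fields on purpose, so the result can be written out with the `csv` module and sorted by column in a spreadsheet without any further joins.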

Reviewer 3:

bundle_mars_target_encyclopedia.xml:

[x] - I would make the bundle V2.0, since an entire collection was added.
[x] - Modification_Detail description should say that the data_mer2 collection was added. Maybe something like, "Added data_mer2 collection. Also added aliases table to data_mpf and data_phx collections" if that's accurate.

collection_mpf.xml, collection_phx.xml, and collection_document.xml:

[x] - The Version 1.2 files online are named collection_mpf_inventory.xml, collection_phx_inventory.xml, and collection_document_inventory.xml. Changing the file names is going to mess things up at the EN. If you insist on changing them, I think they might have to start over at Version 1.0, but we'd need to check with Richard. I'd just stick with the online names, and follow the same naming convention for the new MER2 inventory (collection_mer2_inventory.xml). The recommendation is to revert to the “inventory” naming convention because the online files have already been registered with EN, and renaming them might cause problems. Also, it’s fine to have “inventory” in the collection label filename. Many of the bundles at GEO use this convention.

data_mer/collection_mer2.xml:

[x] - Rename to collection_mer2_inventory.xml per above.
[x] - Change to DOS line endings.
[x] - 1997-07-04Z 2020-03-16Z
This can't be right. I noticed that all the labels (including MPF and PHX) have these dates (the bundle XML is slightly different). Where do these dates come from? Shouldn't they be mission-specific? Scott: I agree with the reviewer and think what they were getting at was to have the bundle span all time, and the individual collections span only from when the mission data starts (i.e., MER starts in 2003 or 2004, not 1997).

document/collection_document.xml:

[x] - Modification_Detail description should say that the MTE-schema.jpg file was added and readme.txt was updated. "Add aliases table" should be removed.

document/readme.xml:

[x] - Modification_Detail description should say something like "Updated to reflect addition of aliases table and mer2 data collection."

MTE-schema.jpg and MTE-schema.xml

[x] - Lowercase the file names. Do the same for the pointer in MTE-schema.xml. Not a requirement, but we favour lowercase filenames at GEO. Also, this is the only file in the bundle that is not lowercase. LIDs have to be lowercase, so we try to have filenames match.

All data_mpf and data_phx XML labels:

[x] Why was the modification date for V1.2 changed from 2021-06-07 to 2021-06-01? The labels online have 2021-06-07.

data_mpf/has_property.xml, data_mpf/mentions.xml, data_mpf/targets.xml, data_phx/contains.xml, data_phx/documents.xml, data_phx/has_property.xml, data_phx/mentions.xml, data_phx/targets.xml.

[x] - V1.3 Modification_Detail descriptions say "Add aliases table." These do not reflect the edits that were actually made to the individual files.

data_mer/has_property.xml, data_mpf/has_property.xml, and data_phx/has_property.xml

[x] - Change to DOS line endings.
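
For the line-ending items, a minimal sketch of a conversion helper, assuming the labels are plain text files small enough to read into memory. The function names are illustrative and are not part of the MTE code base.

```python
def to_dos_line_endings(data: bytes) -> bytes:
    """Normalize LF, bare CR, or CRLF line endings to CRLF."""
    # First collapse every style to LF so existing CRLF pairs are not doubled.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", b"\r\n")


def convert_file(path) -> bool:
    """Rewrite the file at `path` with CRLF endings; return True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    converted = to_dos_line_endings(original)
    if converted != original:
        with open(path, "wb") as f:
            f.write(converted)
    return converted != original
```

Working on bytes avoids any newline translation by Python's text mode, and the conversion is idempotent, so it is safe to run over labels that already use DOS endings.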

data_mpf/properties.xml, data_mpf/sentences.xml, data_phx/sentences.xml

[x] Should be V1.3, since changes were made to the files. Remember to update the inventory file also.

Scott VanBommel:
[x] I think what is missing from the readme file is discussion of how the target names were determined, and how the canonical names were selected when they appear in the aliases list. (Scott's comment: this ties into the concern, which I share, that a user accepts the information presented as absolute truth - however, targets.csv and aliases.csv are neither authoritative nor comprehensive.)
[x] Change the part in the readme file that says “when a target name is re-used” to “when a target name is used in multiple missions”

wkiri commented 2 years ago

Changes to address this feedback will be committed to branch issue42-stein since they are related to the same delivery.

wkiri commented 2 years ago

@stevenlujpl As part of these updates, I would like to remove targettab (dictionary mapping MPF and PHX aliases to canonical names) entirely from name_utils.py. We are not using this dictionary any more. However, I checked and the unary_parser.py uses it, as part of old_canonical_target_name(). Do you think it is safe to remove this dependence, and have the unary parser use canonical_name() as the rest of the code now does?

https://github.com/wkiri/MTE/blob/b5e71c84ce7e46145488fef873efce2c0caf2b84/src/unary_parser.py#L277-L295

wkiri commented 2 years ago

All feedback has been addressed, and a complete response plus the updated bundle have been sent to Scott VanBommel.

wkiri commented 2 years ago

Final feedback from peer reviewers:

[x] I would like to see an addition in the “Methods” section of the readme.txt file that includes response points about them choosing not to make assumptions about properties, minerals, etc. Also how they chose to reserve the Element category for items in the periodic table and Mineral for more complex entities or measurements. Even if this information is captured in the referenced paper, it should be included in the readme.txt file, too.
[x] Keeping values as abbreviations in some cases and not in others for the same thing, or some with hyphens and others not, or other similar variations for basically the same value, seems to me to be problematic for any search interface. How do you build a search capability to catch these variations without such a tool analyzing the database, finding these problems, and building tables to correct them?

wkiri commented 2 years ago

Also, I noticed that "Calcium" was listed as both an "Element" (correct) and a "Mineral" (incorrect) in our MER-A (mer2) database. It turns out that this is not due to an incorrect annotation, but instead to an NER error. "Ca" is tagged as a "Mineral" entity in the source .json file (/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl) when it is part of "Ca-sulfates" in 2006_1472. The annotations correct this to type "Element", but the remove-orphans step in update_sqlite.py does not remove "Calcium" from the components table, for two reasons: "Calcium" still appears in the contains table (the element appears in at least one valid contains relation), and the components table is not refreshed based on the annotations (probably so that we can run update_sqlite.py several times to progressively add/update if desired?).

A possible solution would be for the remove-orphans step to regenerate components and properties at the end of processing so that they accurately reflect content in the documents at that point. However, it is worth more thought to determine if this is the best solution.
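
A rough sketch of what that regeneration could look like, not the actual update_sqlite.py logic. It assumes a hypothetical `components(component_name, component_label)` table, a `contains` table with a `component_name` column, and a list of (name, label) pairs taken from the reviewed annotations; all of these names are assumptions.

```python
import sqlite3


def regenerate_components(conn, annotated_components):
    """Rebuild the components table from the current annotations.

    annotated_components: iterable of (component_name, component_label)
    pairs from the reviewed annotations (hypothetical input format).
    Only components still used in at least one contains relation are kept.
    """
    in_use = {
        row[0]
        for row in conn.execute('SELECT DISTINCT component_name FROM "contains"')
    }
    fresh = sorted({(name, label) for name, label in annotated_components
                    if name in in_use})
    with conn:  # single transaction: either the whole rebuild applies or none
        conn.execute("DELETE FROM components")
        conn.executemany(
            "INSERT INTO components (component_name, component_label) "
            "VALUES (?, ?)",
            fresh,
        )
    return fresh
```

Because the table is rebuilt from the annotations rather than pruned, a stale entry such as a "Mineral" label for "Calcium" would disappear even though "Calcium" is still referenced by a contains relation. The trade-off is that repeated incremental runs would need the full set of annotations each time.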

wkiri commented 2 years ago

The first item in the final requests was addressed by an update to readme.txt.

The second item isn't a request but instead a critique. I definitely see the merit of this comment. One idea might be to add a "see also" table to explicitly connect related terms and save the individual user from that effort. I think that is better than overwriting the original document content in the MTE database. A "see also" table would identify potential connections, without requiring that every occurrence of "ol" is assumed to mean "olivine". We should preserve this in a "future improvements" wishlist, to be investigated if we're able to obtain more resources.
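
To make the "see also" idea concrete, here is a minimal sketch assuming a hypothetical two-column `see_also(term, related_term)` table in SQLite. The table name, column names, and helper functions are illustrative only; nothing like this exists in the MTE database yet.

```python
import sqlite3


def add_see_also(conn, pairs):
    """Record related terms in both directions without altering stored data."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS see_also ("
            "term TEXT, related_term TEXT, PRIMARY KEY (term, related_term))"
        )
        for a, b in pairs:
            conn.executemany(
                "INSERT OR IGNORE INTO see_also VALUES (?, ?)",
                [(a, b), (b, a)],
            )


def expand_term(conn, term):
    """Return the search term followed by any related terms, sorted."""
    rows = conn.execute(
        "SELECT related_term FROM see_also WHERE term = ? "
        "ORDER BY related_term",
        (term,),
    )
    return [term] + [row[0] for row in rows]
```

A search interface could call `expand_term` to widen a query for "ol" to include "olivine" while the original document content stays untouched, so the link is offered as a suggestion and not asserted for every occurrence.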

I have created a wishlist here; feel free to add ideas as they arise: https://github.com/wkiri/MTE/wiki/MTE-Wishlist

wkiri / MTE

Address PDS feedback on MER-A bundle #46
