Closed wkiri closed 2 years ago
Changes to address this feedback will be committed to branch issue42-stein
since they are related to the same delivery.
@stevenlujpl As part of these updates, I would like to remove targettab
(dictionary mapping MPF and PHX aliases to canonical names) entirely from name_utils.py
. We are not using this dictionary any more. However, I checked and the unary_parser.py
uses it, as part of old_canonical_target_name()
. Do you think it is safe to remove this dependence, and have the unary parser use canonical_name()
as the rest of the code now does?
All feedback has been addressed, and a complete response plus the updated bundle have been sent to Scott VanBommel.
Final feedback from peer reviewers:
Also, I noticed that "Calcium" was listed as both an "Element" (correct) and a "Mineral" (incorrect) in our MER-A (mer2) database. It turns out that this is not due to an incorrect annotation, but instead an NER error . "Ca" is listed as a "Mineral" NER in the source .json file (/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397.jsonl
) when it is part of "Ca-sulfates" in 2006_1472. This is corrected in the annotations to be of type "Element", but the remove-orphans step in update_sqlite.py
does not remove "Calcium" from the components
table because "Calcium" still appears in the contains
table (due to the element appearing in at least one valid contains relation), and the components
table is not refreshed based on the annotations (probably so that we can run update_sqlite.py
several times to progressively add/update if desired?).
A possible solution would be for the remove-orphans step to regenerate components
and properties
at the end of processing so that they accurately reflect content in the documents at that point. However, it is worth more thought to determine if this is the best solution.
The first item in the final requests was addressed by an update to readme.txt
.
The second item isn't a request but instead a critique. I definitely see the merit of this comment. One idea might be to add a "see also" table to explicitly connect related terms and save the individual user from that effort. I think that is better than overwriting the original document content in the MTE database. A "see also" table would identify potential connections, without requiring that every occurrence of "ol" is assumed to mean "olivine". We should preserve this in a "future improvements" wishlist, to be investigated if we're able to obtain more resources.
I have created a wishlist here; feel free to add ideas as they arise: https://github.com/wkiri/MTE/wiki/MTE-Wishlist
Reviewer 1:
I reviewed the readme.txt file in the document collection and looked at the data files in the MER collection.
[x] 1) I understand that this dataset is structured like a relational database. However, I feel that a typical user would find it hard to use without some sort of user interface to connect the separate data files.
[x] 2) The documentation should add a note that MER2 corresponds to the Spirit rover that is also sometimes referred to as MERA.
[x] 3) The documentation should make it clear that the resulting target list is based on what has been published in LPSC abstracts and is not necessarily a complete list of targets defined and observed by the mission.
[x] 4) It seems to me that the structure of the documents product would make it hard to expand it to include journal publications. For example, it includes a conference name, which is not applicable to journals and an abstract number that may not be applicable to non-LPI conferences. Maybe the product should be renamed to abstracts or lpsc_abstracts.
5) Components.csv:
Reviewer 2:
The data are presented from a ML / data science viewpoint and not from a planetary scientist (or general science user) viewpoint. The structure makes perfect sense for a programmer, but a scientist has to build a system to link from table to table to get any answers. An average science user who wants to find abstracts that reference a given target has to relink all of the tables using a variety of IDs along the way. Attention has to be paid to the aliases table to find any name variations like misspellings and abbreviations. It’s not that the structure is terrible, it’s just not friendly to the common user.
I think the targets.csv table should have columns something like (target_id, target_name, mte_target_id, met_target_name, mission) and then include both the target_id and mte_target_id in the other tables. The target_id and _name refer to the canonical name, so the set of columns could be (target_id_canonical, target_name_canonical, target_id_mte, target_name_mte, mission).
Perhaps the MTE should include a “first order” table that quickly and easily gets users 90% of the way to a reference:
Plenty of repetition, but this sort of table is easily scanned by human eyes and can be loaded into Excel (or wherever) for column-based sorting. For a user looking at the archive volume as it is, the filename “target.csv” is glowing red-hot and carries the suggestion that the file is a list of official targets. After opening the file, the user sees lots of names with typos and extraneous-looking appendages.
Reviewer 3:
bundle_mars_target_encyclopedia.xml:
collection_mpf.xml, collection_phx.xml, and collection_document.xml:
data_mer/collection_mer2.xml:
This can't be right. I noticed that all the labels (including MPF and PHX) have these dates (the bundle XML is slightly different). Where do these dates come from? Shouldn't they be mission-specific? Scott: I agree with the reviewer and think what they were getting at was to have the bundle span all time, and the individual collections span only from when the mission data starts (i.e., MER starts in 03 or 04, not 97).
document/collection_document.xml:
document/readme.xml:
MTE-schema.jpg and MTE-schema.xml
All data_mpf and data_phx XML labels:
data_mpf/has_property.xml, data_mpf/mentions.xml, data_mpf/targets.xml, data_phx/contains.xml, data_phx/documents.xml, data_phx/has_property.xml, data_phx/mentions.xml, data_phx/targets.xml.
data/mer/has_property.xml, data_mpf/has_property.xml, and data_phx/has_property.xml
data_mpf/properties.xml, data_mpf/sentences.xml, data_phx/sentences.xml
[x] Should be V1.3, since changes were made to the files. Remember to update the inventory file also.
Scott VanBommel:
[x] I think what is missing from the readme file is discussion of how the target names were determined, and how the canonical names were selected when they appear in the aliases list. (Scott's comment: this ties into the concern, which I share, that a user accepts the information presented as absolute truth - however, targets.csv and aliases.csv are neither authoritative nor comprehensive.)
[x] Change the part in the readme file that says “when a target name is re-used” to “when a target name is used in multiple missions”