wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Add aliases table to MTE DB #9

wkiri closed 2 years ago

wkiri commented 3 years ago

This would allow us to link known aliases of the same target, like Jake_M and Jake_Matijevic.

wkiri commented 2 years ago

Btw, Kiri noted that "magnestite" appears in our PHX database as a Mineral, but it is likely a typo for "magnetite". So we may want to handle aliasing for entities beyond just Target.

@stevenlujpl has some ideas for tools that can auto-suggest likely spelling corrections. For Target this might not make sense, but for Element, Mineral, and Property it could be useful.
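
As a rough illustration of what an auto-suggest tool for Element, Mineral, and Property names could look like, here is a minimal sketch using Python's standard `difflib`. The vocabulary list, the function name `suggest_spelling`, and the cutoff value are all assumptions made for this example; they are not the tools @stevenlujpl has in mind and they are not taken from the MTE code.

```python
import difflib

# Hypothetical vocabulary of accepted mineral names (assumption for illustration);
# a real list would come from the project's own reference files.
KNOWN_MINERALS = ["magnetite", "hematite", "olivine", "pyroxene", "jarosite"]


def suggest_spelling(name, vocabulary, cutoff=0.8):
    """Return likely intended spellings for `name`, best match first.

    An empty list means the name is already known or nothing is close enough.
    """
    if name in vocabulary:
        return []  # already an accepted spelling, nothing to suggest
    # get_close_matches ranks candidates by difflib's similarity ratio and
    # drops any candidate scoring below `cutoff`.
    return difflib.get_close_matches(name, vocabulary, n=3, cutoff=cutoff)


suggestions = suggest_spelling("magnestite", KNOWN_MINERALS)
print(suggestions)
```

A fairly strict cutoff keeps unrelated minerals that merely share the "-ite" suffix out of the suggestions. Any suggestion would still need human review before becoming an alias entry.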

Similarly, Google has a "did you mean" capability; I'm not sure whether this is available as a public API. [Screenshot: Screen Shot 2021-09-15 at 11 34 28 AM]

wkiri commented 2 years ago

Next steps:

Here are some known MER-A aliases:

'Bb': 'Bread Box',
'Bellin gshausen': 'Bellingshausen',
'Champange': 'Champagne',
'Clara Zap': 'Clara Zapf',
'Commanche': 'Comanche',
'East Val ley': 'East Valley',
'Fuzzysmith': 'Fuzzy_Smith',
'Homeplate': 'Home_Plate',
'Humphries': 'Humphrey',
'Humphry': 'Humphrey',
'Husband Hills': 'Husband_Hill',
'Methuslah': 'Methuselah',
'Pearls': 'String_of_Pearls',
'Pes apallo': 'Pesapallo',
'POG': 'Pot_of_Gold',
"Pot o' Gold": 'Pot_of_Gold',
'Pot-of-Gold': 'Pot_of_Gold',
'PR': 'Paso_Robles',
'Tor quas': 'Torquas',

stevenlujpl commented 2 years ago

@wkiri I've updated the following scripts to add an aliases table to the MTE DB. Please note that the changes are currently in the issue9-aliases branch.

I tested the scripts using 7 MER2 documents in the /home/youlu/MTE/working_dir/aliases_table/docs/ directory. The aliases CSV file I used for testing is /home/youlu/MTE/working_dir/aliases_table/aliases.csv; its format is shown below. Each row is one alias entry: the first column is the verbatim target name and the second column is the canonical target name, separated by a comma.

Bellin gshausen,Bellingshausen
Champange,Champagne
Commanche,Comanche
East Val ley,East Valley
Homeplate,Home_Plate

The final DB file that contains the newly populated aliases table can be found at /home/youlu/MTE/working_dir/aliases_table/mer2.db.
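
For readers following along, here is a minimal sketch of how a CSV in the format above could be loaded into an aliases table with Python's `csv` and `sqlite3` modules. The table and column names (`aliases`, `verbatim_name`, `canonical_name`) and the function `load_aliases` are assumptions for illustration; the actual schema and code are in the issue9-aliases branch and may differ.

```python
import csv
import io
import sqlite3

# Sample rows copied from the CSV format described above.
SAMPLE_CSV = """Bellin gshausen,Bellingshausen
Champange,Champagne
Commanche,Comanche
East Val ley,East Valley
Homeplate,Home_Plate
"""


def load_aliases(conn, csv_file):
    """Create a (hypothetical) aliases table and fill it from a CSV file object.

    Returns the number of alias rows read from the file.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aliases ("
        "verbatim_name TEXT PRIMARY KEY, "
        "canonical_name TEXT NOT NULL)"
    )
    # Keep only well-formed two-column rows; strip stray whitespace.
    rows = [
        (row[0].strip(), row[1].strip())
        for row in csv.reader(csv_file)
        if len(row) == 2
    ]
    conn.executemany("INSERT OR REPLACE INTO aliases VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory DB so the sketch leaves no files
loaded = load_aliases(conn, io.StringIO(SAMPLE_CSV))
print(loaded, "aliases loaded")
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by something like `open(path, newline="", encoding="utf-8")`.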

stevenlujpl commented 2 years ago

@wkiri We can talk about these changes in the next MTE meeting. If everything looks good, I will update the README file and add the aliases table schema to the wiki page.

stevenlujpl commented 2 years ago

I think we also need to update the PDS4 bundle generation script to include the aliases table.

wkiri commented 2 years ago

@stevenlujpl Fantastic! I'll give this a try soon.

wkiri commented 2 years ago

@stevenlujpl I tried this out on two documents with some aliases and confirmed that it correctly generated the aliases table. For these two MER-A documents, there were previously 1062 distinct targets in the targets table; with the new code, there are 1139 (since all naming variants are now included instead of being mapped to canonical names).

One thing we might want to consider is this step in update_sqlite.py:

[INFO] Known mission targets from the target list /home/wkiri/Research/MTE/git/ref/MER/MERA-targets-final.txt have been added to the DB file.

This step may actually be one place where we do still want to map to canonical target names, so we do not add all of the naming variants that are in that list. What do you think? Effectively, the naming variants should appear in the aliases table instead. I don't think we want to remove aliases from the target list file (since they were included to increase our recall of targets when processing the files), but we probably don't want them to occur in the targets table in the final SQL file.

stevenlujpl commented 2 years ago

@wkiri Great suggestion. I've modified update_sqlite.py to map the targets provided by the target_list argument to canonical names.
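
The idea can be sketched as follows: map each name in the target list to its canonical form, then drop duplicates so that naming variants never reach the targets table. This is only an illustration; `canonicalize_targets` and the small alias dictionary are hypothetical and do not reproduce the actual change in update_sqlite.py.

```python
def canonicalize_targets(target_names, aliases):
    """Map each target name to its canonical name and drop duplicates.

    `aliases` maps verbatim name -> canonical name. Names without an alias
    entry are treated as already canonical. Input order is preserved.
    """
    seen = set()
    canonical = []
    for name in target_names:
        mapped = aliases.get(name, name)
        if mapped not in seen:
            seen.add(mapped)
            canonical.append(mapped)
    return canonical


# A few of the MER-A aliases listed earlier in this thread.
ALIASES = {
    "Homeplate": "Home_Plate",
    "Commanche": "Comanche",
    "POG": "Pot_of_Gold",
}

# Illustrative target list that mixes canonical names and naming variants.
target_list = ["Home_Plate", "Homeplate", "Commanche", "Comanche", "POG", "Humphrey"]
targets_for_db = canonicalize_targets(target_list, ALIASES)
print(targets_for_db)
```

The target list file itself stays untouched, so the variants can still improve recall during document processing; only the names written to the targets table are collapsed.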

stevenlujpl commented 2 years ago

@wkiri I've added the ability to generate the PDS4 bundle with the aliases table. I also validated the test mer2 bundle with the validate tool, and everything looks good. Please note that all the changes I made are currently in the issue9-aliases branch.

I want to ask your opinion on the <version_id> and <Modification_History> fields. Currently, I have only updated the <version_id> and <Modification_History> fields for the bundle XML file (please see the links below): I changed the version id from 1.2 to 1.3, and I updated the modification history to document the addition of the aliases table. The <version_id> and <Modification_History> fields of the collection XML files (e.g., mer1_targets.txml, mer1_components.txml, etc.) remain unchanged. The question is whether we also want to update the <version_id> and <Modification_History> fields of the collection XML files. I think the answer is no. Technically, we only changed the bundle (because of the addition of the aliases table), not the collection CSV/XML files. Please let me know what you think. Thanks.

https://github.com/wkiri/MTE/blob/bcbdc8f3dfea58c0fe05026d0fa560c8caa6a132/src/pds4_bundle_template/bundle_mars_target_encyclopedia.txml#L9

https://github.com/wkiri/MTE/blob/bcbdc8f3dfea58c0fe05026d0fa560c8caa6a132/src/pds4_bundle_template/bundle_mars_target_encyclopedia.txml#L22-L39

stevenlujpl commented 2 years ago

The mer2 test bundle can be found at /home/youlu/MTE/working_dir/aliases_table/mars_target_encyclopedia.

wkiri commented 2 years ago

Thanks, @stevenlujpl ! I've updated the description of options for update_sqlite.py in the README here (please check for accuracy when you have a chance): https://github.com/wkiri/MTE/blob/master/src/README.md

I think we need to change canonical_name() to look up targets in the aliases table rather than using the targettab dictionary.
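
A minimal sketch of what a database-backed lookup might look like is below. The function name `lookup_canonical_name`, its signature, and the table and column names are assumptions for illustration; the existing canonical_name() and the real aliases schema may look different.

```python
import sqlite3


def lookup_canonical_name(conn, name):
    """Return the canonical name for `name` from a (hypothetical) aliases table.

    If `name` has no alias entry, it is returned unchanged, on the assumption
    that it is already canonical.
    """
    row = conn.execute(
        "SELECT canonical_name FROM aliases WHERE verbatim_name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else name


# Build a tiny in-memory DB so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aliases ("
    "verbatim_name TEXT PRIMARY KEY, "
    "canonical_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO aliases VALUES (?, ?)",
    [("Homeplate", "Home_Plate"), ("Champange", "Champagne")],
)

print(lookup_canonical_name(conn, "Homeplate"))
print(lookup_canonical_name(conn, "Humphrey"))
```

Using a parameterized query (the `?` placeholder) means target names containing quotes, such as "Pot o' Gold", need no special escaping.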

wkiri commented 2 years ago

For the bundle and versioning, I think we do need to update the version of the collection inventory XML files for MPF and PHX. Currently they indicate that in v1.2 we added the contains, has_property, components, and properties tables. It makes sense to add a v1.3 entry for the addition of the aliases table, since the inventory will have a new member (aliases.csv).

Technically we would not need to change the version of the unaffected files, like mpf_targets.csv (.xml). But it looks like we did change it for v1.2 even though it was not affected:

https://github.com/wkiri/MTE/blob/bcbdc8f3dfea58c0fe05026d0fa560c8caa6a132/src/pds4_bundle_template/mpf_targets.txml#L22-L34

and currently the full version history of the bundle is in the new file, mpf_aliases.xml (was that intentional?):

https://github.com/wkiri/MTE/blob/bcbdc8f3dfea58c0fe05026d0fa560c8caa6a132/src/pds4_bundle_template/mpf_aliases.txml#L22-L34

It may make sense for mpf_aliases.xml to be at v1.0, while mpf_targets.xml might more accurately be not v1.2 but v1.1 (because using aliases may change the total number of targets in the mentions table), while mpf_properties.xml (etc.) should be at their own v1.0 even though they were introduced in v1.2 of the bundle. But this could get quite confusing. The simpler route is to advance all of the file versions to 1.3 so we know they are the versions that were included with v1.3 of the bundle. I think this may have guided the logic of our v1.2 versioning as well.

wkiri commented 2 years ago

When processing MPF aliases, I discovered that we need to support UTF-8 names as well (Soufflé).
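
One possible way to handle this, sketched below, is to decode the aliases CSV explicitly as UTF-8 and normalize names to a single Unicode form so that visually identical spellings compare equal. The NFC normalization step, the function `read_alias_rows`, and the sample row (`Souffle` as a variant of `Soufflé`) are assumptions for illustration and are not taken from the MTE code or data.

```python
import csv
import io
import unicodedata


def read_alias_rows(raw_bytes):
    """Decode alias CSV bytes as UTF-8 and return (verbatim, canonical) pairs.

    Each field is normalized to NFC so that a precomposed accented letter and
    the same letter written with a combining accent compare equal.
    """
    text = raw_bytes.decode("utf-8")
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rows.append(tuple(unicodedata.normalize("NFC", field.strip()) for field in row))
    return rows


# Hypothetical row: the canonical name is written here with a combining
# acute accent (U+0301) to show the effect of normalization.
sample = "Souffle,Souffle\u0301\n".encode("utf-8")
alias_rows = read_alias_rows(sample)
print(alias_rows)
```

When reading from disk instead, passing `encoding="utf-8"` to `open()` avoids depending on the platform's default encoding.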