Closed by wkiri 2 years ago
Btw, Kiri noted that "magnestite" appears in our PHX database as a Mineral, but it is likely a typo for "magnetite". So we may want to handle aliasing for entities beyond just Target.
@stevenlujpl has some ideas for tools that can auto-suggest likely spelling corrections. For Target this might not make sense, but for Element, Mineral, and Property it could be useful.
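As a rough illustration of the auto-suggest idea, here is a minimal sketch using Python's standard-library difflib. It is not necessarily the tool @stevenlujpl has in mind; the vocabulary list and the suggest_spelling helper are hypothetical and exist only for this example.

```python
import difflib

# Hypothetical vocabulary; in practice this would come from a curated mineral list.
KNOWN_MINERALS = ["magnetite", "hematite", "olivine", "pyroxene"]


def suggest_spelling(name, vocabulary=KNOWN_MINERALS, cutoff=0.8):
    """Return known names that closely match `name`, best match first."""
    return difflib.get_close_matches(name.lower(), vocabulary, n=3, cutoff=cutoff)


print(suggest_spelling("magnestite"))
```

Suggestions like these would still need human review before becoming aliases, since distinct real minerals can be spelled very similarly (magnesite and magnetite, for instance).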
Similarly, Google has a "did you mean" capability; I'm not sure if this is a public API:
Next steps:
- ingest_sqlite.py to NOT map targets to canonical_target_name()
- ingest_sqlite.py: aliases table in schema (and sqlite_mte.py): target_name (verbatim), canonical_name
- update_sqlite.py to be able to read in an aliases.csv file to populate the aliases table
- update_sqlite.py to NOT map targets to canonical_target_name()
- aliases table
- generate_pds4_bundle.py to create aliases XML label file
- aliases.csv and re-generate MPF bundle content (testing)
- aliases.csv and re-generate PHX bundle content (testing)
- targets table in bundles (mentions etc. should not change)
- aliases.csv and generate DB

Here are some known MER-A aliases:
'Bb': 'Bread Box',
'Bellin gshausen': 'Bellingshausen',
'Champange': 'Champagne',
'Clara Zap': 'Clara Zapf',
'Commanche': 'Comanche',
'East Val ley': 'East Valley',
'Fuzzysmith': 'Fuzzy_Smith',
'Homeplate': 'Home_Plate',
'Humphries': 'Humphrey',
'Humphry': 'Humphrey',
'Husband Hills': 'Husband_Hill',
'Methuslah': 'Methuselah',
'Pearls': 'String_of_Pearls',
'Pes apallo': 'Pesapallo',
'POG': 'Pot_of_Gold',
"Pot o' Gold": 'Pot_of_Gold',
'Pot-of-Gold': 'Pot_of_Gold',
'PR': 'Paso_Robles',
'Tor quas': 'Torquas',
@wkiri I've updated the following scripts to add an aliases table to the MTE DB. Please note that the changes I made are in the issue9-aliases branch now.
I tested the scripts using 7 MER2 documents in the /home/youlu/MTE/working_dir/aliases_table/docs/ directory. The aliases CSV file I used for testing is /home/youlu/MTE/working_dir/aliases_table/aliases.csv, and its format is shown below. Each row is an alias entry: the first column is the verbatim target name and the second column is the canonical target name, separated by a comma.
Bellin gshausen,Bellingshausen
Champange,Champagne
Commanche,Comanche
East Val ley,East Valley
Homeplate,Home_Plate
The final DB file that contains the newly populated aliases table can be found at /home/youlu/MTE/working_dir/aliases_table/mer2.db.
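To make the CSV-to-table step concrete, here is a minimal sketch of loading two-column alias rows into an SQLite aliases table. It is not the code in the issue9-aliases branch: the load_aliases helper, the primary-key choice, and the in-memory database are assumptions for illustration, and only the column names target_name and canonical_name come from the plan above.

```python
import csv
import io
import sqlite3


def load_aliases(conn, csv_file):
    """Populate the aliases table from (verbatim name, canonical name) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aliases ("
        "target_name TEXT PRIMARY KEY, canonical_name TEXT NOT NULL)"
    )
    # Skip blank or malformed rows that lack both columns.
    rows = [(r[0].strip(), r[1].strip()) for r in csv.reader(csv_file) if len(r) >= 2]
    conn.executemany("INSERT OR REPLACE INTO aliases VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# In-memory stand-ins for aliases.csv and the DB file.
sample = io.StringIO("Champange,Champagne\nHomeplate,Home_Plate\n")
conn = sqlite3.connect(":memory:")
print(load_aliases(conn, sample))
```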
@wkiri We can talk about these changes in the next MTE meeting. If everything looks good, I will update the README file and add the aliases table schema on the WIKI page.
I think we also need to update the PDS4 bundle generation script to include the aliases table.
@stevenlujpl Fantastic! I'll give this a try soon.
@stevenlujpl I tried this out on two documents with some aliases and confirmed that it correctly generated the aliases table. For these two MER-A documents, there were previously 1062 distinct targets in the targets table, and with the new code there are 1139 (since all naming variants are included instead of being mapped to canonical names).
One thing we might want to consider is this step in update_sqlite.py:
[INFO] Known mission targets from the target list /home/wkiri/Research/MTE/git/ref/MER/MERA-targets-final.txt have been added to the DB file.
This step may actually be one place where we do still want to map to canonical target names, so we do not add all of the naming variants that are in that list. What do you think? Effectively, the naming variants should appear in the aliases table instead. I don't think we want to remove aliases from the target list file (since they were included to increase our recall of targets when processing the files), but we probably don't want them to occur in the targets table in the final SQL file.
@wkiri Great suggestion. I've modified update_sqlite.py to map the targets provided by the target_list argument to canonical names.
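The mapping described here can be sketched as follows. This is an illustration only, not the actual change to update_sqlite.py; canonicalize_target_list is a hypothetical helper, and the aliases are passed in as a plain dict for simplicity.

```python
def canonicalize_target_list(target_names, aliases):
    """Collapse naming variants so only canonical names reach the targets table."""
    canonical = []
    seen = set()
    for name in target_names:
        # Names with no alias entry are treated as already canonical.
        mapped = aliases.get(name, name)
        if mapped not in seen:
            seen.add(mapped)
            canonical.append(mapped)
    return canonical


aliases = {"Humphries": "Humphrey", "Humphry": "Humphrey"}
targets = ["Humphrey", "Humphries", "Humphry", "Comanche"]
print(canonicalize_target_list(targets, aliases))  # ['Humphrey', 'Comanche']
```

The target list file itself stays untouched, so the variants still help recall during document processing; they are only collapsed at the point where known targets are added to the DB.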
@wkiri I've added the ability to generate the PDS4 bundle with the aliases table. I also validated the test mer2 bundle with the validate tool, and everything looks good. Please note that all the changes I made are in the issue9-aliases branch now.
I want to ask your opinion on the <version_id> and <Modification_History> fields. Currently, I have only updated these fields for the bundle XML file (please see below): I changed the version id from 1.2 to 1.3 and updated the modification history to document the addition of the aliases table. The <version_id> and <Modification_History> fields of the collection XML files (e.g., mer1_targets.txml, mer1_components.txml, etc.) remain unchanged. The question is whether we also want to update the <version_id> and <Modification_History> fields of the collection XML files. I think the answer is no. Technically, we only changed the bundle (because of the addition of the aliases table), not the collection CSV/XML files. Please let me know what you think. Thanks.
The mer2 test bundle can be found at /home/youlu/MTE/working_dir/aliases_table/mars_target_encyclopedia.
Thanks, @stevenlujpl ! I've updated the description of options for update_sqlite.py in the README here (please check for accuracy when you have a chance):
https://github.com/wkiri/MTE/blob/master/src/README.md
I think we need to change canonical_name() to look up targets in the aliases table rather than using the targettab dictionary.
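A table-backed lookup along these lines might look like the sketch below. The real canonical_name() signature in MTE may differ; taking a connection as the first argument, falling back to the input name when no alias exists, and the in-memory schema are all assumptions made for illustration.

```python
import sqlite3


def canonical_name(conn, target_name):
    """Return the canonical name from the aliases table, or the name itself."""
    row = conn.execute(
        "SELECT canonical_name FROM aliases WHERE target_name = ?",
        (target_name,),
    ).fetchone()
    return row[0] if row else target_name


# Throwaway in-memory table standing in for the MTE DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aliases (target_name TEXT PRIMARY KEY, canonical_name TEXT)")
conn.execute("INSERT INTO aliases VALUES ('POG', 'Pot_of_Gold')")

print(canonical_name(conn, "POG"))       # Pot_of_Gold
print(canonical_name(conn, "Comanche"))  # Comanche (no alias entry)
```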
For the bundle and versioning, I think we do need to update the version of the collection inventory XML files for MPF and PHX. Currently they indicate that in v1.2 we added the contains, has_property, components, and properties tables. It makes sense to add a v1.3 entry for the addition of the aliases table, since the inventory will have a new member (aliases.csv).
Technically we would not need to change the version of the unaffected files, like mpf_targets.csv (.xml). But it looks like we did change it for v1.2 even though it was not affected:
and currently the full version history of the bundle is in the new file, mpf_aliases.xml (was that intentional?):
It may make sense for mpf_aliases.xml to be at v1.0, while mpf_targets.xml might more accurately be not v1.2 but v1.1 (because using aliases may change the total number of targets in the mentions table), while mpf_properties.xml (etc.) should be at their own v1.0 even though they were introduced in v1.2 of the bundle. But this could get quite confusing. The simpler route is to advance all of the file versions to 1.3 so we know they are the versions that were included with v1.3 of the bundle. I think this may have guided the logic of our v1.2 versioning as well.
When processing MPF aliases, I discovered that we need to support UTF-8 names as well (Soufflé).
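One possible way to handle accented names is sketched below. The thread does not say what actually failed, so this rests on two assumptions: that aliases.csv should be read as UTF-8, and that names should be Unicode-normalized so that composed and decomposed forms of "é" compare equal. The read_aliases_utf8 helper is hypothetical.

```python
import csv
import io
import unicodedata


def read_aliases_utf8(stream):
    """Read alias rows, normalizing names to NFC so accented forms compare equal."""
    return {
        unicodedata.normalize("NFC", verbatim): unicodedata.normalize("NFC", canonical)
        for verbatim, canonical in csv.reader(stream)
    }


# With a real file: open("aliases.csv", newline="", encoding="utf-8")
decomposed = "Souffle\u0301"  # 'e' followed by a combining acute accent
composed = "Souffl\u00e9"     # single precomposed character
aliases = read_aliases_utf8(io.StringIO(decomposed + "," + composed + "\n"))
print(aliases)
```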
This would allow us to link known aliases of the same target, like Jake_M and Jake_Matijevic.