wlpotter / csv-to-srophe

A set of XQuery modules for converting CSV data to Srophe-compliant TEI XML records. Developed for Syriaca.org
GNU General Public License v3.0

taxonomy transform should build the taxonomy index #37

Closed wlpotter closed 2 years ago

wlpotter commented 2 years ago

Cf. https://github.com/srophe/srophe-app-data/blob/master/data/subjects/taxonomyIndex.xml for the desired output.

Cf. https://github.com/srophe/srophe-app-data/blob/master/data/subjects/taxonomyIndex.xsl for the XSLT that currently builds this from the records.

Priority as part of #6 . Related to #3

wlpotter commented 2 years ago

@dlschwartz as we discussed I will implement the XSLT as an XQuery module for the transform and have the script take over generating the index.

I do wonder, though, if we still want to have a separate index of subjects like those for persons, places, etc. (a flat index containing just the URIs that exist on the server). Does the index which the XSLT makes have all of the keywords in it? In other words, could we use it to check whether a subject already exists, or is it just a quick reference for hierarchical relations?

dlschwartz commented 2 years ago

@wlpotter the current index includes a number of different ways of grouping the keywords, followed (starting here: https://github.com/srophe/srophe-app-data/blob/master/data/subjects/taxonomyIndex.xml#L358) by a list of all keywords, whether or not they appear above. I don't have a strong opinion about whether we maintain that practice or have two files, one containing groupings and another containing a simple list.

wlpotter commented 2 years ago

Ah, I suppose I should have looked closer at the data 😆

From the perspective of consistency and simplicity, I would prefer having the index of all URIs built by the srophe app in the same way that the persons, et al. are built. (This also allows all entity-types to be treated the same way in the transform for #3 )

But, if you use the listURI[@type="taxonomyAllURIs"] at line 358, I'm happy to have that built along with the groupings. Maybe it lives here as well as alongside the other indices?

dlschwartz commented 2 years ago

@wlpotter that's a good point. I trust it's effortless to make two files? The grouped index (including the listURI[@type="taxonomyAllURIs"]) could really just be the index used for SPEAR and the other would be used as the main index of keyword URIs that parallels the other indices.
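As an illustration of the "does this subject already exist?" check discussed above, here is a minimal sketch in Python (not the project's XQuery). It assumes the index has no XML namespace, as in the snippets quoted in this thread; the sample data and the function names `load_all_uris` and `subject_exists` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated index; the real file lists many more URIs.
SAMPLE_INDEX = """
<taxonomyIndex>
  <listURI type="taxonomyAllURIs">
    <uri>http://syriaca.org/keyword/alliance-with</uri>
    <uri>http://syriaca.org/keyword/commemorates</uri>
  </listURI>
</taxonomyIndex>
"""


def load_all_uris(index_xml):
    """Collect every URI from the flat taxonomyAllURIs list into a set."""
    root = ET.fromstring(index_xml)
    flat = root.find("listURI[@type='taxonomyAllURIs']")
    if flat is None:
        return set()
    return {u.text.strip() for u in flat.findall("uri") if u.text}


def subject_exists(uri, known_uris):
    """Return True if the keyword URI is already present in the index."""
    return uri in known_uris


known_uris = load_all_uris(SAMPLE_INDEX)
print(subject_exists("http://syriaca.org/keyword/commemorates", known_uris))
print(subject_exists("http://syriaca.org/keyword/not-a-keyword", known_uris))
```

If the real index is namespaced, the `find` and `findall` paths would need a namespace mapping.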

wlpotter commented 2 years ago

@dlschwartz can we revisit this? I'm getting hung up on the XQuery given the changes we've made to the data model.

For instance, https://github.com/srophe/srophe-app-data/blob/master/data/subjects/taxonomyIndex.xml#L335-L342 has the list of http://syriaca.org/keyword/personal-relationships URIs with the @ana attribute and the SNAP relationship (though in some cases, e.g. https://github.com/srophe/srophe-app-data/blob/master/data/subjects/taxonomyIndex.xml#L345, we use a syriaca: prefix?).

Do we want to keep these formatted this way rather than, e.g.,

      <listURI ref="http://syriaca.org/keyword/personal-relationships">
         <uri ana="mutual">http://syriaca.org/keyword/alliance-with</uri>
         <uri ana="mutual">http://syriaca.org/keyword/casual-intimate-relationship-with</uri>
         ...
      </listURI>

We have the SNAP relationships now encoded as skos:closeMatch or skos:broadMatch within the records (rather than previously as tei:idno elements). I could have the script replicate the current index, but wanted to check in first to make sure there isn't a preferred format.

dlschwartz commented 2 years ago

@wlpotter I'm so sorry! I messed this up. Somehow I sent you to the index in the master branch when I'm currently validating against the index in the dev branch. The dev branch has full URIs for everything: https://github.com/srophe/srophe-app-data/blob/dev/data/subjects/taxonomyIndex.xml#L345.

Also, in the example in your post above, it has the @ref http://syriaca.org/keyword/personal-relationships (in the plural). I've removed these "category" relationships in favor of a more streamlined hierarchy that keeps closer to the SNAP ontology. (In fact, I've completely removed "personal-relationships" for even more complicated reasons.)

A better/current example is "religious-relationship" (formerly "religious-relationships"). This is listed in columns AG:AL as a skosBroader for the following: clerical-relationship, monastic-relationship, commune-together, confessor-for, and commemorates. This should output the following:

<listURI ref="http://syriaca.org/keyword/religious-relationship">
     <uri ana="mutual">http://syriaca.org/keyword/clerical-relationship</uri>
     <uri ana="mutual">http://syriaca.org/keyword/monastic-relationship</uri>
     <uri ana="mutual">http://syriaca.org/keyword/commune-together</uri>
     <uri ana="mutual">http://syriaca.org/keyword/confessor-for</uri>
     <uri ana="mutual">http://syriaca.org/keyword/commemorates</uri>
</listURI>

I'm guessing the script is correctly outputting this but perhaps the last time you ran the script was before I had made some fairly recent changes. The most up to date version is here: https://docs.google.com/spreadsheets/d/14jU8K-hjFH193zsqXzrdYPYfx2HFqttX0TPScdsbukA/edit#gid=959652535. Can you run that again and we can test the output? Thanks Will!

dlschwartz commented 2 years ago

Sorry, this is easier to look at

<listURI ref="http://syriaca.org/keyword/religious-relationship"> 
     <uri ana="mutual">http://syriaca.org/keyword/clerical-relationship</uri> 
     <uri ana="mutual">http://syriaca.org/keyword/monastic-relationship</uri> 
     <uri ana="mutual">http://syriaca.org/keyword/commune-together</uri> 
     <uri ana="mutual">http://syriaca.org/keyword/confessor-for</uri> 
     <uri ana="mutual">http://syriaca.org/keyword/commemorates</uri> 
</listURI>

dlschwartz commented 2 years ago

@wlpotter it looks like the latest commit has the correct data and I was just looking at an earlier commit.

This looks correct: https://github.com/wlpotter/csv-to-srophe/tree/main/out/subjects/2022-01-06

Just a minute ago I was looking at the link from your email: https://github.com/wlpotter/csv-to-srophe/tree/main/test/out/csv-tests/subjects. This has the old relationships that we've dropped.

dlschwartz commented 2 years ago

Generating the index from this correct commit should produce the properly formatted index. But let's confirm. Thanks.

wlpotter commented 2 years ago

@dlschwartz Ah, thank you for this clarification! This should really help streamline the way the index is generated, as the version on dev is closer to what I was expecting (a set of listURI elements that list the URIs of keywords that have a skos:broader relation to the listURI/@ref).
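The grouping described here (one listURI per broader concept, listing the keywords that point to it) can be sketched roughly as follows. This is a Python illustration under assumptions, not the XQuery module itself: the `ROWS` tuples stand in for spreadsheet data, and `build_list_uris` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

KEYWORD_BASE = "http://syriaca.org/keyword/"

# Assumed row shape: (keyword slug, skos:broader slug, @ana value).
ROWS = [
    ("clerical-relationship", "religious-relationship", "mutual"),
    ("monastic-relationship", "religious-relationship", "mutual"),
    ("commune-together", "religious-relationship", "mutual"),
    ("confessor-for", "religious-relationship", "mutual"),
    ("commemorates", "religious-relationship", "mutual"),
]


def build_list_uris(rows):
    """Group keywords by their broader concept and emit one listURI per group."""
    grouped = defaultdict(list)
    for slug, broader, ana in rows:
        grouped[broader].append((slug, ana))

    elements = []
    for broader, children in grouped.items():
        list_uri = ET.Element("listURI", ref=KEYWORD_BASE + broader)
        for slug, ana in children:
            uri = ET.SubElement(list_uri, "uri", ana=ana)
            uri.text = KEYWORD_BASE + slug
        elements.append(list_uri)
    return elements


for element in build_list_uris(ROWS):
    ET.indent(element, space="     ")  # requires Python 3.9+
    print(ET.tostring(element, encoding="unicode"))
```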

I can circle back to the index output and make sure it is producing what's on dev.

Also, apologies for sharing the wrong link in the email...I forgot I had treated that as 'real' output rather than test output. So, yes, the data is at https://github.com/wlpotter/csv-to-srophe/tree/main/out/subjects/2022-01-06

dlschwartz commented 2 years ago

@wlpotter thanks! I'm pretty optimistic that everything (except schema) should be in good shape with little or no changes. Thanks.

wlpotter commented 2 years ago

@dlschwartz I think this is working, but I'm still not getting the relationships. I think I need to make them singular and not plural?

I've moved the taxonomy outline to a 'config' xml file, here. This file essentially recreates the outline of the index without filling in the matching data (that's what the script adds). If I understand correctly, the URIs like line 66, "religious-relationships" should be made singular?

If you'd like, you can copy that file and send back how the taxonomy outline should look (or clone/fork the repo and create a pull request).

Otherwise the taxonomy is outputting the way we want; I just need to have it save to a file rather than print to the console.

wlpotter commented 2 years ago

Hmm, so it doesn't look like it's just singular and plural issues. For instance, event-relationships appear to have become "related-event", which is nested now in "link" rather than "relationships". I think I'm still missing the most recent changes to the data?

dlschwartz commented 2 years ago

@wlpotter so I've gone through the old index and figured out the categories I need in this index and how to generate this index from the new spreadsheet. Unfortunately, this is kind of clumsy. I'm very open to thinking about other ways to do this, including creating new columns. Hopefully this gets us started though.

What I've got below relies on existing spreadsheet columns:

In the end, the index would look like this:

<taxonomyIndex>
     <listURI type="ethnicity"> 
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/ethnicity]</uri>
     </listURI>
     <listURI type="fields-of-study">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/fields-of-study]</uri>
     </listURI>
     <listURI type="languages">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/languages]</uri>
     </listURI>
     <listURI type="mental-states">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/mental-states]</uri>
     </listURI>
     <listURI type="occupations">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/occupations]</uri>
     </listURI>
     <listURI type="sanctity">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/sanctity]</uri>
     </listURI>
     <listURI type="socioeconomic-status">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/socioeconomic-status]</uri>
     </listURI>
     <listURI type="related-event">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/related-event]</uri>
     </listURI>
     <listURI type="relationships">
          <uri ana="[content of column G, either "mutual" or "directed"]">
                [each URI with skos:Broader = 
                     http://syriaca.org/keyword/extended-household-of
                     http://syriaca.org/keyword/slave-of
                     http://syriaca.org/keyword/household-of
                     http://syriaca.org/keyword/emnity-for
                     http://syriaca.org/keyword/sender-of-letter-to
                     http://syriaca.org/keyword/alliance-with
                     http://syriaca.org/keyword/kin-of
                     http://syriaca.org/keyword/family-of
                     http://syriaca.org/keyword/hereditary-family-of
                     http://syriaca.org/keyword/extended-family-of
                     http://syriaca.org/keyword/descendent-of
                     http://syriaca.org/keyword/ancestor-of
                     http://syriaca.org/keyword/serious-intimate-relationship-with
                     http://syriaca.org/keyword/legally-recognized-relationship-with
                     http://syriaca.org/keyword/professional-relationship
                     http://syriaca.org/keyword/military-relationship
                     http://syriaca.org/keyword/legal-relationship
                     http://syriaca.org/keyword/colleague-of
                     http://syriaca.org/keyword/religious-relationship
                     http://syriaca.org/keyword/monastic-relationship
                     http://syriaca.org/keyword/clerical-relationship
                     http://syriaca.org/keyword/bishop-over
                     http://syriaca.org/keyword/intellectual-relationship
                     http://syriaca.org/keyword/cited
               ]
          </uri>
     </listURI>
     <listURI type="qualifier-relationship">
          <uri>[each URI with skos:Broader = http://syriaca.org/keyword/qualifier-relationship]</uri>
     </listURI>
     <listURI type="taxonomyAllURIs">
          <uri>[each URI]</uri>
     </listURI>
</taxonomyIndex>

wlpotter commented 2 years ago

@dlschwartz I think I've got the script generating the index the way you specified in your most recent comment.

Here is a sample output: https://raw.githubusercontent.com/wlpotter/csv-to-srophe/main/test/2022-03-03_test-taxonomy-index-output.xml

Let me know if that looks like it's working. If so, I can walk you through how the taxonomy config file works in case you need to update the selected categories, etc.
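For readers following along, the index layout specified above (simple categories selected by one skos:broader value, a combined "relationships" list with @ana, and a flat list of every URI) could be approximated like this. It is a hedged Python sketch, not the actual script: `SIMPLE_TYPES`, `RELATIONSHIP_BROADER`, and the `KEYWORDS` rows are abbreviated stand-ins for the taxonomy config and spreadsheet, and "example-ethnicity" is an invented placeholder keyword.

```python
import xml.etree.ElementTree as ET

BASE = "http://syriaca.org/keyword/"

# Assumed stand-in for the taxonomy config (abbreviated).
SIMPLE_TYPES = ["ethnicity", "occupations"]
RELATIONSHIP_BROADER = {"religious-relationship", "kin-of"}

# Hypothetical keyword rows: slug, skos:broader slug, and "mutual"/"directed".
KEYWORDS = [
    {"slug": "example-ethnicity", "broader": "ethnicity", "ana": None},
    {"slug": "clerical-relationship", "broader": "religious-relationship", "ana": "mutual"},
    {"slug": "commemorates", "broader": "religious-relationship", "ana": "mutual"},
    {"slug": "ethnicity", "broader": None, "ana": None},
]


def build_index(keywords):
    """Assemble a taxonomyIndex element from keyword rows."""
    root = ET.Element("taxonomyIndex")

    # One listURI per simple category, filled by matching skos:broader.
    for type_name in SIMPLE_TYPES:
        list_uri = ET.SubElement(root, "listURI", type=type_name)
        for kw in keywords:
            if kw["broader"] == type_name:
                ET.SubElement(list_uri, "uri").text = BASE + kw["slug"]

    # Combined relationships list; each uri carries its @ana value.
    relationships = ET.SubElement(root, "listURI", type="relationships")
    for kw in keywords:
        if kw["broader"] in RELATIONSHIP_BROADER:
            uri = ET.SubElement(relationships, "uri")
            if kw["ana"]:
                uri.set("ana", kw["ana"])
            uri.text = BASE + kw["slug"]

    # Flat list of every keyword URI, grouped or not.
    all_uris = ET.SubElement(root, "listURI", type="taxonomyAllURIs")
    for kw in keywords:
        ET.SubElement(all_uris, "uri").text = BASE + kw["slug"]

    return root


index = build_index(KEYWORDS)
ET.indent(index, space="     ")  # requires Python 3.9+
print(ET.tostring(index, encoding="unicode"))
```

Keeping the category lists in data rather than in code is what lets a later change (such as adding another broader concept to "relationships") be a config edit only.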

Note to self: I still need to implement saving the index to a file; it currently just outputs to the console for debugging purposes. (This should be as simple as adding an 'output path' variable to the config or config-taxonomy.)
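A rough sketch of the 'output path' idea, again in Python rather than the project's XQuery. The `<outputPath>` element name and the `save_index` helper are assumptions for illustration only; the real config may name and structure this differently.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def save_index(index_root, config_xml):
    """Write the index to the path named in the config (hypothetical <outputPath>)."""
    config = ET.fromstring(config_xml)
    out_path = Path(config.findtext("outputPath"))
    out_path.parent.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(index_root).write(
        str(out_path), encoding="utf-8", xml_declaration=True
    )
    return out_path


# Minimal placeholder index so the example is self-contained.
index_root = ET.Element("taxonomyIndex")
ET.SubElement(index_root, "listURI", type="taxonomyAllURIs")

with tempfile.TemporaryDirectory() as tmp:
    config_xml = (
        "<config><outputPath>"
        + str(Path(tmp) / "index" / "taxonomyIndex.xml")
        + "</outputPath></config>"
    )
    saved = save_index(index_root, config_xml)
    print(saved.read_text(encoding="utf-8"))
```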

wlpotter commented 2 years ago

Also, I know you mentioned that we need to re-run the transform to catch some new data changes. Once we get the index working satisfactorily I'll pull down the most recent spreadsheet data and generate a new batch of XML files and an updated index.

dlschwartz commented 2 years ago

@wlpotter Fantastic! This looks perfect. Thanks so much. Before writing back I thought briefly of trying to create keywords for place types and for religious confessions, but that's not going to happen. I think we can run this now. Will this be set up in a way that I can run this transform? Should I learn how to do that? Thanks Will!

wlpotter commented 2 years ago

Yes, it should be set up so that you can run it after adjusting the configuration settings (which are in an XML document). I will go back through the documentation to make sure it's up to date, then we can walk through how to run the transform.

wlpotter commented 2 years ago

The files here should be up to date. The index is there as well under https://github.com/wlpotter/csv-to-srophe/blob/main/out/subjects/2022-03-10/index/taxonomyIndex.xml.

I am opening an issue on the srophe app repository for Syriaca (https://github.com/srophe/syriaca/issues/20) so we can test the new data there once we've moved it over.

I believe, unless you notice any glaring issues with the most recent output, this issue can be closed?

dlschwartz commented 2 years ago

@wlpotter I'm getting around to testing this and I've found some problems. I didn't include http://syriaca.org/keyword/bond as one of the skos:Broader to grab for <listURI type="relationships">. I also managed to have "emnity" in various places instead of the correctly spelled "enmity". I've fixed the taxonomy data but it would be best if you could generate the index again including keyword/bond. Thanks Will.

wlpotter commented 2 years ago

@dlschwartz I can re-run the index to include bond in the list of skos:Broader concepts under the <listURI type="relationships">.

This made me realize that it would be worth having an additional, stand-alone script that just re-generates the taxonomy index as needed. The main transform script will keep the functionality as well, so you won't generally need to run two scripts. But you can have the option if you are only interested in the taxonomy index.

I will write this script and add instructions to the documentation.

wlpotter commented 2 years ago

@dlschwartz I updated the index and moved it to the main srophe-app-data repository (see this commit). Let me know if you notice anything else.

dlschwartz commented 2 years ago

@wlpotter That all sounds great. I think we might want to keep it in the app documentation. I'll move it over. Thank you!

wlpotter commented 2 years ago

To fix the issue with qualifier-relationships showing up under <listURI type="relationships">, I will make the following changes:

wlpotter commented 2 years ago

@dlschwartz I have added the 'include self' functionality to the index generation. Would you like me to upload the new version to the server or do you want to do one last round of spot-checking?

dlschwartz commented 2 years ago

@wlpotter I'll go ahead and move it over. I'd like to open both files in oxygen and compare them first just to be certain. Thanks!

wlpotter commented 2 years ago

@dlschwartz I just saw some errors (it's accidentally grabbing all the tei:idnos, not just the keyword URI...). I will fix that, update the taxonomy index file, and send you a link.
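To illustrate the kind of filter that avoids picking up every tei:idno, here is a small Python sketch. The record layout is an assumption (heavily abbreviated, with an invented second idno), and `keyword_uri` is a hypothetical helper, not a function from the transform.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
KEYWORD_PREFIX = "http://syriaca.org/keyword/"

# Hypothetical, abbreviated record; real records carry far more structure.
RECORD = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><entryFree>
    <idno type="URI">http://syriaca.org/keyword/commemorates</idno>
    <idno type="URI">http://example.org/other-authority/123</idno>
  </entryFree></body></text>
</TEI>
"""


def keyword_uri(record_xml):
    """Return the first idno that is a Syriaca keyword URI, ignoring other idnos."""
    root = ET.fromstring(record_xml)
    for idno in root.iterfind(".//tei:idno[@type='URI']", TEI_NS):
        value = (idno.text or "").strip()
        if value.startswith(KEYWORD_PREFIX):
            return value
    return None


print(keyword_uri(RECORD))
```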

wlpotter commented 2 years ago

It may have been a false alarm (I've been having to switch back and forth between app-data master and dev for manuscript work, and I think I was running the index generation on old data...)

In any case, the file here should now be updated with the newest version (removed "bond" from the "relationships" listURI, and the "qualifier-relationships" shouldn't be there either)

dlschwartz commented 2 years ago

@wlpotter you've got it regarding "bond" and "qualifier-relationships"! The only odd thing I find is under the "relationships" listURI where there is a "descendent-of" [sic] with no @ana attribute: https://github.com/wlpotter/csv-to-srophe/blob/main/out/subjects/2022-03-10/index/taxonomyIndex.xml#L294. I've looked in the spreadsheet and can't find this misspelling. Can you take a look at this as well? Thanks Will.

wlpotter commented 2 years ago

@dlschwartz Ha! That was a typo in the taxonomy config file...I've updated it and re-run. The same link above should work still. The correct "descendant-of" is there with @ana="directed"
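A quick consistency check could catch this sort of problem (a relationships entry with no @ana) before the index is moved over. The following Python sketch assumes an un-namespaced index like the snippets in this thread; `uris_with_bad_ana` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED_ANA = {"mutual", "directed"}

# Hypothetical sample reproducing the reported symptom.
SAMPLE = """
<taxonomyIndex>
  <listURI type="relationships">
    <uri ana="mutual">http://syriaca.org/keyword/clerical-relationship</uri>
    <uri>http://syriaca.org/keyword/descendent-of</uri>
  </listURI>
</taxonomyIndex>
"""


def uris_with_bad_ana(index_xml):
    """List relationship URIs whose @ana is missing or not an allowed value."""
    root = ET.fromstring(index_xml)
    relationships = root.find("listURI[@type='relationships']")
    if relationships is None:
        return []
    return [
        (u.text or "").strip()
        for u in relationships.findall("uri")
        if u.get("ana") not in ALLOWED_ANA
    ]


for uri in uris_with_bad_ana(SAMPLE):
    print("missing or invalid @ana:", uri)
```

A URI flagged here that has no match in the spreadsheet is a hint that the config, rather than the data, contains the typo.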

dlschwartz commented 2 years ago

@wlpotter Looks great! Thanks. I'll move this over.