wlpotter / csv-to-srophe

A set of XQuery modules for converting CSV data to Srophe-compliant TEI XML records. Developed for Syriaca.org.
GNU General Public License v3.0

Identical bibl information not being merged correctly #43

Closed wlpotter closed 2 years ago

wlpotter commented 2 years ago

For example, in person 3788 we have the following two bibls generated:

          <bibl xml:id="bib3788-1">
            <ptr target="http://syriaca.org/bibl/671"/>
            <citedRange unit="p">277 [447]</citedRange>
          </bibl>
          <bibl xml:id="bib3788-2">
            <ptr target="http://syriaca.org/bibl/671"/>
            <citedRange unit="p">277 [447]</citedRange>
          </bibl>

The script should generate only distinct tei:bibl elements, so these two should have been merged into one.

wlpotter commented 2 years ago

I believe this happens because the citationUnit is left empty in one case and filled in with "p" in the other. When functx:distinct-deep is run on the sources index, the two entries are not deep-equal, so both are kept.
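
To illustrate the suspected cause, here is a minimal Python sketch (not the module's XQuery). `deep_equal` and `distinct_deep` are stand-ins I wrote for the deep-equality comparison that functx:distinct-deep performs, and the `<source>` element layout is an assumption made for the example, not the module's actual index structure.

```python
import xml.etree.ElementTree as ET

def deep_equal(a, b):
    """Compare two elements by tag, attributes, trimmed text, and children (in order)."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(deep_equal(x, y) for x, y in zip(a, b))

def distinct_deep(elements):
    """Keep the first element of each group of deep-equal elements."""
    kept = []
    for el in elements:
        if not any(deep_equal(el, k) for k in kept):
            kept.append(el)
    return kept

# Hypothetical per-row source entries; only the citationUnit differs.
with_unit = ET.fromstring(
    "<source><uri>http://syriaca.org/bibl/671</uri>"
    "<citedRange>277 [447]</citedRange><citationUnit>p</citationUnit></source>"
)
without_unit = ET.fromstring(
    "<source><uri>http://syriaca.org/bibl/671</uri>"
    "<citedRange>277 [447]</citedRange><citationUnit/></source>"
)

sources = distinct_deep([with_unit, without_unit])
print(len(sources))  # 2: the empty unit and "p" are not deep-equal, so both survive
```

Both entries would later render as `unit="p"` if the bibl creation defaults an empty unit to "p", which would explain why the two output bibls look identical even though the index entries were not.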

One solution would be to add "p" to the citationUnit further up the chain, e.g., if it is empty, make it "p". Or vice versa: strip out any <citationUnit>p</citationUnit> so that it is empty, and rely on the defaulting mechanism in the bibl element creation.

wlpotter commented 2 years ago

I think adding a "p" to empty citationUnits when the source index is created for a given row is the right move here. It could also let us delete the step in the bibl element creation that defaults @unit to "p" on tei:citedRange elements.

wlpotter commented 2 years ago

Hmm, this is proving more difficult than I anticipated...

There are three places where this would need to be addressed:

  1. the generation of the per-row source index
  2. the generation of the per-row data index for a given element sequence (because they also have citation unit data)
  3. the matching of per-row source index with per-row data (need to handle mismatches of "p" and "")

The really tricky case is what to do if there is no citation unit column in the first place, as happens for several of our input cases.
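
One way to think about point 3 and the missing-column case together is to compare entries through a key that treats an absent column, an empty value, and "p" as the same unit. This is a rough Python sketch under that assumption; the dict-based rows, the key names (`uri`, `citedRange`, `citationUnit`), and the helper names are all hypothetical and do not come from the module.

```python
DEFAULT_UNIT = "p"

def unit_key(row, column="citationUnit"):
    """Return a comparable unit: a missing column, None, or "" all count as the default."""
    value = (row.get(column) or "").strip()
    return value or DEFAULT_UNIT

def source_key(row):
    """Build the tuple used to decide whether two citations are the same."""
    return (row["uri"], row["citedRange"], unit_key(row))

def find_source_position(item, sources_index):
    """Return the 1-based position of the matching source entry, or None if absent."""
    wanted = source_key(item)
    for position, source in enumerate(sources_index, start=1):
        if source_key(source) == wanted:
            return position
    return None

sources_index = [
    {"uri": "http://syriaca.org/bibl/671", "citedRange": "277 [447]", "citationUnit": "p"},
]

# The input had no citation unit column at all for this item.
item_without_column = {"uri": "http://syriaca.org/bibl/671", "citedRange": "277 [447]"}
# The column exists but was left empty.
item_with_empty = dict(item_without_column, citationUnit="")

print(find_source_position(item_without_column, sources_index))  # 1
print(find_source_position(item_with_empty, sources_index))      # 1
```

Using `row.get` means the no-column case needs no special branch, at the cost of silently treating a misspelled column name as "no unit given".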

wlpotter commented 2 years ago

I believe I've isolated the problem. I'm making changes to the following three functions:

  1. csv2srophe:create-sources-index-for-row
    • normalize any citationUnit elements that have "pp" to "p"
    • add a citationUnit element that has "p" if one does not exist
  2. csv2srophe:create-bibl-sequence
    • remove the condition for creating the @unit value and instead just pass the citationUnit value (given the change above, this conditional test becomes redundant)
  3. csv2srophe:create-source-attribute-for-element
    • normalize the citationUnit element in $itemData from "pp" to "p"
    • add a citationUnit element to $itemData if there isn't one
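
The changes above can be sketched in Python as follows. This is an illustration of the approach, not the XQuery that was committed: the `<source>` layout and the function names are hypothetical, treating an empty citationUnit like a missing one is my assumption, and duplicates are detected by comparing serialized entries rather than by functx:distinct-deep.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serializes as xml:id

def normalize_citation_unit(entry):
    """Rewrite "pp" to "p" and add a citationUnit of "p" if none (or an empty one) exists."""
    unit = entry.find("citationUnit")
    if unit is None:
        unit = ET.SubElement(entry, "citationUnit")
    text = (unit.text or "").strip()
    unit.text = "p" if text in ("", "pp") else text
    return entry

def distinct_entries(entries):
    """Drop entries whose serialized form has already been seen."""
    seen, kept = set(), []
    for entry in entries:
        key = ET.tostring(entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept

def create_bibl(entry, record_id, position):
    """Build a bibl; the unit is passed straight through, with no defaulting here."""
    bibl = ET.Element("bibl", {XML_ID: f"bib{record_id}-{position}"})
    ET.SubElement(bibl, "ptr", {"target": entry.findtext("uri")})
    cited = ET.SubElement(bibl, "citedRange", {"unit": entry.findtext("citationUnit")})
    cited.text = entry.findtext("citedRange")
    return bibl

base = "<uri>http://syriaca.org/bibl/671</uri><citedRange>277 [447]</citedRange>"
raw_entries = [
    ET.fromstring(f"<source>{base}<citationUnit>p</citationUnit></source>"),
    ET.fromstring(f"<source>{base}</source>"),  # no citationUnit at all
    ET.fromstring(f"<source>{base}<citationUnit>pp</citationUnit></source>"),
]

normalized = [normalize_citation_unit(e) for e in raw_entries]
bibls = [
    create_bibl(e, "3788", i)
    for i, e in enumerate(distinct_entries(normalized), start=1)
]
print(len(bibls))  # 1
```

Because the normalization runs before the distinctness check, the three variants collapse to a single entry, and the bibl creation no longer needs its own default for @unit.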

wlpotter commented 2 years ago

This appears to have worked.