zme1 / toscana

A repository to house research and web development for the Lega Toscana project, led by professor Lina Insana (Spring 2018) and professor Lorraine Denman (Fall 2018), and with consultation from members of the DH Advanced Praxis group at the University of Pittsburgh at Greensburg.
http://toscana.newtfire.org

Lemmatized forms table #54

Closed zme1 closed 5 years ago

zme1 commented 5 years ago

@ebeshero I tried to re-hash my computations of the lemmas in the volume in XSLT, since I am much more fluent with XSLT than I am with XQuery. I made much more progress this afternoon/evening on pulling up information on the terms used and the number of times they're used, but I have one more major obstacle that I have had no luck figuring out today. I wrote some for-each loops over the forms in each year, and the counts of these forms are all accurate. My goal, though, is to only write the terms for words used multiple times and to write an additional line on "words only used once" with a count of single-use lemmas in a given year. Since I've written these lines mostly within that for-each loop, the lemmas are still being processed and rendered individually, and I don't see how I can tell my computer to "count the number of lemmas that are used only once in a year and output that number only once." Here is the XSLT and matching output I have so far:

    <xsl:template match="teiCorpus/teiCorpus">
        <xsl:variable name="year" as="xs:string" select="teiHeader/fileDesc/descendant::date/@when"/>
        <xsl:variable name="lemmas" as="xs:string+"
            select="distinct-values(descendant::w[not(ancestor::foreign)]/@lemma)"/>
        <tr>
            <td>
                <xsl:value-of select="$year"/>
            </td>
            <td>
                <ul>
                    <xsl:for-each select="$lemmas">
                        <xsl:variable name="currentLemma" select="."/>
                        <xsl:variable name="currentYear"
                            select="$root/teiCorpus/teiCorpus[teiHeader/fileDesc/descendant::date/@when eq $year]"/>
                        <xsl:variable name="lemmaMatch"
                            select="$currentYear/descendant::w[@lemma eq $currentLemma]"/>
                        <xsl:variable name="matchCount" select="count($lemmaMatch)"/>
                        <xsl:if test="$matchCount gt 1">
                            <li>
                                <xsl:value-of select="concat($currentLemma, ': ', $matchCount)"/>
                            </li>
                        </xsl:if>
                    </xsl:for-each>
                </ul>
            </td>
        </tr>
    </xsl:template>

     <tr>
         <td>1920</td>
         <td>
            <ul>
               <li>dollar: 3</li>
               <li>box: 4</li>
               <li>candy: 3</li>
               <li>floor: 2</li>
               <li>manager: 2</li>
               <li>chairman: 3</li>
               <li>committee: 2</li>
            </ul>
         </td>
      </tr>

While it looks as I expected, I haven't incorporated the single-use terms yet. The closest I've gotten as of yet is with this additional line below, which I wrote immediately below the xsl:if statement in my XSLT file:

<li><xsl:value-of select="count($currentYear/$lemmas[$matchCount eq 1])"/></li>

Which produces the following:

      <tr>
         <td>1920</td>
         <td>
            <ul>
               <li>16</li>
               <li>16</li>
               <li>dollar: 3</li>
               <li>0</li>
               <li>16</li>
               <li>box: 4</li>
               <li>0</li>
               <li>16</li>
               <li>candy: 3</li>
               <li>0</li>
               <li>16</li>
               <li>16</li>
               <li>floor: 2</li>
               <li>0</li>
               <li>manager: 2</li>
               <li>0</li>
               <li>16</li>
               <li>chairman: 3</li>
               <li>0</li>
               <li>16</li>
               <li>committee: 2</li>
               <li>0</li>
               <li>16</li>
            </ul>
         </td>
      </tr>

That returns the count of single-use words, but both too many times and not enough times to be correct. It returns too many because I'd only like one line for all of them (I realize that I'm processing this in a for-each loop, so it'll return one li for each lemma), and it doesn't return enough because, for some reason, it will return either a 0 or the correct number, in no discernible pattern... All of my other attempts have been overly convoluted and not at all correct, so I figured I'd take to GitHub!

ebeshero commented 5 years ago

@zme1 If I understand this right, you want different processing behavior for lemmas when they appear more than once per year vs when they appear just once. I see that you are processing this in a template that matches on teiCorpus elements by year, and every time you stop in a teiCorpus element for a given year, you are for-eaching over distinct-values of the lemmas. The way you are trying to retrieve a count of terms used only once in a year is problematic because when you are in the for-each loop, you are processing one word at a time, so your attempt to count all the words used only once in a year is happening term-by-term instead of year-by-year.
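A note on the 0-or-16 pattern in the output above: inside the loop, `$matchCount` is one fixed number for the lemma currently being processed, so the predicate `[$matchCount eq 1]` is either true for every item in `$lemmas` or false for every item. The count is therefore the total number of distinct lemmas when the current lemma is a singleton, and 0 otherwise. The following is a minimal sketch of that behavior, assuming an XSLT 2.0 processor; the inline sample data, the `year` wrapper element, and the `main` template name are hypothetical and not taken from the project files.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">

    <xsl:output method="text"/>

    <!-- Hypothetical sample data standing in for one year's words. -->
    <xsl:variable name="currentYear">
        <year>
            <w lemma="box"/>
            <w lemma="box"/>
            <w lemma="candy"/>
            <w lemma="floor"/>
        </year>
    </xsl:variable>

    <xsl:variable name="lemmas" as="xs:string*"
        select="distinct-values($currentYear//w/@lemma)"/>

    <xsl:template match="/" name="main">
        <xsl:for-each select="$lemmas">
            <!-- Sorted only so the output order is predictable. -->
            <xsl:sort select="."/>
            <xsl:variable name="matchCount"
                select="count($currentYear//w[@lemma eq current()])"/>
            <!-- $matchCount does not depend on the item being filtered, so the
                 predicate keeps either every lemma or none of them. -->
            <xsl:value-of
                select="concat(., ': ', count($lemmas[$matchCount eq 1]), '&#10;')"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

With this sample data the lines should read `box: 0`, `candy: 3`, `floor: 3`: the repeated lemma yields 0 and each singleton yields the count of all three distinct lemmas, not the count of singletons. By the same reasoning, the 16 in the 1920 output would be the number of distinct lemmas that year rather than the number used once.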

I think you want to process this information on lemmas in two different ways in this template rule. Don’t process those singleton lemmas in your for-each loop, since you only want that detailed info (word and its count) to be listed if it is repeated, right? So, for-each is only appropriate for lemmas when their count in the current teiCorpus is greater than one.

You need to be able to distinguish lemmas that only appear once from those that appear multiple times. It sounds like you want each table cell to contain a given year’s list of multiply used words, followed by a summary count of the words used just once. Process that outside your xsl:for-each statement that lists each word one by one.

This raises the question of how to separate the two kinds of lemmas. We understand how to loop over all of them, and how to test, one at a time, whether the individual members of a distinct-values list have a count() greater than one, vs equal to one when you map them back on the tree. We should be able to separate these into two different variables: 1) a variable for multiply-used lemmas only, 2) a variable for singletons only.

Process the first variable with your xsl:for-each because you want to list out these terms one by one.

Process the second simply as a count() of all its members.

Before we go any further with this, let me know if I understand what you’re trying to do...if I do, I think we may want to work with the XPath form of a for-loop to construct your variables. That is, for $i in $lemmas return $i[count(TREE//w[@lemma = $i]) gt 1], where TREE stands for whatever node you are counting within.

Mapping back to the tree may be a little tricky. I think you want to define a variable just as you open your template rule that holds the current context node, literally the teiCorpus you are currently matching. Try this:

<xsl:template match="teiCorpus/teiCorpus">
       <xsl:variable name="current" select="current()"/>

<!-- Define lemma variables, process, etc. Use $current as a handy way
 to map from distinct-values back to the tree when you need it.  -->

</xsl:template>

zme1 commented 5 years ago

@ebeshero So, right now the lemmas are all included in one global variable at the top of my document. Would it be easier, then, if I broke that into two local variables -- with one that contained duplicated lemmas and another that had the singletons?

zme1 commented 5 years ago

@ebeshero Scratch that, it is a local variable... The issue I had yesterday with declaring these variables is that I couldn't find a way to declare a variable based on occurrences of an attribute value without going through a for-loop. So what I did in the variable above is just group all the lemmas together and then count them from that point. This is most certainly where the issue resides, because then even the singletons are all processed and rendered individually, regardless of what else I write below (since they still fall in that for-loop).

ebeshero commented 5 years ago

@zme1 I've just pulled in your code and am taking a look...let me try a couple of things on my branch to see if I've understood the issues with your for-looping...

ebeshero commented 5 years ago

@zme1 Aha! I've learned something interesting here. I'm messing around with the XPath version of the loop (for $i in ... return $i) and the XSLT version of it (xsl:for-each), and they behave differently! I had much better luck defining variables with xsl:for-each loops inside.

So you might try defining a variable for multiply-appearing lemmas like this:

<xsl:variable name="multiLemmas" as="xs:string*">
    <xsl:for-each select="$distLemmas">
     <xsl:if test="count($current//w[not(ancestor::foreign) and @lemma eq current()]) gt 1">
        <xsl:value-of select="current()"/>
      </xsl:if>
    </xsl:for-each>
 </xsl:variable>

You can do something very similar for singleton lemmas. NOTICE: when you declare the type in the as attribute, be sure to use xs:string* instead of xs:string+. Why? I noticed that in at least one year, you don't have any singleton lemmas at all. The * occurrence indicator allows zero or more items, while + requires at least one, so with + the processor will raise a type error on the variable definition whenever the sequence comes back empty.
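Putting the two variables together, the whole template rule might look something like the sketch below. This is not the code from the pull request; it is one possible arrangement under some assumptions: an XSLT 2.0 processor, TEI elements in the TEI namespace (hence the `xpath-default-namespace` attribute), a hypothetical `table` wrapper, and a `$singleLemmas` variable and "words used only once" label that are illustrative names only.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs" version="2.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Assumed wrapper: one table, one row per year-level teiCorpus. -->
    <xsl:template match="/">
        <table>
            <xsl:apply-templates select="teiCorpus/teiCorpus"/>
        </table>
    </xsl:template>

    <xsl:template match="teiCorpus/teiCorpus">
        <!-- Context node for this year, for use inside the for-each loops. -->
        <xsl:variable name="current" select="current()"/>
        <xsl:variable name="distLemmas" as="xs:string*"
            select="distinct-values($current//w[not(ancestor::foreign)]/@lemma)"/>
        <!-- Lemmas used more than once in this year. -->
        <xsl:variable name="multiLemmas" as="xs:string*">
            <xsl:for-each select="$distLemmas">
                <xsl:if test="count($current//w[not(ancestor::foreign) and @lemma eq current()]) gt 1">
                    <xsl:value-of select="current()"/>
                </xsl:if>
            </xsl:for-each>
        </xsl:variable>
        <!-- Lemmas used exactly once; may be empty, hence xs:string*. -->
        <xsl:variable name="singleLemmas" as="xs:string*">
            <xsl:for-each select="$distLemmas">
                <xsl:if test="count($current//w[not(ancestor::foreign) and @lemma eq current()]) eq 1">
                    <xsl:value-of select="current()"/>
                </xsl:if>
            </xsl:for-each>
        </xsl:variable>
        <tr>
            <td>
                <xsl:value-of select="teiHeader/fileDesc/descendant::date/@when"/>
            </td>
            <td>
                <ul>
                    <!-- One list item per repeated lemma, with its count. -->
                    <xsl:for-each select="$multiLemmas">
                        <li>
                            <xsl:value-of select="concat(., ': ',
                                count($current//w[not(ancestor::foreign) and @lemma eq current()]))"/>
                        </li>
                    </xsl:for-each>
                    <!-- Outside the loop: a single summary line for singletons. -->
                    <li>
                        <xsl:text>words used only once: </xsl:text>
                        <xsl:value-of select="count($singleLemmas)"/>
                    </li>
                </ul>
            </td>
        </tr>
    </xsl:template>
</xsl:stylesheet>
```

The point of the arrangement is that the summary `li` sits after the closing `xsl:for-each`, so it is written once per year rather than once per lemma.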

ebeshero commented 5 years ago

@zme1 Note: My $current variable above is defined like I mentioned earlier: It's just the current xsl:template match value, and I defined it as the first variable in the template rule, like this:

 <xsl:template match="teiCorpus">
        <!--ebb: Current context node, stored in a variable, for use in for-each loops: -->
        <xsl:variable name="current" select="current()"/>

<!-- variable stuff and xsl:for-each processing -->

</xsl:template>

zme1 commented 5 years ago

@ebeshero I always forget that the xsl:variable elements can contain child elements...

I am much closer to a solution now that you've helped me process the two categories of lemmas independently, but my counts still look off with the "single lemmas." For instance, there are five uses of three lemmas in 1919 (one word three times, two words twice), but my single lemmas variable in 1919 says that there were three lemmas used only once. We're almost there! I've just pushed a fresh copy in case you wanted to check it out. It looks like, for some reason, it's counting all the distinct lemmas, regardless of the number of times they're repeated.

ebeshero commented 5 years ago

@zme1 I wound up writing my own XSLT to see how to process this--it's in my branch and I just made a pull request: https://github.com/zme1/toscana/pull/55

I think this code is working, but see what you think...

zme1 commented 5 years ago

@ebeshero Your output looks perfect, and I'll pull it into my repository this afternoon. What I wanted to do first, though, is figure out where my code goes wrong. So far, I'm having trouble determining where mine breaks... Once I figure that out, I'll work your copy over into the master branch.

It seems as though our variables are declared in similar ways, and most, if not all, of our differences in declarations are interchangeable. Nevertheless, I've modified my own code to reflect some of yours to see if that has any impact on my own output, but to no avail. If you have time, would you mind looking over my newly pushed copy of the top-lemma.xsl file and comparing it to your branched copy? The bug is in there somewhere and I'm sure I'll nail it down before the evening comes, but feel free to check it out yourself to poke around. I'll let you know where exactly I went wrong when I figure it out.

Thank you so much, @ebeshero, for generating this output for me. I'll be combing through your XSLT carefully this afternoon!

ebeshero commented 5 years ago

Sure—I thought you might want more explanation of what went wrong, but for me to figure it out, I needed to write my own code first. I’ll post some comments on your code soon. Elisa

zme1 commented 5 years ago

@ebeshero I finally found my error, and it was incredibly obscured in my code.... Drumroll, please!

Here was my original variable declaration for Singleton Lemmas:

        <xsl:variable name="singleLemmas" as="xs:string*">
            <xsl:for-each select="$distinctLemmas">
                <xsl:if
                    test="count($currentYear//w[not(ancestor::foreign) and @lemma eq current()]) eq 1"/>
                <xsl:value-of select="current()"/>
            </xsl:for-each>
        </xsl:variable>

The xsl:if statement in my variable was empty: I had closed it with /> so the xsl:value-of sat after it rather than inside it. The count test was evaluated every time, but it had no effect on anything. Thus, all the distinct lemmas were output regardless of their count... That was a particularly difficult error to find, even with your unbroken code in the same viewing window!
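For comparison, here is a small side-by-side sketch of the empty xsl:if and the repaired version. It assumes an XSLT 2.0 processor; the inline sample data, the `year` wrapper element, the `$broken` and `$fixed` variable names, and the `main` template name are hypothetical, and the `not(ancestor::foreign)` filter is left out to keep the example short.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">

    <xsl:output method="text"/>

    <!-- Hypothetical sample data: "box" twice, "candy" and "floor" once each. -->
    <xsl:variable name="currentYear">
        <year>
            <w lemma="box"/>
            <w lemma="box"/>
            <w lemma="candy"/>
            <w lemma="floor"/>
        </year>
    </xsl:variable>

    <xsl:variable name="distinctLemmas" as="xs:string*"
        select="distinct-values($currentYear//w/@lemma)"/>

    <xsl:template match="/" name="main">
        <!-- Broken: the xsl:if is self-closed, so it guards nothing and
             every distinct lemma is emitted. -->
        <xsl:variable name="broken" as="xs:string*">
            <xsl:for-each select="$distinctLemmas">
                <xsl:if test="count($currentYear//w[@lemma eq current()]) eq 1"/>
                <xsl:value-of select="current()"/>
            </xsl:for-each>
        </xsl:variable>
        <!-- Fixed: the xsl:value-of is a child of xsl:if, so only lemmas
             with exactly one match are emitted. -->
        <xsl:variable name="fixed" as="xs:string*">
            <xsl:for-each select="$distinctLemmas">
                <xsl:if test="count($currentYear//w[@lemma eq current()]) eq 1">
                    <xsl:value-of select="current()"/>
                </xsl:if>
            </xsl:for-each>
        </xsl:variable>
        <xsl:value-of
            select="concat('broken: ', count($broken), ', fixed: ', count($fixed))"/>
    </xsl:template>
</xsl:stylesheet>
```

With the sample data this should print `broken: 3, fixed: 2`: the broken variable holds all three distinct lemmas, while the fixed one holds only the two singletons. An empty xsl:if is legal XSLT, so the processor gives no warning, which is likely part of why the bug was so hard to see.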

ebeshero commented 5 years ago

@zme1 Wow! I don’t think I spotted that—I just thought there was something wrong with the for-each levels—with what you’d first selected to loop over. I’m glad you found your bug!