zme1 / toscana

A repository to house research and web development for the Lega Toscana project, led by professor Lina Insana (Spring 2018) and professor Lorraine Denman (Fall 2018), and with consultation from members of the DH Advanced Praxis group at the University of Pittsburgh at Greensburg.
http://toscana.newtfire.org

XSLT Multi-step transformation? #23

Closed zme1 closed 6 years ago

zme1 commented 6 years ago

With the activity comparisons, I am thinking I should try to manage my expectations with visualizations, given the amount of time I have left and the number of tasks that remain unfinished. I don't know if I have the time to try to generate a heat map (at least in the next 2-3 weeks), but I think I could generate a stacked bar graph in its stead.

It wouldn't be as precise as a heat map for the data I want to show, but it also would not be as tedious. It could also show some potentially interesting results in its own right, with a visualization that is more compact and that includes all data points in my volume (not just the 40-50 most active members, as my heat map would have done).

I want to try to create a stacked bar chart, with officer activity and general member activity for one full year comprising each bar. The bar heights would all add to 100% activity, but the widths would change according to how many "units" of activity were registered in a certain year. The more active a year, the wider the bar. I think that the nuanced assessment of individual member activity that is lost with this stacked bar chart is replaced by a more all-encompassing look at each year in direct relation to another.

I have some thoughts I wanted to jot down here for clarification or discussion...

If I were to make a stacked bar chart, I think I could most easily do that if I first processed and output the volume to capture every unit of activity that I'm tracking and generate an XML file, containing each unit of activity, the agent of the activity, and whether or not that agent was an officer -- all listed by year (a list within a list, of sorts).

If that seems like a fair approach to this task, I'm wondering if I could generate that in a homemade XML file, outside the bounds of my TEI schema. I only ask this because this seems to me to be a peculiar use of XSLT, and I think that I would have a much easier time generating my own elements, rather than complying with TEI throughout this ordeal. Is it permissible for me to generate a file outside the TEI schema, if I've been working inside the schema until this point in the semester? I'm not sure of the precedent for "cherrypicking" work under TEI like this.

ebeshero commented 6 years ago

@zme1 It's fine with me if you output XML that isn't TEI for use in making graphs and charts! You might even invent a special project namespace for it, with a prefix like zme: or tosc: (and your custom namespace is traditionally a real or made-up URL, just to be distinctive). How do you make your own namespace? Just set it as your output namespace when you're doing pull processing on your code. (Sure, you can just do this in no namespace, too, but rolling your own namespace might be cool.)

<xml xmlns="http://toscana.newtfire.org">
....

</xml>
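As a rough sketch of what "setting the output namespace" can look like in XSLT 2.0: declaring a default namespace on the stylesheet element puts every unprefixed literal result element into it. The element names `activity` and `committee` are made up for illustration, and the sketch assumes the input is in the TEI namespace and marks committees as `list type="committee"`.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns="http://toscana.newtfire.org"
    exclude-result-prefixes="tei">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- Unprefixed literal result elements land in the default
         namespace declared on xsl:stylesheet above. -->
    <activity>
      <!-- Pull processing: reach into the TEI tree for what is needed. -->
      <xsl:for-each select="//tei:list[@type='committee']">
        <!-- "committee" is a hypothetical output element name. -->
        <committee subtype="{@subtype}"/>
      </xsl:for-each>
    </activity>
  </xsl:template>
</xsl:stylesheet>
```

The `exclude-result-prefixes` attribute keeps the unused `tei` declaration out of the output file.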

But I'm sorry you're thinking of abandoning the heat map! I was hoping to see how that turns out and I know David and I would both be happy to help with it. Here's another thought for the heat map--What if you produced it simply as background colors inside table cells? I did that for an old experiment on a Mitford play--it looks kind of hideous and I've long intended to convert it to SVG (one of these days). I was outputting a row of data for every speech division in a play, compared across four versions, and wanted to see for myself how each combination compared with the others and where the biggest differences were... Your data for the heat map may be more complicated but you may want to get a look at the range of numerical differences for each individual member's activity so you have a sense of what numbers should go with "hot" colors and what to go with "cooler" ones.

zme1 commented 6 years ago

@ebeshero I would like to start with this stacked bar-chart; I would like to try making the heat map, though, if I have enough time between the due date for this project and the conference in Ottawa. I want to try to have some semblance of data for each of my major inquiries in the next two weeks, though, and I think that the familiarity of a bar chart could help me along here.

I just pushed some sample xslt code and output into my repo, if you wanted to look at it and see if you think it could work. I haven't included any information on officers just yet; I wanted to get some output first before attempting that. The xslt code is in /xslt/tei/officerGenMember_data.xsl and the output file is in /visualizations/officerGenMember_data.xml, just in case you wanted to quickly see how I plan on compiling this information at this point. The proposal participants are one of 5 or 6 types of appearances that I will consider as explicitly "active."

I may fiddle around with declaring my own namespace in the morning, as well!!!!

djbpitt commented 6 years ago

@zme1 The heat map would be cool because you haven’t done one before, so you’d learn something new. But the starting point might be to think about how heat maps and stacked bar charts differ in what they can express, and in how they express it, and then decide which one is better suited to telling the story you want your visualization to tell. You aren’t limited to just one or the other, of course, except that you are limited by the amount of time you have available. There’s a basic discussion of heat maps at https://www.fusioncharts.com/chart-primers/heat-map-chart/; see also http://www.msktc.org/lib/docs/KT_Toolkit/Charts_and_Graphs/Charts_Tool_Heat_Maps_508c.pdf.

ebeshero commented 6 years ago

@zme1 @djbpitt Thanks for this handy reference! If I could move time and space, I'd squish an SVG exercise to make a heat map into my Coding and Data Viz class right now! As it is, the primer you linked here gives a really good idea of how these can be plotted. As I understand it, one issue with the Lega data is that there's a lot of it, broken down month by month over years--and some of the trouble has to do with summing up continuous activity on a committee even if that isn't indicated in the minutes: For example, people are appointed to Committee X in May, and serve until the committee is dissolved in July. There's data indicating the committee's forming and dissolution, but nothing to indicate continuous activity in between.

Zac and I talked about this a little yesterday evening, that there could be an "XPath solution" wherein, for each person for each month we check to see if that person was appointed to a committee in the preceding months that has not already dissolved, and/or if we can see the dissolution of that committee in the following months. If so, plot this as a point of activity in the current month. The XPath solution is a little complex, but I bet we could figure it out--alternatively, you could write metadata in the TEI headers in intervening months to indicate people active on specific committees--any information that's implicitly understood to be ongoing and not mentioned in the minutes.

I gather the concern here is how to handle weird cases like this--and I sympathize with the time factor, but I also think we can manage this potentially with XPath if the markup for committee membership is clearly signaled!

ebeshero commented 6 years ago

@zme1 For what it's worth, I favor the TEI Header solution: You know your data, and writing in some meta-information about people's implicit activity could be useful for you to output later in your web edition of the Lega documents, as well as serve double duty toward rendering an accurate and data-rich heat map. Also, I wonder if you can get some help from your teammates in encoding that info in the TEI Header--as long as you point out exactly where it's to go?

I gather that working with TEI is also a little exhausting because it takes time to look up the right way to do things--I'm happy to help speed you along with that! I think what we'd need here is to work with the profileDesc with listPerson inside it--something already in your code.

zme1 commented 6 years ago

@ebeshero I think if I put a few hours' work in, I can encode the implicitly active committees into the teiHeader for any given meeting. As for how I will retrieve members from the manually inserted data... Could I potentially add the committee name and end date and use XPath to navigate to the meeting in which those members are listed and return them as units of activity for the present moment in time? One issue I see with that is multiple different committees (with different members) of one type listed in one single meeting getting confused by the XPath searches.

zme1 commented 6 years ago

@ebeshero @djbpitt One solution I'm looking at is simply adding multiple date values to the committee formation dates, but I cannot find any information on precedents or on whether this is even legal for the date element specifically. For example, if there is a ball committee that forms in March for an event in May, it could maybe look like this:

<list type="committee" subtype="ballo"><date when="1919-03 1919-04 1919-05"/>

That way, every time I process a committee list, all the dates of activity and members are in one location. As far as I can tell, though, multiple ISO values are illegal in @when on the date element, or else I haven't figured out how to customize that just yet.

zme1 commented 6 years ago

@djbpitt @ebeshero I think I may have found my solution... It looks like I can use @when-custom to generate a customized attribute value (while still in ISO form). I can include any number of ISO values in this attribute, separated by whitespace. Does this seem like it can work?

ebeshero commented 6 years ago

@zme1 This is exactly how I'd encode multiple attribute values, so you can tokenize them on white space! I'm glad you found @when-custom! I didn't know about that one...I was just going to suggest you may need a Schematron constraint in your ODD to make it possible for there to be multiple values, but @when-custom seems like it'll suit your purposes. You still might want a Schematron constraint to tokenize your values on white space and make sure each one is ISO formatted...or you could just run with it and be careful with your encoding!
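A minimal sketch of the kind of Schematron constraint mentioned above. It assumes every value in @when-custom is a year-month in YYYY-MM form (as in the examples in this thread) and that the schema runs with the XSLT 2.0 query binding. It is written as a standalone schema; inside an ODD the rule would sit in a constraintSpec instead.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
    queryBinding="xslt2">
  <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>

  <sch:pattern>
    <!-- Fires on every date that carries custom date values. -->
    <sch:rule context="tei:date[@when-custom]">
      <!-- Split on white space and check each token separately.
           Assumption: only YYYY-MM values are allowed. -->
      <sch:assert test="every $d in tokenize(normalize-space(@when-custom), ' ')
                        satisfies matches($d, '^\d{4}-(0[1-9]|1[0-2])$')">
        Each whitespace-separated value of @when-custom should be a
        year and month in YYYY-MM form.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

If full dates (YYYY-MM-DD) also occur in the data, the regular expression would need another alternative.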

zme1 commented 6 years ago

@ebeshero I'm running through the code now without Schematron, but I may write a rule on it anyways just to be safe. In the meantime, do you have any advice on actually performing these transformations? I don't know how to use XSLT to create as many different act (for 'activity') elements for each member as there are whitespace-separated ISO values in its preceding sibling date element. I just pushed my most up-to-date XSLT file (xslt/tei/officerGenMember_data.xsl) and XML output (visualizations/officerGenMember_data.xml) to the repository. I apologize in advance for how bloated my xslt is.

zme1 commented 6 years ago

@ebeshero My template information for committees is all at the bottom of the file, and I haven't written anything for the event committees (since I'm still writing in the additional dates right now).

zme1 commented 6 years ago

Custom dates finished...

ebeshero commented 6 years ago

I'm pulling in the XSLT to take a look. Before I do, when I'm working with multiple attribute values separated by whitespace, I use the tokenize() function to split them apart on the white spaces, and use xsl:for-each (or for-loops in XQuery) to walk through them...

zme1 commented 6 years ago

@ebeshero I figured out that much, and I'm sitting with an <xsl:variable> that tokenizes all the dates. Then, inside the template, I have an <xsl:for-each select="$dateToken"> but I can't move anywhere from the attribute axis, as far as I know.

ebeshero commented 6 years ago

@zme1 There may be another way to work with your variable! As in step through XML nodes and find where they match a member of your variable list.

zme1 commented 6 years ago

@ebeshero I'm stumped here...

ebeshero commented 6 years ago

@zme1 My eyes are pretty tired just now, and I'm not sure I'm following how all of your data now connects together. My sense is that you want to be

It seems like you might want a scoring system: Assign points for activities (just as we might in person): If a person is active once in a month, assign "1". If twice, assign "2" etc. The higher the number, the brighter the color assigned.

ebeshero commented 6 years ago

Maybe it doesn't have to be that complicated...maybe it's just, for a row associated with a person, every time you see a date, output a square in that place. Use opacity. Where there's multiple activities on a date (in a month), the square darkens because multiple squares have been output on top of each other for that month.
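One way the opacity idea could be sketched in XSLT 2.0 that writes SVG. Everything about the input is an assumption here: a no-namespace file of `act` elements, each with a `ref` attribute naming one person and `date` children holding a single month in `when`. The 20-pixel cell size, the color, and the opacity value are arbitrary choices.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.w3.org/2000/svg"
    exclude-result-prefixes="xs">
  <xsl:output method="xml" indent="yes"/>

  <!-- One row per person, in order of first appearance. -->
  <xsl:variable name="people" as="xs:string*"
      select="distinct-values(//act/@ref/string())"/>

  <!-- One column per month, sorted; YYYY-MM strings sort chronologically. -->
  <xsl:variable name="months" as="xs:string*">
    <xsl:perform-sort select="distinct-values(//act/date/@when/string())">
      <xsl:sort select="."/>
    </xsl:perform-sort>
  </xsl:variable>

  <xsl:template match="/">
    <svg width="{count($months) * 20}" height="{count($people) * 20}">
      <!-- One translucent square per recorded activity. Squares for
           the same person and month stack, so the cell darkens. -->
      <xsl:for-each select="//act/date">
        <rect x="{(index-of($months, string(@when)) - 1) * 20}"
              y="{(index-of($people, string(../@ref)) - 1) * 20}"
              width="20" height="20" fill="red" fill-opacity="0.3"/>
      </xsl:for-each>
    </svg>
  </xsl:template>
</xsl:stylesheet>
```

No scoring step is needed with this approach, because the overlap of the squares does the counting visually.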

zme1 commented 6 years ago

@ebeshero It's ok, don't worry about it! I agree with you here, and I think the easiest way to do that would be to assign one <act> element for a specific person on a specific committee for every month of the committee's work. That way, if Joe was on a committee for four months, he would have four different elements, and each of which has a value of 1 "point" in the scoring system.

Right now, I have a template for all the committee lists in the volume, and the only ones I haven't transformed yet are the ones that span several months. What I want the computer to do, said in plain English, is "for every tokenized date on this specific list, create a new <act> element for every member with that date," so that if a committee of 6 is active for 3 months, I get 18 new elements in my data set.

I'm sorry for bombarding you! Most of this is just me thinking out loud so that I can try to work through it. I have a box of rubber ducks waiting for me at my house....

ebeshero commented 6 years ago

@zme1 That makes sense: you're generating something you can readily count! I see you're generating those <act> elements in the code I pulled in...Looping through tokens that are off the tree is a little tricky, but do-able. If there's something you need on the XML tree to continue into the for-loop, you need to preserve the context node in its own variable! I started to describe this in an earlier post here and deleted it thinking it might not be what you needed, but I think now maybe you do... Let me try to explain with a little made-up example:

<xsl:template match="TreeNode">
    <xsl:variable name="tokenizedDates" select="tokenize(TreeStuff, ' ')"/>
    <!--This variable is a bunch of strings, not on the tree at all any more!
    You need to loop through each member of the list and find out
    what's going on with it on the tree for this node you matched on. -->

    <xsl:variable name="contextNode" select="current()"/>
    <!--You need THIS variable to be available inside your for-loop. -->

    <xsl:for-each select="$tokenizedDates">
        <xsl:value-of select="$contextNode/stuff/on/tree/you/need[@when=current()]"/>
        <!--Here current() should refer to the current date-string in the for-loop.
        Check me on this and see if it works! -->
    </xsl:for-each>
</xsl:template>

That might just help you get unstuck with what you're processing...I remember the need to define a variable to hold the current context node was kind of a surprise and revelation to me! Things processed inside the for-each loop can lose track of their context unless you can invoke them explicitly. There may be other ways to handle this, but if you're looping through tokenized date strings, a variable on the current() context node (the node currently being touched and processed by your template match), seems to be necessary. Does this help?

zme1 commented 6 years ago

@ebeshero That seems like it could help......

zme1 commented 6 years ago

@ebeshero A step in the right direction!!! My code looks like this:

<xsl:variable name="dateToken" select="tokenize(date/@when-custom, ' ')"/>
<xsl:variable name="context" select="current()"/>
<xsl:for-each select="$dateToken">
    <act><xsl:attribute name="type">committee</xsl:attribute><xsl:attribute name="ref"><xsl:value-of select="$context/descendant::item/persName/@ref"/></xsl:attribute>
    <date when="{$dateToken}"/></act>
</xsl:for-each>

And my output looks like this:

      <act type="committee"
           ref="#datia #pellegrinic #lunardinim #viganol #maffeigia #maffeid #silvionic #lunardinifr #pasquinelliam #sandronee #maffeigia">
         <date when="1922-02"/>
      </act>

The $dateToken variable works perfectly, the only issue with this as it stands now is that, instead of every @ref getting its own act element, they are all placed in the same one. I think I am close... Maybe a variable that runs over each of the item elements in the list, or something of the like..

ebeshero commented 6 years ago

@zme1 Huzzah! I agree, you're getting closer to a solution here...Here are some questions I have (and these might be ignorant or based on my tired eyes not following, so please excuse me if I'm way off the mark):

1) Are your dates basically all to do with committee meetings, regardless of the people involved? If so, what about defining that as a global variable (not inside an <xsl:template> but available to any template rule)? 2) Why not set your for-each loop on the dates inside a different template, that's isolating the persons? Your $context would be set on the person, and the for-each loop over the dates could generate your <act> elements attuned to each individual... Could that work?
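A possible shape for the second suggestion, as a sketch only. It assumes each committee is a TEI `list type="committee"` with one `date` child carrying @when-custom and one `item` per member containing a `persName` with @ref, which matches the fragments posted in this thread but may not match every list in the volume. The wrapper element `data` is a made-up name.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <data>
      <xsl:apply-templates select="//list[@type='committee']/item"/>
    </data>
  </xsl:template>

  <!-- The template fires once per member, so the person is the context. -->
  <xsl:template match="list[@type='committee']/item">
    <xsl:variable name="person" select="current()"/>
    <!-- The dates sit on the sibling date element. Tokenizing turns
         them into plain strings that are no longer on the tree. -->
    <xsl:for-each select="tokenize(normalize-space(../date/@when-custom), ' ')">
      <!-- Here "." is one date string; $person leads back to the tree. -->
      <act type="committee" ref="{$person/persName/@ref}">
        <date when="{.}"/>
      </act>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Under those assumptions, a committee of 6 members with 3 tokenized dates produces 18 `act` elements, one per member per month.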

zme1 commented 6 years ago

@ebeshero

  1. The committees are only a portion of the information in this data set, and the dates are not all retrieved in the same relative path, which is why I made those local variables.

  2. I'm trying that right now... Let's see if I can tease it out.

zme1 commented 6 years ago

@ebeshero I was able to get it to the point that every member of an event committee has all the dates of their involvement listed in a child date element. It looks like this:

      <act type="committee" ref="#desimos">
         <date when="1920-07 1920-08 1920-09 1920-10 1920-11"/>
      </act>
      <act type="committee" ref="#silvionic">
         <date when="1920-07 1920-08 1920-09 1920-10 1920-11"/>
      </act>

I am almost there.........

zme1 commented 6 years ago

      <act type="committee" ref="#giannig">
         <date when="1925-05"/>
         <date when="1925-06"/>
         <date when="1925-07"/>
      </act>
      <act type="committee" ref="#scarpellinis">
         <date when="1925-05"/>
         <date when="1925-06"/>
         <date when="1925-07"/>
      </act>

New output look....

zme1 commented 6 years ago

@ebeshero Sorry for spamming you; I'm rubber ducking you again...

I think that the data could work in its current form... This may even seem to be a bit truer to the nature of the data. The action is not multiple different events iterated over a span of however-many-months. It's instead one action that can be considered to have taken place over the span of however-many-months. In this form, the date elements can be considered as snapshots of the members' activity, and I can tally those when I make my heat map.

Maybe I'll ultimately just be able to tell the computer to compile all the act elements whose date children share a common @when value, and the act elements with multiple date children will be processed that many times....
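The tally described here could be sketched with xsl:for-each-group in XSLT 2.0. The sketch assumes the intermediate file is in no namespace and that every `date` holds exactly one month in @when; the output element names `tally` and `month` are invented for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <tally>
      <!-- Group the date elements, not the acts, by month. An act
           with three date children lands in three different groups. -->
      <xsl:for-each-group select="//act/date" group-by="@when">
        <xsl:sort select="current-grouping-key()"/>
        <month when="{current-grouping-key()}"
               count="{count(current-group())}"/>
      </xsl:for-each-group>
    </tally>
  </xsl:template>
</xsl:stylesheet>
```

For per-person counts, as a heat map would need, the grouping key could combine the person and the month, for example `concat(../@ref, '|', @when)`.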

ebeshero commented 6 years ago

@zme1 I think you've got something tractable here for counting and plotting! And it sounds like you'll add this date to other kinds of data you're extracting about each person's activity...