zme1 / toscana

A repository to house research and web development for the Lega Toscana project, led by professor Lina Insana (Spring 2018) and professor Lorraine Denman (Fall 2018), and with consultation from members of the DH Advanced Praxis group at the University of Pittsburgh at Greensburg.
http://toscana.newtfire.org

XPath Expression Problem #16

Closed zme1 closed 6 years ago

zme1 commented 6 years ago

In creating my lists of incoming Lega Members, I realize that a few are duplicated within the same year. I have been trying to find them by launching an XPath search that identifies any persName element whose @ref appears more than once in a list of @type='applicants', but I can't figure out where I am going wrong. I have tried about 7 or 8 different combinations of searches.

My most recent and succinct search was the following:

//persName/@ref[count(ancestor::list[@type='applicants']) gt 1]

I walked down to every @ref in the whole volume and tried filtering the return to only those that have more than one list of @type='applicants', but the search returns nothing. I am unsure how exactly to write the predicate.

zme1 commented 6 years ago

Sidenote, my cancellatiAccettati_verticalBar.xsl file adds an xhtml namespace to all my rect and text elements in my svg, and I have no idea as to why it's doing that....

djbpitt commented 6 years ago

@zme1 Anent your sidenote, I’ve fixed the namespace problem in the js branch. Your template that creates the SVG <rect> and <text> elements is creating them in the HTML namespace because you’ve (correctly) made that the default namespace. The fix is to add a namespace declaration that binds the svg: prefix to the SVG namespace and use the prefix when creating those elements. That way all of your literal result elements will be in the HTML namespace except when you use the svg: prefix to specify the SVG namespace.

djbpitt commented 6 years ago

@zme1 Concerning duplicate applicants, can you identify a name that is repeated in the list of applicants for a year? I’ve pushed (in the new XQuery subdirectory on the js branch) an XQuery script that, when run against volume1.xml, produces an HTML report of @ref values by year, along with a count of how often each occurs in the list of applicants for that year. (I’ve saved the HTML output in the same directory, so you can view it without having to run it first.) I don’t see any that are reporting more than one appearance. It’s very possible that I’ve screwed up the XQuery, so it would be helpful if you could identify a specific duplicate that I could then use as a target during debugging.

djbpitt commented 6 years ago

@zme1 Er ... my report is actually applicants by meeting, not by year. If I’ve understood correctly that that’s what you wanted, my error is in the way I’ve titled the report, rather than the data. But I may also have screwed up the counting, about which see my comment above ...

zme1 commented 6 years ago

@djb Olindo Pacini (@ref='#pacinio') appears in two different applicant lists in 1919, first in August, then in November. His name is the only one that I know with certainty is duplicated in the applicant lists, but I strongly suspect there are others. My opinion is that a situation like this, in which one person is found on two separate lists very close together in time, seems to be some type of erroneous duplication in the minutes. I'm not so sure that's true of those who appear in two lists over a greater span of time (as a matter of fact, Olindo Pacini actually reapplies in 1925); those seem to be correct. If you remember our conversation on the flow of members, it's not always kept very tidily. I think this may be in large part because members' status can be suspended if they fail to keep up with membership dues, but I think they can reapply after a certain amount of time. Another possibility is that some may have needed to reapply after an extended time away from the Lega or from the Pittsburgh area.

djbpitt commented 6 years ago

@zme1 Ah! I was looking only for duplication within the same <list>, so I wasn’t catching people who were repeated in different lists that were close in time. When you write “very close together temporally”, how should we formalize closeness? We could do it by meetings (e.g., appears more than once within a sliding window of three meetings) or dates (e.g., appears more than once in meetings within a sliding window of 90 days). Or in other ways, if you have something else specific in mind.

zme1 commented 6 years ago

My gut ruling on the matter was multiple appearances on lists within the same calendar year. If, after determining how widespread this issue is, we find that it’s more common than I believe it to be, we can constrain that range to a smaller time frame.
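The "same calendar year" rule can be sketched in XQuery roughly as follows. This is only an illustration under stated assumptions: the inline sample is invented, and the layout it uses (each applicant list inside a <div type="meeting"> whose <head> holds a <date when="...">) is a guess, not the project's confirmed markup. Only the months and years for #pacinio come from this thread; #exampleA is a hypothetical identifier.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Invented sample. The div/head/date layout is an assumption. :)
let $sample :=
  <body>
    <div type="meeting">
      <head><date when="1919-08"/></head>
      <list type="applicants">
        <item><persName ref="#pacinio">Olindo Pacini</persName></item>
        <item><persName ref="#exampleA">Applicant A</persName></item>
      </list>
    </div>
    <div type="meeting">
      <head><date when="1919-11"/></head>
      <list type="applicants">
        <item><persName ref="#pacinio">Olindo Pacini</persName></item>
      </list>
    </div>
    <div type="meeting">
      <head><date when="1925"/></head>
      <list type="applicants">
        <item><persName ref="#pacinio">Olindo Pacini</persName></item>
      </list>
    </div>
  </body>
(: One pass per distinct @ref value found in any applicant list. :)
for $ref in distinct-values($sample//list[@type = 'applicants']//persName/@ref)
(: Every applicant list that mentions this person. :)
let $lists := $sample//list[@type = 'applicants'][.//persName/@ref = $ref]
(: The calendar years of those lists, taken from the first four characters of @when. :)
for $year in distinct-values(
  $lists/ancestor::div[@type = 'meeting']/head/date/substring(@when, 1, 4))
(: The lists for this person that fall in this year. :)
let $sameYear :=
  $lists[ancestor::div[@type = 'meeting']/head/date[starts-with(@when, $year)]]
where count($sameYear) gt 1
return concat($ref, ' ', $year, ': ', count($sameYear), ' lists')
```

With this sample, only #pacinio in 1919 should be reported, since the 1925 reapplication falls in a different year. Narrowing the window later would mean comparing full dates rather than year prefixes.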

ebeshero commented 6 years ago

@zme1 @djbpitt I'm tuning in a little late to this party, I see, but when I first saw your message, I thought this was a perfect application of a for-loop, as in for $i in distinct-values(//@ref), walk the tree and find out how often it's repeated and where. Usually that's a bit much for the XPath window, and I think you can walk this back from David's more complicated XQuery, since you don't need to find repetition within a year. You need to find any repetition at all... Are you both writing XQuery in oXygen at this point or in eXist-db? (I notice David's not using a doc() function so I suspect this is just in oXygen...)
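A minimal sketch of that kind of for-loop, run against an invented inline sample rather than volume1.xml. The @xml:id values on the lists and the #exampleA and #exampleB identifiers are hypothetical, added only so the output can say where each repetition occurs. This is not the XQuery from either branch.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Invented sample. The xml:id values are hypothetical locators. :)
let $sample :=
  <body>
    <list xml:id="list1" type="applicants">
      <item><persName ref="#pacinio">Olindo Pacini</persName></item>
      <item><persName ref="#exampleA">Applicant A</persName></item>
    </list>
    <list xml:id="list2" type="applicants">
      <item><persName ref="#pacinio">Olindo Pacini</persName></item>
    </list>
    <list xml:id="list3" type="applicants">
      <item><persName ref="#exampleB">Applicant B</persName></item>
    </list>
  </body>
(: Loop over each distinct @ref value in the applicant lists. :)
for $i in distinct-values($sample//list[@type = 'applicants']//persName/@ref)
(: Walk the tree again to collect every applicant list that contains it. :)
let $hits := $sample//list[@type = 'applicants'][.//persName/@ref = $i]
where count($hits) gt 1
order by $i
return concat($i, ' appears in ', count($hits), ' applicant lists: ',
              string-join($hits/@xml:id, ', '))
```

Against the real file, $sample would be replaced by the document node (for example, the context document in oXygen), and the locator would be whatever actually identifies a meeting in the markup.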

ebeshero commented 6 years ago

@zme1 A short XQuery shows me that you have 167 persNames listed with @ref inside applicant lists, and when I take distinct values of those I get 150. So it seems you have 17 repeat applicants, or at least 17 instances of some kind of duplicate (could be one person who just keeps trying to join and join 17 times...;-) ) Now, to determine where they appear, we'll just need to build on this and loop through the list of distinct values. Anyway, here's my starter in oXygen:

declare default element namespace "http://www.tei-c.org/ns/1.0";
let $applicants := //list[@type="applicants"]//persName/@ref/string()
let $dvApps := distinct-values($applicants)
return (count($applicants), count($dvApps))

ebeshero commented 6 years ago

@zme1 Here's a pull request from my ebb branch with the rest of that XQuery I started and a little text-file read-out. I figure you want to look at each of these instances in turn to see how far apart they are in time and evaluate what repetition might mean in each case.

https://github.com/zme1/toscana/pull/17 and https://github.com/zme1/toscana/pull/17/files

ebeshero commented 6 years ago

@zme1 Probably @djbpitt or I should have walked through what wasn't working with your initial XPath expression on this issue: //persName/@ref[count(ancestor::list[@type='applicants']) gt 1]

With this you are walking your tree down to the @ref attributes on every persName element, and looking for any such attribute where there's a count of more than one applicant list ancestor. It always seems strange to me that oXygen returns ancestors of attributes beyond their parent elements, but it does. Nevertheless, I think in your project code, this search is always going to turn up empty unless you have nested list[@type="applicants"] within one another. (And no, you don't ever have this construction: an XPath search on //list[@type="applicants"]//list yields no results.)
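To make the contrast concrete, here is a small sketch that runs the original predicate and one possible revision side by side on an invented sample. The revision counts how many applicant @ref attributes share the same value, instead of counting ancestor lists. The #exampleA identifier is hypothetical, and the index-of() approach is just one way to phrase the test, not necessarily what the pull request uses.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Invented sample: two sibling applicant lists, none nested. :)
let $sample :=
  <body>
    <list type="applicants">
      <item><persName ref="#pacinio">Olindo Pacini</persName></item>
      <item><persName ref="#exampleA">Applicant A</persName></item>
    </list>
    <list type="applicants">
      <item><persName ref="#pacinio">Olindo Pacini</persName></item>
    </list>
  </body>
(: Original predicate: each @ref has exactly one applicant-list ancestor,
   so the count is never greater than 1 and nothing is selected. :)
let $original :=
  $sample//persName/@ref[count(ancestor::list[@type = 'applicants']) gt 1]
(: Revised predicate: index-of() returns every position at which the
   current value occurs among all applicant @ref values. :)
let $all := $sample//list[@type = 'applicants']//persName/@ref
let $revised := distinct-values($all[count(index-of($all, .)) gt 1])
return (count($original), $revised)
```

On this sample the result should be 0 followed by "#pacinio". In the XPath window, the same idea could presumably be written as a single expression: distinct-values(//list[@type='applicants']//persName/@ref[count(index-of(//list[@type='applicants']//persName/@ref, .)) gt 1])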

zme1 commented 6 years ago

@ebeshero I am looking at your most recent pull request, and I think it helped me to try and fish for a different way to approach this situation. On a side note, your XQuery returned some results that are not in full ISO format, and those dates are not actually dates of meetings, but they are instead dates assigned to committees within the same meeting as the list in which they appear as an applicant. All of the dates that appear in full ISO format are those I'm considering, and I have found three kinds of cases:

  1. A member's name appears on two different application lists within the bounds of one year, with no other appearance in the minutes in between. (#pacinio: 1919, #barsantif, #marsilie, #diodatir, #tamburia)
  2. A member's name appears on two different application lists within the bounds of one year, and the name also appears in the minutes between those two points. (#marchettie)
  3. A member's name appears on two different application lists that are at least 1 full year apart (in all but one case, it's actually at least 4 years). (#pacinio, #tambellinial, #piccininig)

I think the latter two cases are more cut-and-dried. With the third group, I think they are members who actually did apply twice, after silently leaving the Lega at some point in between. Our research group was told by a family member of the Lega that this was actually not too uncommon; members oftentimes took a leave of absence to return to Italy for a certain amount of time, or they were put on probation for having missed so many monthly dues payments. I think the first and second group may be the same kind of situation, and after having read each of the contexts more closely, I wanted to know your thoughts. I believe that, with the first six members, they are accepted at the first date, but initiated at the second date. With all new accepted members, presentation of medical histories and certificates at a meeting is required, and almost all the new members do that at the time of acceptance. That probably means that they were accepted and initiated simultaneously.

The reason I think that is Ruggero Diodati, who appears in two lists within a few months of each other. Between those two points he is mentioned again, and the minutes say that "...not having any new applications for membership, and new member Ruggero Diodati not having presented himself for admission for the second time, ...", which leads me to believe that new members must personally present their certifications to the Lega before they can be fully accepted. I think, then, that scenarios 1 and 2 may need to be addressed in the visualization, but the third one is valid data.

ebeshero commented 6 years ago

@zme1 This sounds to me like something you want to document in your files, for processing purposes and also perhaps for reading purposes as you produce an edition of the minutes on your site. What if you inserted <note resp="#zme">....</note> elements after the introduction(s) of these people in their <list type="applicants">? They are not many, and they constitute special cases that you need to account for in your study. You could explain in each case what you and the team speculate about each person's initiation.

In your processing, if there is a <note> element present inside a list item about initiating a new member, that can be a sort of flag for you to process these people differently. You could leave yourself a note on your XSLT or XQuery files you're writing to check your short list of special cases and modify your output data to include or exclude them. Would this work, do you think? This is one of those cases where data about human activities is messier than a computer can understand. :-)
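A rough sketch of what that flag could look like at processing time, assuming the <note> sits directly inside the <item> alongside the <persName>. The sample list, the #exampleA and #exampleB identifiers, the resp value, and the note text are all invented for illustration.

```xquery
xquery version "3.0";
declare default element namespace "http://www.tei-c.org/ns/1.0";

(: Invented sample: one ordinary item and one flagged with a note. :)
let $sample :=
  <list type="applicants">
    <item><persName ref="#exampleA">Applicant A</persName></item>
    <item>
      <persName ref="#exampleB">Applicant B</persName>
      <note resp="#enickz">Placeholder for the editorial explanation.</note>
    </item>
  </list>
let $items := $sample/item
(: Items carrying a note are the special cases to review by hand. :)
let $flagged := $items[note]
(: Items without a note flow into the visualization data as usual. :)
let $counted := $items[not(note)]
return (
  concat('counted: ', string-join($counted/persName/@ref, ' ')),
  concat('flagged: ', string-join($flagged/persName/@ref, ' '))
)
```

Keeping the flagged items in a separate sequence, rather than silently dropping them, leaves room to decide case by case whether an entry should be included or excluded.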

zme1 commented 6 years ago

@ebeshero Do you think I could rewrite my visualization so that, if an item in the list contains a <note resp="#enickz">, I exclude it from the visualization? Could that be a viable workaround? It could maybe serve a dual purpose of informing the text and steering the data set in the right direction, I think.

zme1 commented 6 years ago

@ebeshero I think I may have just repeated what you said back to you, oh well...

ebeshero commented 6 years ago

@zme1 Ha! :-) It's not really a repetition; you're considering the processing consequences. Yes, that's along the lines of my suggestion, and if it makes your processing run smoothly and helps to document your special cases in a helpful way, that's the general idea here.