zme1 / toscana

A repository to house research and web development for the Lega Toscana project, led by Professor Lina Insana (Spring 2018) and Professor Lorraine Denman (Fall 2018), with consultation from members of the DH Advanced Praxis group at the University of Pittsburgh at Greensburg.
http://toscana.newtfire.org

Commencement of the Second Leg of Toscana #48

Closed zme1 closed 6 years ago

zme1 commented 6 years ago

The repo is awake and alive once again! And with that, I have my first developmental concern...

The first and second portions of this project are only related by my primary source document, the verbale. As such, should I treat this primary source as material for an entirely new project? Should I or shouldn't I modify the source document to accommodate my current research goal (in other words, filter out the seg and list elements through XSLT to provide a cleaner starting file)? Also, should I just simply draft a new ODD file, or should I continue to build the one that I made last term? My instincts tell me to go clean slate on both fronts, but I want to know what you think, @ebeshero.

ebeshero commented 6 years ago

@zme1 Good to see the repo coming back to life! I'm reviewing your original ODD and recalling that markup. Let me answer your question with a couple of questions to be sure I'm understanding:

1) It sounds as if you imagine your new linguistic markup to be incompatible with the old markup of participants in the Lega. Is that indeed the case?

2) Your original markup of <seg> was geared to tracking participant activity. Is there any benefit to connecting the use of anglicisms to particular individuals mentioned in the minutes?

If the answer to these is "no" then there's certainly no harm in keeping multiple TEI editions of your documents. However, since you are thinking about creating a separate "clean slate" edition of your Lega documents, you might consider whether you want to try a stand-off annotation method. Read more about stand-off annotation in the TEI Guidelines in 16.9, 16.10, and 16.11 (so start here and continue to the end of the chapter). You'll see links out to ch. 17 on Simple Analytic Mechanisms and chapter 18 on Feature Structures, and you'll want to read those and give them some thought.

I wonder if it makes sense to transfer your markup about people to one of these stand-off formats, and perhaps in removing it with XSLT you might simply place that data in a new file and hold pointers of some kind in the Lega minutes. For example, one entry of transcribed minutes might contain a <ptr target="URI"/> that points to a file containing a correlated set of metadata about people in each entry. You could use a system of @xml:ids and @target attributes to connect the files together so that the data on people is still available by connection to the minutes files that will now be featuring linguistics markup. What do you think?
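
To make the pointing idea concrete, here is a minimal sketch in Python (standard library only) of how a <ptr target="URI"/> in the minutes could be followed to an element carrying a matching @xml:id in a second file. Everything specific in it is an assumption made up for illustration: the file name memberData.xml, the id m1, the sample content, and the helper resolve_ptr are hypothetical and not taken from the Toscana files.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical minutes entry: a <ptr> stands in for the removed people markup.
MINUTES = f"""<div xmlns="{TEI}" type="meeting">
  <ptr target="memberData.xml#m1"/>
  <p>Transcribed minutes go here.</p>
</div>"""

# Hypothetical stand-off file holding the people data, keyed by xml:id.
STANDOFF = f"""<TEI xmlns="{TEI}">
  <div type="memberData" xml:id="m1">
    <list><item>Member A opened the meeting.</item></list>
  </div>
</TEI>"""


def resolve_ptr(ptr, files):
    """Follow a ptr's @target ("file#id") to the element with that xml:id."""
    filename, _, fragment = ptr.get("target").partition("#")
    root = ET.fromstring(files[filename])
    for element in root.iter():
        if element.get(XML_ID) == fragment:
            return element
    return None  # dangling pointer: no element carries that id


minutes = ET.fromstring(MINUTES)
ptr = minutes.find(f"{{{TEI}}}ptr")
target = resolve_ptr(ptr, {"memberData.xml": STANDOFF})
items = [item.text for item in target.iter(f"{{{TEI}}}item")]
print(items)
```

The design point is only that the minutes keep a lightweight pointer while the detailed data lives elsewhere; any element that can carry @target and any element that can carry @xml:id would serve.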

zme1 commented 6 years ago

@ebeshero Browsing the TEI site you referenced, this looks like it may be a useful tool. I'll bookmark it for tomorrow and Thursday and see what I can find out.

And to answer your questions, the entire volume is logged through indirect discourse, so it would probably be futile to try to incorporate any members beyond those who actually wrote the minutes into this study. Although my mind has been changed countless times by countless developments on this project in the past. Only time will tell!

zme1 commented 6 years ago

@ebeshero I've been browsing online resources for stand-off markup, but I'm honestly hazy on how exactly it is implemented... My understanding is that I want to process my original file and output (or externalize, as TEI phrased it) two separate files from it -- one containing my updated corpus, with all structural tags still in place (which I still want to keep) and "pointers" in place of content-based tags from last semester's research (like all seg and list tags, for example), and the other containing information on those pointer tags in the first document. I'm having a bit of trouble relating that concept to the examples shown on the site and in other online resources I've found on stand-off markup. It seems to me like my newly output corpus file would thus include different referential markup in place of older elements whose information is now held in an external file. Is that what this would look like? @ebeshero

ebeshero commented 6 years ago

@zme1 TEI doesn’t have the greatest examples of stand-off implementation (yet)—I think this is a relatively new domain for the Guidelines, and it will improve. Some implementations are super complicated, like the ones that involve referencing strings of text by counting character position. I don’t recommend those in your case! Really, stand-off can be as simple as an attribute on a wrapper element that points to more detailed information elsewhere in another file, or stashed up in the TEI header.

So, what if you wrote an XSLT to extract the data you were going to remove from your original markup anyway, and organized that data in a second file that you design to correlate with the Lega minutes? How could you structure the new file, and construct a pointing system for referencing it to the minutes? What elements are best to carry anchoring ids and pointer targets?

zme1 commented 6 years ago

@ebeshero The Guidelines specifically recommend the XInclude mechanism -- which I think you implied was overly complicated for this type of transformation -- but they also say that other general forms of 'pointing' are permissible... If we could use, for instance, a ref element with a @target or @xml:id attribute to point to an external file (that only contains information on the types of elements excluded from the new corpus file?), that would seem to work. My question is, though, is there a benefit in this instance to storing specific element information outside the file when there will be placeholder elements in the corpus to flag them anyway? And if the only elements I'm definitely interested in keeping at this point are the structural ones, would every other element receive its own unique @xml:id in an external file?

zme1 commented 6 years ago

I'm trying to understand this; I fear I'm not stating my questions clearly because I don't fully grasp how it works yet.

ebeshero commented 6 years ago

@zme1 Well, I think it's complicated because there are a number of different ways to do stand-off annotation. One way is to indicate that something stored in another file may be included at X moment, and that is when you would use an XInclude mechanism. I've seen this in working with the Shelley-Godwin Archive files: they store complicated portions of the TEI header, for example, in an external file kind of like we do a server side include, as a sort of boilerplate that we don't need to clutter every file as we're working, but we pull in when it's needed. I don't know whether you might want to do this--it's worth thinking about, and it isn't very hard to set up--I can show you how, probably in a brief Hangout session is easiest. In oXygen there's a way you resolve XIncludes with a "canonicalize" feature that simply pulls in the material that the XInclude points to when you want to build up the fully complete file.
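
As a rough illustration of the include-on-demand idea, the sketch below resolves an xi:include with Python's standard xml.etree.ElementInclude module. It is not how oXygen or the Shelley-Godwin Archive does it; the file name header.xml, the sample header content, and the in-memory FILES dictionary standing in for files on disk are all assumptions for the example.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical main file: the header is pulled in on demand, not stored inline.
MAIN = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="header.xml"/>
  <text><body><p>Minutes...</p></body></text>
</TEI>"""

# Hypothetical boilerplate header kept apart (an in-memory stand-in for a file).
FILES = {
    "header.xml": """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc><titleStmt><title>Lega Toscana minutes</title></titleStmt></fileDesc>
</teiHeader>"""
}


def loader(href, parse, encoding=None):
    """Return the parsed element (parse="xml") or raw text for a given href."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


root = ET.fromstring(MAIN)
# Replaces each xi:include element, in place, with the content it points to.
ElementInclude.include(root, loader=loader)
print([child.tag.split("}")[1] for child in root])
```

After the call, the tree holds the full teiHeader where the xi:include used to be, which is the "fully complete file" view; the working file on disk can stay uncluttered.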

But stand-off annotation doesn't necessarily mean that you expect to include the other document in the original. The stand-off file can simply refer to the source file, by way of some specific signals that you set.

Either way, you can export the markup that's currently cluttering your space to another file. Whether you choose to include it formally (with XInclude) in the Lega transcript files is up to you, but I'm not sure it's really necessary. You could simply point to a separate file that contains a little extra information about the member activity in each set of minutes, and I think it wouldn't take much to do that, would it? If each set of minutes has its own @xml:id or @when, a stand-off file could contain some kind of structure that you design, something in sections that refer to each set of minutes. If I remember this right, your information about member activity is stored in a list in each set of minutes, right? Am I right in thinking that what you want to remove are the <list> elements, and/or other elements that you supplied that were tracking member activity and not literally part of the transcription? Such elements are what I am recommending you port to another file. The challenge in doing that is to prepare a structure for them, but it doesn't have to be much different from the one you're already using. Perhaps you could organize the stand-off file with <div type="memberData"> lists (just making this up now), which contain the <list> elements to pull them out of the way of the transcription. Each div could then correlate to a specific TEI file in your TEI corpus, perhaps hinged on the <date> element in your <teiHeader>. You basically just need some mechanism of connecting the data together if it's been held in a separate file, so that the file can be read as a series of annotations on your TEI corpus of minutes for the Lega. Does that help?
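
A minimal sketch of that export step, written in Python's standard library rather than XSLT so it can be run as-is. It assumes each set of minutes is a <div> carrying an @xml:id and that the editorial data sits in <list> children; the ids m1 and m2, the file name memberData.xml, the use of @corresp to link back, and the helper peel_off_lists are all hypothetical choices for illustration, not the project's actual structure.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI)  # serialize TEI as the default namespace

# Hypothetical corpus: each meeting div has an xml:id and an editorial <list>.
CORPUS = f"""<TEI xmlns="{TEI}"><text><body>
  <div type="meeting" xml:id="m1"><p>First minutes.</p>
    <list><item>Member A spoke.</item></list></div>
  <div type="meeting" xml:id="m2"><p>Second minutes.</p>
    <list><item>Member B spoke.</item></list></div>
</body></text></TEI>"""


def peel_off_lists(corpus_root, corpus_name="volume1.xml",
                   standoff_name="memberData.xml"):
    """Move each meeting's <list> into a stand-off tree; leave a <ptr> behind."""
    standoff = ET.Element(f"{{{TEI}}}TEI")
    # Snapshot the divs first, because the loop edits the tree it walks.
    for meeting in list(corpus_root.iter(f"{{{TEI}}}div")):
        lists = meeting.findall(f"{{{TEI}}}list")  # direct children only
        if not lists:
            continue
        meeting_id = meeting.get(XML_ID)
        data = ET.SubElement(standoff, f"{{{TEI}}}div", {
            "type": "memberData",
            XML_ID: f"{meeting_id}-members",
            "corresp": f"{corpus_name}#{meeting_id}",  # link back to minutes
        })
        for lst in lists:
            meeting.remove(lst)
            data.append(lst)
        ptr = ET.SubElement(meeting, f"{{{TEI}}}ptr")
        ptr.set("target", f"{standoff_name}#{meeting_id}-members")
    return standoff


corpus = ET.fromstring(CORPUS)
standoff = peel_off_lists(corpus)
print(ET.tostring(standoff, encoding="unicode"))
print(ET.tostring(corpus, encoding="unicode"))
```

The two printed trees show both halves of the connection: each memberData div points back at its set of minutes, and each set of minutes points out to its memberData div. Keying on a <date> or @when instead of @xml:id would follow the same pattern.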

zme1 commented 6 years ago

@ebeshero I think I'm understanding you better now... So, where would that second option leave us as far as the new output? Would those tags just be entirely removed and processed in the stand-off file, or would there be some sort of documentation in its place in the new corpus file?

zme1 commented 6 years ago

And as of right now, I'm trying to run an identity transformation on my corpus file. I think that may be the most succinct way to accomplish this. I'm currently looking for a way to remove the schema line written into my volume1.xml file in the transformation. Any pointers on that front?

zme1 commented 6 years ago

I notice that there is an xsl:import-schema element that I can (possibly?) use to refer to the new ODD file that I associate with my new corpus file down the road, but I am still trying to hash out how to remove the old ODD schema line from my original input file with an identity transformation. Maybe, instead of doing an identity transformation, I can just run an xsl:copy of each of the structural tags in my XSLT file, but that seems overly tedious if I can silently process the elements I want (through identity transformation) and focus on the elements I want to relocate.

ebeshero commented 6 years ago

@zme1 I think I've missed a beat here: I'm not sure I'm following why you need to change to a different ODD schema. What about modifying the existing ODD (save your new one in place of the old)?

zme1 commented 6 years ago

@ebeshero I might be able to retain my old ODD without too many problems. I haven't seen how it would adapt to my new project, though. There are parts of the old corpus file that I don't have any need for at this point (like the translation div and the metadata placed in each meeting header on officers). I could further customize my content models to accommodate my very few new elements and attributes. I just deferred to the thinking that the second project should receive a second schema. If you think using the original schema could still work, I could start poking around the ODD to see how easily I could find that balance.

ebeshero commented 6 years ago

@zme1 I'm confused because I thought we were investigating stand-off markup as a way of maintaining a connection to the work of last year, instead of getting rid of it. My motivation for suggesting you use stand-off was to find a way to make these portions of the project be continuous, instead of inconsistent with one another. If you're moving the portions of the project you're not using into a separate stand-off file, the original is going to have to change anyway, so it seems like you might just want to modify your existing ODD accordingly. You may need to revise the content model of the transcript files, but it's probably not a huge change, and it's worth revisiting what you already did to see what you want to keep and document what you want to change.

Okay, so for the new stand-off file, the question is whether it requires its own distinctly new structure. You could write a new ODD just for it, or find a way to make its encoding be consistent with your main transcript file. For the Mitford project, I opted recently to make a separate ODD to govern my prosopography file, because it's a lot different from files I'm transcribing. I guess you're going to need to map out your content models for how things should fit together.

I've never used xsl:import-schema, but I have added processing instructions to collections files with XSLT, like this: https://www.w3schools.com/xml/ref_xsl_el_processing-instruction.asp Again, I'm not sure it's going to be necessary for the Toscana minutes. If you're writing a new ODD, it should probably be for the stand-off file.

ebeshero commented 6 years ago

Here's more on matching existing processing instructions to alter them: https://stackoverflow.com/questions/1366878/how-to-match-the-processing-instruction-element-in-xslt
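
For comparison, the same effect (copy everything through unchanged except the schema line) can be sketched outside XSLT. The Python below uses the standard xml.dom.minidom module, which keeps processing instructions in the document prolog; it is a stand-in for the identity transformation discussed in this thread, not a replacement for it. The schema file names old.rng and toscana.rng, the sample document, and the helper swap_schema are hypothetical.

```python
from xml.dom import minidom

# Hypothetical input: a corpus file whose prolog associates it with an old schema.
SOURCE = (
    '<?xml-model href="old.rng" type="application/xml" '
    'schematypens="http://relaxng.org/ns/structure/1.0"?>'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<text><body><p>Minutes.</p></body></text></TEI>"
)


def swap_schema(xml_text, new_href):
    """Drop existing xml-model instructions and add one pointing at new_href."""
    doc = minidom.parseString(xml_text)
    # Copy the child list first, since nodes are removed while looping.
    for node in list(doc.childNodes):
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-model"):
            doc.removeChild(node)
    pi = doc.createProcessingInstruction(
        "xml-model",
        f'href="{new_href}" type="application/xml" '
        'schematypens="http://relaxng.org/ns/structure/1.0"',
    )
    # The schema association belongs in the prolog, before the root element.
    doc.insertBefore(pi, doc.documentElement)
    return doc


doc = swap_schema(SOURCE, "toscana.rng")
print(doc.toxml())
```

Everything other than the xml-model instruction passes through untouched, which is the same division of labor an identity transformation gives: one rule for the thing to change, and a default copy for the rest.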

zme1 commented 6 years ago

@ebeshero So, I will still maintain only one copy of the corpus file. Except now, any of the irrelevant markup will be transferred into a new stand-off file and flagged by pointer markup in my corpus file. Instead of stripping my volume 1 file, I'm re-dressing it. Am I describing it more faithfully now?

ebeshero commented 6 years ago

@zme1 That's the idea of the stand-off, yes. You don't have to do this, by the way--it's just an idea for how to deal with an old markup regimen, to sort of peel it off as a separate layer from the original.

You could just start over by ripping out the old markup, and maintain two separate TEI files representing the Lega (your old one, and the new one you're working on). But if you're building in a TEI way, it's probably best (= most sustainable for the long range) to have a single stable TEI file that represents the Lega minutes as the file you'd pass along to others should they wish to continue research.

zme1 commented 6 years ago

@ebeshero What is motivating me to try to figure it out is the very fact that it is the most sustainable, though. I would like to avoid doing something less robust or fully complete if I can. Do you think we could try to arrange that short Hangout call at some point within the next couple days to see if I can get any traction with the process?

ebeshero commented 6 years ago

I'll ping you on Hangouts to see what makes sense for timing...