zme1 / toscana

A repository to house research and web development for the Lega Toscana project, led by professor Lina Insana (Spring 2018) and professor Lorraine Denman (Fall 2018), and with consultation from members of the DH Advanced Praxis group at the University of Pittsburgh at Greensburg.
http://toscana.newtfire.org

HTML Minute Reading View #33

Closed zme1 closed 6 years ago

zme1 commented 6 years ago

I am hoping to make some serious progress on my reading view development today, and I'm trying to decide which HTML rendering will be best for a dynamic reading view. As of now, I can't plan to align each version more precisely than a page-by-page basis (but I hope to play around with more advanced features between the end of the term and the CSIS conference).

Right now, I am transforming each year separately, so that they all have their own full HTML page. Within each page, I want the user to decide both which month of minutes they want to read, and which one or two different versions they want to study. I roped each month's minutes off in a section with an @id attribute whose value is the ISO date of the meeting. Inside each section, I have three div elements, one for the manuscript scans, one for the transcription, and one for the translation (which I haven't yet converted...).

Is this a transformation that will be conducive to JavaScript implementation? Also, would flexbox or grid be viable CSS styling options so as to split the viewing window in half if the user selects two versions of the text and allow one to fill the window more completely if they only select one? @ebeshero

zme1 commented 6 years ago

Also, would it be best to work with the translation in XML and transform it to HTML, or should I just encode it directly into HTML? I don't plan on encoding the text at all, but something tells me that it'd be better to start with XML...

ebeshero commented 6 years ago

@zme1 In a nutshell: Yes and yes, and it'll take some tinkering! One question is whether you'd like the complete minutes to be produced statically (from already present code), or if you want to reach into eXist-db to make your search run dynamically. Imagine some simple XQuery scripts to retrieve minutes by month, and then by date, and then by version. We can set up something with PHP, if you like, to reach in and dynamically transform your XML file for view on the website--to handle the search and retrieve part of the user interaction. Let's set up a special meeting to set that up if it sounds like something you'd like to try.

JavaScript could simply handle the search and retrieve as well, if you have output everything on gigantic pages and have most things hidden. You can retrieve just the passages you want to show and hide the rest. (Basically you have options here to think about.)
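As a rough illustration of the show-and-hide option, here is a minimal sketch of the selection logic, written in Python rather than JavaScript so it can be run on its own. It assumes the structure described above: one `section` per meeting whose `id` is the ISO date, each holding three `div` elements. The class names (`scan`, `transcription`, `translation`) and the sample dates are invented for the example and are not taken from the project files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of one year's output page; class names and dates are assumptions.
SAMPLE = """
<body>
  <section id="1925-01-11">
    <div class="scan">scan A</div>
    <div class="transcription">testo A</div>
    <div class="translation">text A</div>
  </section>
  <section id="1925-02-08">
    <div class="scan">scan B</div>
    <div class="transcription">testo B</div>
    <div class="translation">text B</div>
  </section>
</body>
"""


def visible_divs(page_xml, month, versions):
    """Return (meeting date, version) pairs that should stay visible.

    month is a two-digit string such as "02"; versions is a set of class names.
    Everything not returned would be hidden on the page.
    """
    root = ET.fromstring(page_xml)
    shown = []
    for section in root.iter("section"):
        date = section.get("id", "")
        if date[5:7] != month:  # ISO date layout: YYYY-MM-DD
            continue
        for div in section.findall("div"):
            if div.get("class") in versions:
                shown.append((date, div.get("class")))
    return shown


result = visible_divs(SAMPLE, "02", {"transcription", "translation"})
print(result)
```

In the browser, the same test on each section's `id` and each div's class would decide which elements get hidden, and the number of visible versions (one or two) could decide whether the flex container shows one column or two.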

As for displaying or hiding translations, yes, that would involve another layer of JavaScript, and some careful page design. I definitely like CSS flexboxes for this sort of thing because it's not very cumbersome to write once you understand how it works.

ebeshero commented 6 years ago

@zme1 My post above was in answer to your first post. As for the translations, it doesn't seem right to me to encode that in HTML when you have an XML base. 1) Don't you need to associate the translation material with specific minutes? 2) Can you incorporate the Translation directly into the Italian TEI, by embedding it piece by piece into the appropriate place in the minutes? 3) If it's a separate file, you could up-convert it to XML with regex, and attach xml:ids to help show matching points with the TEI file...and do an XSLT ID transform to fold it into your file. 4) Think of your TEI XML as a database in which ALL of this information should be readily available, like pulling out desk drawers or file folders in a filing cabinet.
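The regex up-conversion in point 3 could look something like the following sketch. It covers only the first step (wrapping each page of plain text in an element with an `xml:id`), not the XSLT identity transform. The page-marker format `[Page N]` and the `trans-pN` id pattern are assumptions made for the example; the real Word document may label its page breaks differently.

```python
import re
from xml.sax.saxutils import escape

# Assumed marker format: a line containing only "[Page N]". Purely hypothetical.
PAGE_MARKER = re.compile(r"^\[Page (\d+)\][ \t]*$", re.MULTILINE)


def upconvert(text):
    """Wrap each page of unencoded translation text in a div with an xml:id."""
    parts = PAGE_MARKER.split(text)
    # With one capture group, split() returns:
    # [text before first marker, number1, body1, number2, body2, ...]
    divs = []
    for number, body in zip(parts[1::2], parts[2::2]):
        divs.append(
            '<div type="translation" xml:id="trans-p{}">{}</div>'.format(
                number, escape(body.strip())
            )
        )
    return "<body>" + "".join(divs) + "</body>"


out = upconvert("[Page 1]\nFirst & second.\n[Page 2]\nThird.")
print(out)
```

The `xml:id` values give the later transform something stable to match against page breaks in the TEI file.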

zme1 commented 6 years ago

@ebeshero I think the run time on my site is already a little slow, so the idea of bloating the site with more invisible code seems unattractive at this point. Since the pages will be split by year, they won't be as monstrous in size as my keyMembers.html page currently is, so I'm not certain as to how well or poorly the server would handle these minutes.

As for your second response, that seems like it could work, although it would necessitate some re-designing of my ODD file and XML superstructure. What do you recommend for something like this? Maybe a div element that contains each page?

ebeshero commented 6 years ago

@zme1 What does the translation look like at this point? As much as possible, it would be best to make it available within the current structure, and by that I mean physically associating each translated set of minutes with its counterpart in the current file. You need a clear and precise way to associate each unit of Italian text with its English counterpart.

zme1 commented 6 years ago

As of right now, the translation is in our project's Google Drive, inside a single Word document. The page breaks are all present, so aligning the texts at the page-level seems like it wouldn't be too difficult an undertaking.

zme1 commented 6 years ago

@ebeshero

ebeshero commented 6 years ago

@zme1 The more I think about this, the more the eXist-db solution seems appropriate: Query your XML as a database, use PHP to deliver it, minimize the JavaScript.

ebeshero commented 6 years ago

@zme1 Ahhh, you've got some regex work now. Do you want to keep the translation in an entirely separate "mirror" file from the Lega? Or do you want to try to fold it into that TEI?

zme1 commented 6 years ago

@ebeshero I started to restructure my XML so that every ab element can only legally contain div elements (of @type value "transcription" or "translation") and quickly remembered that there will be a huge amount of overlapping elements if we take this route. Names, proposals, lists, notes, and compensations all continue through page breaks throughout the volume. Could I just have one transcription div and one translation div in each minute, but record all the page breaks within them for alignment on the reading page later?

zme1 commented 6 years ago

@ebeshero I've never used a database query before, so I'd like to try!

ebeshero commented 6 years ago

@zme1 Database query: Let's schedule a time to work on this--I think we need a couple of hours.

Translation structure: You should do this in whatever way makes sense! Does your translation file actually cover / match every piece of the minutes, or does it skip, say, lists of names, and just translate a portion of each section? I ask this because you want to think about what you're aligning and excluding stuff that's not part of the alignment.

I don't know what's in your translation file, but I gather it's basically unencoded text at this point, with page-breaks marked. You're looking for a quick way to align this, so maybe it's reasonable to think of the translation as something like the page image files. If a <pb/> element in your Lega TEI has a @facs pointing to its associated page-image, perhaps it might also have an attribute (check, should this be @target?) that points to a URL containing the translation for that page. Then you chunk the Translation document into separate files containing what you need. I still think you want to encode that in simple TEI (which should be pretty easy to transform), because the translation is data for the project and you may want to write XSLT or XQuery to associate pieces together for other reasons than the reading view. (Transforming to HTML wouldn't be a big deal, and indeed, you may want to "stitch" the data together on a query of the database, so on search of the Italian, you're also able to grab an associated English translation from the translation file.) What do you think?
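To make the page-level pairing concrete, here is a small sketch in Python that reads the `<pb/>` elements from a TEI fragment and lists, for each page, its image and the translation chunk that would go with it. The sample fragment, the image paths, and the `translation/pageN.xml` naming scheme are all hypothetical. The sketch derives the translation path from `@n` instead of reading a pointer attribute, since the right attribute name is still an open question above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment; file names and page numbers are invented for the example.
SAMPLE_TEI = """
<div xmlns="http://www.tei-c.org/ns/1.0" type="transcription">
  <pb n="1" facs="images/page1.jpg"/>
  <p>Testo della prima pagina.</p>
  <pb n="2" facs="images/page2.jpg"/>
  <p>Testo della seconda pagina.</p>
</div>
"""


def page_table(tei_xml, translation_dir="translation"):
    """List each page break with its image and an assumed translation file path."""
    root = ET.fromstring(tei_xml)
    rows = []
    for pb in root.iter(TEI + "pb"):
        n = pb.get("n")
        rows.append(
            {
                "page": n,
                "image": pb.get("facs"),
                # Assumed naming scheme for the chunked translation files.
                "translation": "{}/page{}.xml".format(translation_dir, n),
            }
        )
    return rows


table = page_table(SAMPLE_TEI)
for row in table:
    print(row)
```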

zme1 commented 6 years ago

@ebeshero To respond to your question of translation material, it is a comprehensive translation of all the content in the entire volume, and it is unencoded with page breaks labelled.

I just revised my ODD file to potentially contain both the transcription and translation; the body can now contain a div element of @type "transcription" or "translation", and I just ran a regex to apply that structure to the file. The schema and file structure both check out now. In my XSLT file that I started two days ago, I was able to successfully align each manuscript scan with the correct meeting. If I were to encode a diluted translation (with page and line breaks where appropriate), would I need the @facs attribute, since in the original document the two versions would be adjacent to each other, and in the transformation all three would be contained inside the same section element?

ebeshero commented 6 years ago

@zme1 Well, the @facs was to contain a page image URL (not to do with the translation). I was suggesting an @target to accompany it if you were holding the translation in a separate standalone file that you wanted to connect with your TEI. If you're embedding the English translation into the Lega TEI, that's a separate issue. I imagine you would then break up the translated passages "fore" and "aft" of your <pb> elements in the Lega TEI, without needing to signal how they fit (because you've literally set them in exactly where they fit in the source document).

zme1 commented 6 years ago

@ebeshero I just pushed an example of the prospective html structure. I am working only with the year 1925 right now. I added a translation for every meeting in my xml/meetingMinutes/volume1.xml file for that year, and I also created a reading view XSLT file: xslt/html/minutes_readingView_byYear.xsl and pushed all of it to the server. The output is available at html/meetingMinutes/minutes_1925.html.

I'm not certain how close or far this is from what you were suggesting as potential output, whether we end up using XQuery or just JavaScript to shape the page.

ebeshero commented 6 years ago

@zme1 I see the new XML and HTML, but I think the XSLT in the repo must not be current (no <div class="translation"> is accounted for in it). As for the structure, I see what you're doing: You're setting the Italian transcription for a given set of minutes in its own TEI div, and an English translation following it in its own TEI div. Each one contains multiple pages.

Every set of minutes seems to include one or more pages, and there are one or more page images in that set of minutes. You have one complete translation, and one complete transcription with page boundaries marked in each. You should be able to build something around that for your web interface. Think about how the layout pieces fit together. If you want to make page images visible, it seems like you'll be showing two or three for each set of minutes, and that likely won't be a problem. You may want to make a set of medium and small ("thumbnail") size images available to expand on click.

What is "a page" in the HTML site on this project? Is an HTML page one date's minutes? Or is it one year's minutes? Or could it/ should it be both depending on how much the visitor wants to take in at once? (That could be something you can let the user decide on entering some parameters--with some XQuery.)

zme1 commented 6 years ago

@ebeshero I didn't save my XSLT file in its most recent form, but just pushed the most recently updated version to the repo. I'm still not entirely decided on what will constitute an entire page. What I see in my mind's eye is the seven subsidiary pages, each for a full year. Then from there, the user can select which individual meeting they want to read and in which form they want to read it at a given point through a toggle menu with designated options for them. Although I'm not set on that yet; if you think that another alternative will work better I'm more than open to suggestions.

ebeshero commented 6 years ago

@zme1 I'm used to looking at page views on the Shelley-Godwin Archive lately, and you've seen how our Emily Dickinson project organizes page images with relation to text. When you've got page images to work with, you may want to think of the smallest aligned units possible:

ebeshero commented 6 years ago

I know these are difficult questions and I don't really have a strong opinion one way or the other! I think the horizontal layout here is probably the most challenging question: Once you figure out how to lay out the page images next to transcription and/or translation (and is there anything else you want to appear?) that may help you figure out how you want to handle the minutes for a whole year.

To me one of the interesting questions that I don't explore as often as I'd like is how to design a query engine that would retrieve, say, all the minutes that mention a person or special event or keyword in them. I might want to retrieve minutes for January 15, 1919 and October 5, 1922, and skip everything else unrelated to my query. Then again, I might want to be able to just read everything consecutively. Can we accommodate both kinds of visitor?

ebeshero commented 6 years ago

@zme1 Sorry for multiple messages--but I hope some of this is helpful. I think I'd better explain how XQuery + XSLT and PHP might work for a dynamic interface, so you can think about how the pieces can fit together.

You know how to do pull processing with XQuery: You can quickly write a short XQuery script to pull out all the minutes in their TEI files in the teiCorpus of Lega. (And you can quickly customize that to get all the minutes of a certain date or that contain something inside a given element, and to retrieve all the English translations, etc.)

Here's the new idea. Imagine a web form in which you invite viewers to enter what they want to read. Say it's English translations only, for the year 1922. You might imagine writing that in HTML, but we'd use PHP for this, which contains HTML elements, but also special PHP elements that hold search parameters and configurations for securely speaking to eXist.

We fine-tune some XQueries to accept input and fire when triggered by PHP--they return results in the form of XML or HTML--and they don't have to be full HTML pages: they can be pieces of HTML that you tuck into a larger already-constructed page (negotiated by PHP).
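The retrieve-by-parameter step can be sketched without eXist-db or PHP. The Python below is only a stand-in for what the XQuery would do: take a year and a version, and return a piece of HTML instead of a full page. The `corpus` and `meeting` element names, the `when` attribute, and the sample text are all invented for the example; the real Lega TEI is namespaced and structured differently.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, simplified corpus; not the project's actual TEI structure.
SAMPLE = """
<corpus>
  <meeting when="1922-03-05">
    <div type="transcription">Verbale di marzo.</div>
    <div type="translation">March minutes.</div>
  </meeting>
  <meeting when="1923-01-14">
    <div type="transcription">Verbale di gennaio.</div>
    <div type="translation">January minutes.</div>
  </meeting>
</corpus>
"""


def fragment(corpus_xml, year, version):
    """Return HTML sections for one version of every meeting in the given year."""
    root = ET.fromstring(corpus_xml)
    pieces = []
    for meeting in root.iter("meeting"):
        when = meeting.get("when", "")
        if not when.startswith(year + "-"):
            continue
        for div in meeting.findall("div"):
            if div.get("type") == version:
                text = "".join(div.itertext()).strip()
                pieces.append(
                    '<section id="{}"><div class="{}">{}</div></section>'.format(
                        escape(when), escape(version), escape(text)
                    )
                )
    return "\n".join(pieces)


print(fragment(SAMPLE, "1922", "translation"))
```

The returned string is a fragment, so whatever serves the page (PHP in the plan above) can tuck it into an already-constructed layout.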

That is a sort of overview of how PHP can work in designing a dynamic interface where you can retrieve several results or just a few, depending on what the visitor wants to see. CSS styles the whole thing.

zme1 commented 6 years ago

@ebeshero I didn't even consider your last proposal until now, and I think it's a great idea. I think the more selection we can open up for the reading of the minute volume, the better.

ebeshero commented 6 years ago

@zme1 Great! I'd work on this in stages: Start with some static output and play with the CSS to look at a bunch of layouts. (Think about the side-by-side and vertical layout, etc). Make a mockup. Once you're happy with a layout, let's think about the selection process and how you want to disclose views of the minutes.

You should make the TEI file fully available, by the way, by link from the website--that's responsible sharing of DH data and those in the know can query and review it as they wish. The selective reading view thing can hide things as well as show them. Maybe there ought to be a full reading view of everything consecutively, and a searchable view for people who want to pull related material together?

zme1 commented 6 years ago

@ebeshero I'm working on the CSS right now, and I'll let you know when I push or if I have any problems!

zme1 commented 6 years ago

@ebeshero I just pushed my new minutes.css file, along with updated output for 1925.

zme1 commented 6 years ago

Surprisingly enough, the translation and transcription seem to align reasonably well without me intervening on the default position of the relative div elements...