References export transformation with links

wendellpiez / JATSKit

oXygen XML Editor framework for NISO JATS 1.1 / NLM BITS 2.0

Apache License 2.0

References export transformation with links #45

Open futurion opened 4 years ago

futurion commented 4 years ago

When using "JATS/BITS single HTML page preview" to generate HTML preview from XML, the references are generated ok, but there are a few small issues. I attached the example (references_input.xml) with 7 references.

a) The first 4 references in the example are each tagged on a single line, whereas references 5-7 have each tag on a separate line. Please run the transformation and you'll see that the first 4 produce the right result, while 5-7 have some issues. The issues are gaps (extra spaces) between some elements. For instance, reference 5 produces gaps here: May; 138( 5): 585– 8. The result should have no gaps: May; 138(5): 585–8. I also spotted extra spaces in some other places where characters like "-(),;" appear.

b) Is there an option to fix/upgrade "jatskit-html.xsl" (or maybe even better, create two new .xsl files) so that the first references output is "naked", without any links (no DOI, no PMID), while the second output has both DOI and PMID links? I attached a Word document (references_export.docx) showing what the format without links should look like and what an improved HTML export with URL links included could look like. That is, the first format should not have DOI and PMID links at all, while the second format should include both links: one as a direct URL, the other also as a URL but shown in a format like PMID: 12345678. This would then be useful for creating two different reference lists (true Vancouver style) for interactive PDF files and for publishing online in various places.

references.zip

raducoravu commented 4 years ago

For problem (a) I'm attaching the reply I gave to @futurion via email:

Let's take this case into account:

      <issue>5</issue>):
       <fpage>585</fpage>–
       <lpage>8</lpage>

In the XML document there are indeed those line breaks and spaces between the consecutive inline elements. This means that the HTML document will have the text looking like this:

 5): 585– 8

so those extra spaces between the initial elements in the XML document will translate to a single space in the generated HTML output, leading to the rendering problems. In my opinion the elements in the XML document should have been written on a single line, without any space between them, like:

 <issue>5</issue>):<fpage>585</fpage>–<lpage>8</lpage>

So how did the XML end up this way? Did you manually add line breaks between the elements?

In my opinion this is a problem in how the XML is written: inline elements, that is, elements which flow within a paragraph, must have no whitespace between them if you do not want spaces between them in the HTML output.

For example, let's look at a similar situation in HTML. You have this:

 <b>bold</b><i>italic</i>

because the tags are next to each other, without any new line or space between them, they are considered as being one word in the web browser.

If you had even a single space between them (or a line break):

<b>bold</b> <i>italic</i>

then they would be considered two words.
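The effect described above can be reproduced outside the editor. The following is a minimal Python sketch, not the JATSKit transformation itself: it assumes a made-up citation fragment modelled on reference 5, and it imitates a browser by collapsing every whitespace run to a single space. The function name `rendered_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up fragment modelled on reference 5: each tag on its own line.
MULTI_LINE = """<mixed-citation><month>May</month>; <volume>138</volume>(
  <issue>5</issue>):
  <fpage>585</fpage>–
  <lpage>8</lpage>.</mixed-citation>"""

# The same content written inline, with only the intended spaces.
INLINE = (
    "<mixed-citation><month>May</month>; <volume>138</volume>("
    "<issue>5</issue>): <fpage>585</fpage>–<lpage>8</lpage>.</mixed-citation>"
)


def rendered_text(xml_source):
    """Strip all tags, then collapse whitespace runs the way HTML rendering does."""
    root = ET.fromstring(xml_source)
    raw = "".join(root.itertext())
    return " ".join(raw.split())


print(rendered_text(MULTI_LINE))  # May; 138( 5): 585– 8.
print(rendered_text(INLINE))      # May; 138(5): 585–8.
```

The line breaks in the first variant survive as single spaces, which is exactly the gap pattern reported for references 5-7.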

futurion commented 4 years ago

This is indeed the answer I expected.

Both formats (refs 1-4 and refs 5-7) are direct exports from EdiFix; one is newer (1-4) and one is older (5-7). Apparently they also discovered that content using <mixed-citation>, which includes extra untagged text, should be exported inline because of the restructuring issues mentioned above, so they have fixed the output now.

However, there are still plenty of other XML beautifiers out there that will create the exact same structure (each tag on a separate line), so I'm wondering if there's an option to make the transforming .xslt a bit smarter, so that it would simply eliminate any extra spaces where tags like year, month, volume, issue, fpage, lpage are used?

Besides, I also tried the other format, <element-citation>, which has no untagged strings at all and has every tag on a separate line. With this format, the transformation would need to add spaces and punctuation wherever needed. Let's take this as an example:

 <element-citation publication-type="journal" publication-format="print">
   <name>
     <surname>Diaz-Cruz</surname>
     <given-names>ES</given-names>
   </name>
   <name>
     <surname>Shapiro</surname>
     <given-names>CL</given-names>
   </name>
   <name>
     <surname>Brueggemeier</surname>
     <given-names>RW</given-names>
   </name>
   <article-title>Cyclooxygenase inhibitors suppress aromatase expression and activity in breast cancer cells</article-title>
   <source>J Clin Endocrinol Metab</source>
   <year>2005</year>
   <month>May</month>
   <volume>90</volume>
   <issue>5</issue>
   <fpage>2563</fpage>
   <lpage>2570</lpage>
   <comment>Table 2, Aromatase activity and expression in cell lines; p. 2565</comment>
 </element-citation>

The result here is even worse: there are no spaces and none of the needed punctuation (,:-;). So the XSLT apparently doesn't work as it should in this case, as spaces and punctuation are not added, even though each element is on a new line.

ref_element_citation.zip

wendellpiez commented 4 years ago

I am afraid that the number of cases where whitespace should be handled in special ways is too large -- and exactly what those special ways might be is too complex -- for this to be actionable. Without a complete specification (and I am afraid even with one), to automate whitespace handling would require guessing, and those guesses will be 100% correct only rarely.

The solution @raducoravu suggests - ensuring the whitespace in your input adequately represents the whitespace you want on output -- is much better. To support this, one can imagine a pre-processing XSLT that would implement some set of rules to correct whitespace (again, to some definition of "correct"). This at least would provide a way to repair problems that it failed to handle or that it introduced into the data.
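To make the idea of a rule-based pre-process concrete, here is a small Python sketch (a stand-in for the pre-processing XSLT, not an existing JATSKit feature). The two rules are assumptions chosen only for illustration: collapse every whitespace run to one space, then drop a space that ends a text node right after an opening bracket, hyphen or en dash. The names `normalize_node_text` and `normalize_citation` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: a space left at the end of a text node directly after
# "(", "[", an en dash or a hyphen is treated as unwanted.
TRAILING_SPACE = re.compile(r"([(\[–\-]) $")

MULTI_LINE = """<mixed-citation><month>May</month>; <volume>138</volume>(
  <issue>5</issue>):
  <fpage>585</fpage>–
  <lpage>8</lpage>.</mixed-citation>"""


def normalize_node_text(text):
    """Apply the two illustrative rules to one text node."""
    if not text:
        return text
    text = re.sub(r"\s+", " ", text)
    return TRAILING_SPACE.sub(r"\1", text)


def normalize_citation(cit):
    """Rewrite text and tail of every element inside the citation, in place."""
    for el in cit.iter():
        el.text = normalize_node_text(el.text)
        if el is not cit:
            el.tail = normalize_node_text(el.tail)
    return cit


citation = normalize_citation(ET.fromstring(MULTI_LINE))
print("".join(citation.itertext()))  # May; 138(5): 585–8.
```

The rules happen to work for this sample; any other data may need different ones, which is the point made above about guessing.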

element-citation was designed specifically for the case where you have a transformation that does all punctuation for you. Its implementation can typically be tolerant of erroneous whitespace because it ignores all whitespace, instead providing its own. While this has (sometimes) been done, this proves to be a very expensive capability both to design and to maintain, mainly because both the list of citation types (journal article, monograph, book in a series, recorded audio, unpublished manuscript, email to author, public standard etc. etc.), and the list of potential formatting styles (e.g. MLA, Chicago, APA etc. -- all of which also have in-house 'flavors') are unbounded. The public XSLT library that provided me with a starting point for the JATSKit indeed had some of this capability, but due both to their limitations and to the maintenance overhead, no one ever used those XSLTs much (to my knowledge) if at all. Dig into the repository at NCBI (the US agency that spearheaded the development of JATS) for some of this: https://github.com/ncbi/JATSPreviewStylesheets. Note there is a "user guide" and "technical documentation" (in JATS, HTML and PDF) that describes this, as well as a directory where a couple of examples of XSLTs providing citation formatting (that might kinda-sorta work) can be found.
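As an illustration of a formatter that ignores input whitespace and supplies its own punctuation, here is a minimal Python sketch. It handles only a journal article with exactly the child elements from the sample posted earlier in this thread, and the output layout (a roughly Vancouver-like pattern with an unabbreviated page range) is an assumption made for the example, not a complete or authoritative style implementation. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<element-citation publication-type="journal" publication-format="print">
  <name>
    <surname>Diaz-Cruz</surname>
    <given-names>ES</given-names>
  </name>
  <name>
    <surname>Shapiro</surname>
    <given-names>CL</given-names>
  </name>
  <name>
    <surname>Brueggemeier</surname>
    <given-names>RW</given-names>
  </name>
  <article-title>Cyclooxygenase inhibitors suppress aromatase expression and activity in breast cancer cells</article-title>
  <source>J Clin Endocrinol Metab</source>
  <year>2005</year>
  <month>May</month>
  <volume>90</volume>
  <issue>5</issue>
  <fpage>2563</fpage>
  <lpage>2570</lpage>
  <comment>Table 2, Aromatase activity and expression in cell lines; p. 2565</comment>
</element-citation>
"""


def field(parent, tag):
    """Text of the first child <tag>, trimmed, or '' when the child is absent."""
    return (parent.findtext(tag) or "").strip()


def format_journal_citation(cit):
    """Build the citation string; whitespace between elements is never read."""
    authors = ", ".join(
        f"{field(n, 'surname')} {field(n, 'given-names')}" for n in cit.findall("name")
    )
    text = (
        f"{authors}. {field(cit, 'article-title')}. {field(cit, 'source')}. "
        f"{field(cit, 'year')} {field(cit, 'month')};"
        f"{field(cit, 'volume')}({field(cit, 'issue')}):"
        f"{field(cit, 'fpage')}–{field(cit, 'lpage')}."
    )
    comment = field(cit, "comment")
    if comment:
        text += f" {comment}."
    return text


print(format_journal_citation(ET.fromstring(SAMPLE.strip())))
```

Every other publication type, missing field or house style would need its own branch, which is where the design and maintenance cost described above comes from.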

So I agree with @raducoravu that the best solution is to use mixed-citation along with XML pre-processes and normalizations that respect your whitespace and do the right thing with it. (Apparently the EdiFix developers felt the same way, or they would not have bothered to improve their tool, thereby making it better than the tools that butcher your data on the incorrect assumption that whitespace never matters.) As mentioned, such a tool could take the form of an XSLT stylesheet -- although again I am not sure you would want a random developer to make up your whitespace-handling rules for you. 😈

Indeed, along those lines it seems to me that with the same level of effort, one could develop a Schematron that could work in the editor and detect unwanted whitespace, perhaps even providing a Quickfix for removing it with a button click -- assuming yet again that we knew what rules a developer should implement.

Proposal (b) seems to me to be more rewarding of the investment of effort. One clarification: @futurion would you rather have external "export" XSLT that would produce standalone citation lists with the formatting as appropriate? Or would this be better as a runtime switch in the basic preview HTML - i.e. show or hide the link targets in the citation - which could be presented in oXygen perhaps as two different Preview options?

Meanwhile thanks for your attention to JATSKit and your efforts to improve it.

In case it's any help thinking about a specification, the elements permitted inside mixed-citation are described here (depending on which variety of JATS you are using):

Presumably, unwanted whitespace (i.e. requiring collapsing to an empty string) could appear between any pair of the elements listed (although I can't tell you which ones) -- or between them and adjacent punctuation.

futurion commented 4 years ago

First of all, thank you for this explanation, I am actually well aware of the reconstructing issues when handling whitespaces, so I'll divide this into three sections:

a) I feel mixed-citation is in this case easier to use than element-citation, as it already contains all the untagged/unstructured text, so there's no need (as you suggested) to build any kind of extra parsers for whitespace or punctuation. The only limitation here is that with mixed-citation you are more or less unable to produce different citation styles (e.g. APA, BibTeX, etc.). However, being limited to Vancouver style is actually completely OK in this case, as the style is used actively at PMC and all other online journals and libraries, including OJS lenses, etc.

b) So if we stick to the mixed-citation style (EdiFix inline sample: each citation on one single line with all punctuation and whitespace), then I have two questions.

b1) Is it in this case entirely sufficient to build the simplest possible XSLT transformation, one that only removes all tags and outputs the plain text (with and without links)? This is a few lines of code I guess, but is it enough? Is it possible to structure everything needed inside the mixed-citation format so that you can simply remove all tags and get a 100% correct Vancouver-style citation as plain text or HTML output? How do other parsers work in this case (the one at PMC, for example)? Do they also just remove tags, or do they have some kind of improved citation parsers?

b2) When you get a single-line/inline references XML file (i.e. the EdiFix sample) and want to edit it and fix some references (you usually have to do that), it's quite hard or almost impossible to edit it as the original inline (one reference, one line) file. Therefore XML beautifiers are used, which split the tags into a user-friendly format (multiple lines: each tag on a new line), so it's then easier to edit manually. My question here remains the same. Is it possible to keep all whitespace where it initially was in the inline format even if you split the tags across multiple lines? If keeping whitespace and all needed punctuation after splitting the inline format into user-friendly multiple lines is possible, then I believe the same simple XSLT transformation could be used for both formats (inline and multi-line)?

c) So my proposition would be to build (or just improve existing) simple xslt transformation as you suggested which would have a switch (with and without links) and which would be able to build same citation format (Vancouver in this case) no matter if all tags in original xml file are structured as inline or multiline. In my case, the standalone xslt transformation for references only would be more useful as we usually have references in separate xml files, but again, if other oxygen users would like to have everything inside existing HTML preview transformation, it's also fine with me.
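A rough Python sketch of the "switch" idea, written as a stand-in for the XSLT and not as part of JATSKit: the citation text is passed through unchanged, and `<pub-id>` elements are either dropped or turned into links. The sample citation is made up, and the link layout (DOI shown as its URL, PMID shown as "PMID: n", both preceded by a space) and the two URL patterns are assumptions for illustration. The function names are hypothetical.

```python
from html import escape
import xml.etree.ElementTree as ET

# Made-up inline citation with a DOI and a PMID at the end.
SAMPLE = (
    "<mixed-citation>Doe J. Example title. J Ex. <year>2005</year>;"
    "<volume>90</volume>(<issue>5</issue>):<fpage>10</fpage>–<lpage>12</lpage>."
    '<pub-id pub-id-type="doi">10.1000/xyz123</pub-id>'
    '<pub-id pub-id-type="pmid">12345678</pub-id></mixed-citation>'
)


def link_for(pub_id):
    """HTML link for one <pub-id>; identifier types other than doi/pmid stay plain text."""
    value = (pub_id.text or "").strip()
    kind = pub_id.get("pub-id-type")
    if kind == "doi":
        url = "https://doi.org/" + value
        label = url
    elif kind == "pmid":
        url = f"https://pubmed.ncbi.nlm.nih.gov/{value}/"
        label = "PMID: " + value
    else:
        return escape(value)
    return f' <a href="{escape(url)}">{escape(label)}</a>'


def render(cit, with_links):
    """Strip tags from a mixed-citation; emit pub-id links only when with_links is set."""
    parts = [escape(cit.text or "")]
    for child in cit:
        if child.tag == "pub-id":
            if with_links:
                parts.append(link_for(child))
        else:
            parts.append(escape("".join(child.itertext())))
        parts.append(escape(child.tail or ""))
    return "".join(parts)


citation = ET.fromstring(SAMPLE)
print(render(citation, with_links=False))
print(render(citation, with_links=True))
```

Because the text is passed through untouched, the result is only as good as the whitespace and punctuation already present in the input.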

wendellpiez commented 4 years ago

Tomas,

Yes, exactly. mixed-citation can in some possible world be reformatted from Vancouver to APA by stripping it of text leaving only elements, then treating it as APA. I don't believe there is enough benefit for the effort for anyone to do this.

More or less the only exception to the rule (or the only one that I can think of) that one can drop all tagging from mixed-citation and get a nicely formatted entry (your question b1) -- providing the text in line is correct to the format you want to see -- is when you have a name inside. It may have some structure that has to be construed / reordered. For this reason among others many shops are using string-name instead, as I understand it. (JATS-List would be an excellent place for these questions btw.) This design feature IMO is one of the best things JATS offers. And yes, renderers do take advantage of it.

Indeed probably the most consistent pattern (if any) would be a fairly rigorous "validation check" on the formatting and markup of citations to address local requirements. Schematron might be commonly used for example. One advantage to developing Schematron rules (essentially very close validation rules such as "if <fpage> is followed by <lpage> there must be an en-dash, not a hyphen, no extra space") is that it can be targeted at problems you actually see, without worrying about the general case, as a transformation rule must (in principle) -- it's not expected to be 'lights out' but rather an aid to a semi-automated step that permits human judgement to intervene. This does mean you have to pick your format and forsake (at least for the time being) an aspiration to be able to cast, say, from MLA to APA. But this is a feature not really needed much - and only comes when you have reasonably good control of your bibliographies (better than most) in any case.
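The kind of narrowly targeted rule described here can be sketched in a few lines. The following Python function is only a stand-in for a Schematron assertion, assuming the single rule "the text between an fpage and the lpage that directly follows it must be exactly one en dash"; it reports problems and changes nothing. The name `check_page_ranges` is hypothetical.

```python
import xml.etree.ElementTree as ET

EN_DASH = "\u2013"


def check_page_ranges(cit):
    """Return one message per fpage/lpage pair not separated by exactly an en dash."""
    problems = []
    children = list(cit)
    for current, following in zip(children, children[1:]):
        if current.tag == "fpage" and following.tag == "lpage":
            between = current.tail or ""
            if between != EN_DASH:
                problems.append(
                    f"between fpage {current.text!r} and lpage {following.text!r}: "
                    f"expected an en dash, found {between!r}"
                )
    return problems


good = ET.fromstring("<mixed-citation><fpage>585</fpage>–<lpage>8</lpage></mixed-citation>")
hyphen = ET.fromstring("<mixed-citation><fpage>585</fpage>-<lpage>8</lpage></mixed-citation>")
spaced = ET.fromstring("<mixed-citation><fpage>585</fpage>–\n  <lpage>8</lpage></mixed-citation>")

for name, cit in [("good", good), ("hyphen", hyphen), ("spaced", spaced)]:
    print(name, check_page_ranges(cit))
```

A check like this leaves the decision to a person, which fits the semi-automated workflow described above.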

In addition to having this feature, indeed, I think it turns out mixed-citation works as well as element-citation with bibliographic databases (Zotero and the like) or at least it's not any worse to map back and forth, depending on the case. Indeed better, since it can capture the punctuation provided by the database renderer (which may indeed be switchable between citation formats for export). You are probably aware of Zotero and https://citationstyles.org/ for example, as proposing generalized solutions to parts of the bibliographic management and exchange problems.

As I think @raducoravu mentioned, however, at least if I understand it, your proposal b2 would not be feasible - XML is itself the encoding, there is no encoding behind it that could distinguish between an LF line feed added because you wanted to see a tag on a new line, and whitespace that it should not display. Possibly half-steps are possible (a transformation that would remove only LF characters for example, no other whitespace) but it would not be general-use (most people don't have that problem - they just don't use tools that change their data) - it would be specific to your workflow.

But you wouldn't need an XSLT if the requirement is simply "remove LF from mixed content". That would be straightforward to code up as a macro in oXygen, for example - a simple search and replace in an element context, which would remove LF and leave everything else alone (thus presumably undoing the damage caused by EdiFix). The oXygen help list could possibly provide assistance in showing you how to do that.
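For comparison with an editor macro, the same "remove LF from mixed content" operation can be sketched in Python. One assumption goes beyond what is stated above: the indentation that a beautifier adds after each line break is removed together with the LF. All other whitespace is left alone. The name `remove_line_breaks` is hypothetical and the sample fragment is made up.

```python
import re
import xml.etree.ElementTree as ET

# A line feed plus any indentation (spaces or tabs) that follows it.
LINE_BREAK = re.compile(r"\n[ \t]*")

MULTI_LINE = """<mixed-citation><month>May</month>; <volume>138</volume>(
  <issue>5</issue>):
  <fpage>585</fpage>–
  <lpage>8</lpage>.</mixed-citation>"""


def remove_line_breaks(cit):
    """Delete line breaks (and following indentation) inside the citation, in place."""
    for el in cit.iter():
        if el.text:
            el.text = LINE_BREAK.sub("", el.text)
        if el is not cit and el.tail:
            el.tail = LINE_BREAK.sub("", el.tail)
    return cit


citation = remove_line_breaks(ET.fromstring(MULTI_LINE))
print("".join(citation.itertext()))  # May; 138(5):585–8.
```

Note that the output has no space after the colon: where a line break stood in for a space that was wanted, nothing in the data says so, which is the ambiguity described above.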

The display of links question is separate, isn't it?

futurion commented 4 years ago

@wendellpiez and @raducoravu While struggling to get the right parser for mixed-citation references, I actually stumbled upon a question which in my opinion needs to be answered prior to making any parsers.

The matter here is actually written a few posts above where Radu exposed an example by comparing two variants:

<issue>5</issue>):
<fpage>585</fpage>–
<lpage>8</lpage>

and

<issue>5</issue>):<fpage>585</fpage>–<lpage>8</lpage>

I searched around to get some mixed-citation specifications, but except that straight-forward definition with tagged and non-tagged elements, I didn't find any real answer to the most crucial question.

Is there any specification which explicitly declares for mixed-citation references to be formatted as "single-line" or "inline"?

If there are no such specifications, then in my opinion, all punctuation is useless. Why? Simply because you always keep the punctuation, but you lose the white-spaces, which are part of the punctuation in this case, as soon as you split the XML elements across multiple lines. So there are two options:

a) If there is such a rule and mixed-citation references are actually represented as "single-line" strings with all tagged, non-tagged and white-space content, then it's surely enough to just strip all XML tags, and you get back a 100% correct result.

b) If a rule like this doesn't exist, then you can of course still use punctuation elements, but you have to write a smart parser which adds extra white-spaces where needed, but as Mr. Piez said, that list is probably enormous.

So, I'm asking myself the following. How do others approach these mixed-citation transformations from JATS XML to clear text? Do they also use special (clever) parsers?

I've tested some (the PMC previewer, for example) and it also has the exact same issue of retaining some spaces where there should be no spaces at all. Of course, these spaces become visible only when the XML structure is split into separate lines.

But on the other hand, I've seen many mixed-citation examples online where strings are initially separated into multiple lines (name, given-name, surname) and parsers cleverly ignore these, so there should be some logic behind, not just removing xml tags.

Lastly, if we talk about element-citation: is this citation style in any way different from the other one? Is it simply enough to remove all punctuation from mixed-citation and you have element-citation instantly? What is the benefit? Is it maybe better to use element-citation for such cases, so it's not "your fault" when strings are not properly distributed (single-line, multi-line), because element-citation always has to be parsed and punctuation added by the parser, not by the author?

I don't know; as I said, I have more questions than answers. But again, the most important question here is of course whether there is any kind of explicit declaration that XML citations should be written as single-line or multi-line elements.

wendellpiez commented 4 years ago

@futurion I suggest you subscribe to JATS-List (if you are not already subscribed) and post your question there. Its readers and contributors have the experience and perspective you need.

My 0.02: I am guessing the consensus view would be to (a) use mixed-citation (not element-citation), (b) make sure your content -- including space and LF characters -- is as you want (both for maintenance and representation), and (c) avoid tools that irreversibly change the whitespace, requiring hand intervention to correct it back.

Either a reversible tool, or a citation-styler (i.e., producing the representation you actually want) would serve to mitigate those constraints in operation. However, both of those would be too sensitive to your local requirements to be much use across all JATS. It's possible that some publishers or service providers do manage to develop such tooling -- and might even be willing to share.

But don't take my word for it, ask the list: its readers include not only users and developers but also the designers and maintainers of the format -- as well as community activists who help to design "consensus patterns" for JATS usage. In particular they would be able to give you first-hand information about what they do.

Small tip: I'd avoid the word 'parser' for what you are describing: it is more than what we usually call a parser. "Processor" would be a good (general) word, or "transformation". Indeed this is close to the essence of the problem: a parser to be conformant and complete must not change or discard information (only how it is represented), therefore supporting reversible operations (or at least that is usually the assumption), whereas a transformation can go just one way. A transformation engine such as Saxon or oXygen's internals may be (is usually) built on parser technology to handle XML syntax; but here we are talking about more than syntax since you want to respect a semantic* distinction between, for example, <vol> and <lpage>.

http://www.mulberrytech.com/JATS/JATS-List/index.html

* What constitutes "information" in this context is a fine question, but that is what the W3C XML Recommendation is for. Interestingly, even the Rec does not use the word "parser" for any piece of software that processes XML.

FWIW the Rec does say if you put xml:space='preserve' on any element, a conformant XML processor is required to pass its whitespace through without changing it. But I can't say without researching it whether that would work with the current JATS schemas.