Daniel-Mietchen closed this issue 10 years ago
hmm, this is strange, any thoughts on this issue?
The issue seems not to be <italic> per se, for which the XSLT is written in the same way as <bold> (both are successful in preserving spaces), but rather that there are already no spaces separating the taxonomic elements used to wrap the animal name, as you can see here in the PMC NXML:
<italic><named-content content-type="taxon-name"><named-content content-type="genus">Johngarthia</named-content>
<named-content content-type="species">planata</named-content></named-content></italic>
The original article XML is written thusly:
<tp:taxon-name-part taxon-name-part-type="genus">Johngarthia</tp:taxon-name-part>
<tp:taxon-name-part taxon-name-part-type="species">planata</tp:taxon-name-part>
from http://bdj.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=1161
Perhaps it is prudent to handle the <named-content> tag, specifically when a taxon-name or species value is used for the content-type="" attribute. Right now, it seems these named-content tags are removed, but if there are multiple tags, then they could be replaced by a space character instead.
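To illustrate the idea of joining adjacent named-content parts with a space, here is a minimal Python sketch using only the standard library. It is not the project's XSLT; the function name taxon_name_text is hypothetical, and the fragment is the PMC NXML quoted above.

```python
import xml.etree.ElementTree as ET

# The NXML fragment quoted in this thread (genus and species separated by a newline).
NXML = (
    '<italic><named-content content-type="taxon-name">'
    '<named-content content-type="genus">Johngarthia</named-content>\n'
    '<named-content content-type="species">planata</named-content>'
    '</named-content></italic>'
)


def taxon_name_text(taxon):
    """Join the text of each child named-content part with a single space.

    Hypothetical helper: it ignores whatever whitespace sits between the
    parts and always emits exactly one space between non-empty parts.
    """
    parts = [
        "".join(child.itertext()).strip()
        for child in taxon.findall("named-content")
    ]
    return " ".join(p for p in parts if p)


root = ET.fromstring(NXML)
taxon = root.find("named-content[@content-type='taxon-name']")
print(taxon_name_text(taxon))  # Johngarthia planata
```

The same rule would also cover input where the parts are not separated by any whitespace at all, which is the case a pure whitespace-normalizing fix cannot handle.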
FWIW, it shows up in Entrez search results with the same problem: http://www.ncbi.nlm.nih.gov/pmc/?term=PMC4092324.
I'll take a look to see if I can fix it.
@wrought, the NXML looks okay to me. It has a newline between the genus and species elements, which should be preserved/normalized, I think. In other words, I think newlines should be converted into spaces in the wikitext.
The problem is this: <xsl:strip-space elements="*"/>, which causes whitespace-only text nodes inside every element of the input to be stripped in a pretty draconian way. Unfortunately, I don't see an easy solution ... changing named-content to preserve breaks the wikitext.
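A rough Python emulation of the effect, for anyone who wants to reproduce it outside the stylesheet. This is only a sketch: it approximates xsl:strip-space by dropping whitespace-only text nodes from an ElementTree, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The NXML fragment from this thread; note the newline between genus and species.
NXML = (
    '<italic><named-content content-type="taxon-name">'
    '<named-content content-type="genus">Johngarthia</named-content>\n'
    '<named-content content-type="species">planata</named-content>'
    '</named-content></italic>'
)


def strip_whitespace_only_nodes(elem):
    """Drop whitespace-only text nodes, approximating strip-space on all elements."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return elem


def flatten(elem):
    """Concatenate all text in document order, with no normalization."""
    return "".join(elem.itertext())


def flatten_normalized(elem):
    """Concatenate all text and collapse each whitespace run to one space."""
    return " ".join(flatten(elem).split())


stripped = strip_whitespace_only_nodes(ET.fromstring(NXML))
preserved = ET.fromstring(NXML)

print(flatten(stripped))              # Johngarthiaplanata
print(flatten_normalized(preserved))  # Johngarthia planata
```

The first result matches the broken header; the second is what normalizing the newline to a space would give.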
Okay, I think I fixed this one special case, but I'm very worried that I broke something else.
Whitespace handling is one of the truly hard problems in document processing, and right now what we have is a pretty bad hack job.
We really need a test framework, where we can put some regression tests in, so we can be sure we're not breaking other things as we work on this.
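As a starting point, a table-driven regression check could look something like the sketch below. Everything here is an assumption for illustration: inline_text is a hypothetical stand-in for the real converter (which is XSLT, not Python), and the cases are made up from the examples in this thread.

```python
import xml.etree.ElementTree as ET


def inline_text(xml_fragment):
    """Hypothetical stand-in for the converter: flatten text, normalize whitespace."""
    text = "".join(ET.fromstring(xml_fragment).itertext())
    return " ".join(text.split())


# (description, input fragment, expected output)
REGRESSION_CASES = [
    (
        "taxon name parts separated by a newline",
        '<italic><named-content content-type="taxon-name">'
        '<named-content content-type="genus">Johngarthia</named-content>\n'
        '<named-content content-type="species">planata</named-content>'
        '</named-content></italic>',
        "Johngarthia planata",
    ),
    ("bold keeps inner spaces", "<bold>land crab</bold>", "land crab"),
    (
        "space after italic is kept",
        "<p><italic>Johngarthia</italic> colonizes</p>",
        "Johngarthia colonizes",
    ),
]


def run_regressions(convert=inline_text):
    """Run every case through convert and return a list of failures."""
    failures = []
    for name, source, expected in REGRESSION_CASES:
        actual = convert(source)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures


print(run_regressions())  # [] when every case passes
```

Each whitespace bug we fix would add one row to the table, so a later change that reintroduces it shows up as a failure.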
Example: in https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_land_crab_Johngarthia_planata_%28Stimpson_1860%29_%28Crustacea_Brachyura_Gecarcinidae%29_colonizes_human-dominated_ecosystems_in_the_continental_main , the header template states "Johngarthiaplanata".
It should be "Johngarthia planata". See the article on PMC: PMC4092324