Taxonomic names using `<named-content>` need spaces when more than one term

Daniel-Mietchen commented 10 years ago

Example: in https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_land_crab_Johngarthia_planata_%28Stimpson_1860%29_%28Crustacea_Brachyura_Gecarcinidae%29_colonizes_human-dominated_ecosystems_in_the_continental_main , the header template states "Johngarthiaplanata".

Should be "Johngarthia planata". See the article on PMC: PMC4092324

wrought commented 10 years ago

hmm, this is strange, any thoughts on this issue?

wrought commented 10 years ago

The issue doesn't seem to be <italic> per se: its XSLT is written the same way as for <bold>, and both successfully preserve spaces. Instead, there are already no spaces separating the taxonomic elements used to wrap the animal name, as you can see here in the PMC NXML:

<italic><named-content content-type="taxon-name"><named-content content-type="genus">Johngarthia</named-content>
<named-content content-type="species">planata</named-content></named-content></italic>

The original article XML is written thusly:

<tp:taxon-name-part taxon-name-part-type="genus">Johngarthia</tp:taxon-name-part>
<tp:taxon-name-part taxon-name-part-type="species">planata</tp:taxon-name-part>

from http://bdj.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=1161

Perhaps it would be prudent to handle the <named-content> tag explicitly, specifically when the content-type attribute has a taxon-name or species value. Right now, it seems these named-content tags are simply removed, but when there are several in a row, the boundary between them could be replaced by a space character instead.
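
To make the suggestion concrete, here is a minimal sketch of the idea in Python rather than in the project's XSLT. It assumes the NXML fragment quoted above; `taxon_name_text` is a hypothetical helper, not something that exists in the converter. It joins the text of each nested named-content part with a single space instead of dropping the tags silently.

```python
import xml.etree.ElementTree as ET

# The NXML fragment quoted above, including the newline between the two parts.
NXML = (
    '<italic><named-content content-type="taxon-name">'
    '<named-content content-type="genus">Johngarthia</named-content>\n'
    '<named-content content-type="species">planata</named-content>'
    '</named-content></italic>'
)


def taxon_name_text(elem):
    """Hypothetical helper: join each nested named-content part with one space."""
    parts = []
    for child in elem.findall("named-content"):
        # itertext() covers the child's own text, not the whitespace after it.
        text = "".join(child.itertext()).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


root = ET.fromstring(NXML)
taxon = root.find("named-content[@content-type='taxon-name']")
print(taxon_name_text(taxon))
```

The same rule could presumably be written as an XSLT template that emits a space before any named-content part that has a preceding sibling, but that is untested against the actual stylesheet.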

Klortho commented 10 years ago

FWIW, it shows up in Entrez search results with the same problem: http://www.ncbi.nlm.nih.gov/pmc/?term=PMC4092324.

I'll take a look to see if I can fix it.

Klortho commented 10 years ago

@wrought , the NXML looks okay to me. It has a newline, which should be preserved/normalized, I think. In other words, I think newlines should be converted into spaces in the wikitext.

Klortho commented 10 years ago

The problem is this: <xsl:strip-space elements="*"/>, which strips every whitespace-only text node in the input, including the newline between the genus and species elements, in a pretty draconian way. Unfortunately, I don't see an easy solution ... changing named-content to preserve whitespace breaks the wikitext.
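
The effect can be reproduced outside XSLT. The following is a rough Python simulation, not the stylesheet itself: `strip_whitespace_only_nodes` is a hypothetical function that imitates what strip-space does to whitespace-only text nodes, and `normalize_space` shows the alternative of collapsing the newline into a single space.

```python
import xml.etree.ElementTree as ET

NXML = (
    '<italic><named-content content-type="taxon-name">'
    '<named-content content-type="genus">Johngarthia</named-content>\n'
    '<named-content content-type="species">planata</named-content>'
    '</named-content></italic>'
)


def strip_whitespace_only_nodes(elem):
    """Imitate strip-space on every element: drop whitespace-only text nodes."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return elem


def flatten(elem):
    """Concatenate all text in document order, ignoring the tags."""
    return "".join(elem.itertext())


def normalize_space(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


stripped = flatten(strip_whitespace_only_nodes(ET.fromstring(NXML)))
normalized = normalize_space(flatten(ET.fromstring(NXML)))
print(stripped)    # Johngarthiaplanata
print(normalized)  # Johngarthia planata
```

Once the newline between the two parts is gone, nothing downstream can tell that a word boundary was ever there, which is why normalizing has to happen instead of stripping, not after it.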

Klortho commented 10 years ago

Okay, I think I fixed this one special case, but I'm very worried that I broke something else.

Whitespace handling is one of the truly hard problems in document processing, and right now what we have is a pretty bad hack job.

We really need a test framework, where we can put some regression tests in, so we can be sure we're not breaking other things as we work on this.
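
As a starting point, a regression suite could be as small as a table of input fragments and expected output. The sketch below uses Python's `unittest`; `flatten_text` is a hypothetical stand-in for the real conversion (which is XSLT and would have to be invoked through an XSLT processor), and the cases are illustrative only.

```python
import unittest
import xml.etree.ElementTree as ET


def flatten_text(nxml):
    """Hypothetical stand-in for the converter: flatten and normalize whitespace."""
    root = ET.fromstring(nxml)
    return " ".join("".join(root.itertext()).split())


# (input fragment, expected plain text); add one pair per bug that gets fixed.
CASES = [
    (
        '<italic><named-content content-type="taxon-name">'
        '<named-content content-type="genus">Johngarthia</named-content>\n'
        '<named-content content-type="species">planata</named-content>'
        '</named-content></italic>',
        "Johngarthia planata",
    ),
    ("<p>The <bold>land</bold> <italic>crab</italic></p>", "The land crab"),
]


class WhitespaceRegressionTest(unittest.TestCase):
    def test_cases(self):
        for nxml, expected in CASES:
            with self.subTest(nxml=nxml):
                self.assertEqual(flatten_text(nxml), expected)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(WhitespaceRegressionTest)
result = unittest.TextTestRunner().run(suite)
```

Keeping the cases in a plain list means each whitespace fix can come with one new pair, so a later change that reintroduces "Johngarthiaplanata" fails immediately.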

wpoa / JATS-to-Mediawiki, issue #24