wpoa / JATS-to-Mediawiki

A PubMed Central to MediaWiki converter

Apostrophe in italics turns into bold #47

Open Daniel-Mietchen opened 10 years ago

Daniel-Mietchen commented 10 years ago

Search for "homies" in both the Wikisource import at https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_Biodiversity_of_the_Mediterranean_Sea_Estimates_Patterns_and_Threats&oldid=5034390 and the original article at http://dx.doi.org/10.1371/journal.pone.0011842

Not sure we can fix this algorithmically, though - it is encoded in a really strange fashion:

<italic>Prud</italic>'<italic>homies</italic>

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC2914016

Klortho commented 10 years ago

That really looks like a data error, in my estimation. Why would they italicize the word, but leave the apostrophe inside the word non-italicized? Anyway, I was able to make this work with the following wikitext:

Here's an italic word with a non-italic apostrophe:  ''Prud''&#39;''homies''

which suggests that the solution is to always escape apostrophes as numeric character references. But that makes the wikitext pretty ugly, unfortunately. On the other hand, I don't see any other way to robustly differentiate apostrophes-as-markup vs apostrophes-as-content.
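To make the idea concrete, here is a minimal Python sketch of the "always escape" approach. It is not the converter's own code: the function names (`escape_apostrophes`, `jats_inline_to_wikitext`, `convert_fragment`) and the wrapping `<p>` element are assumptions made for illustration, and it only handles `<italic>` and `<bold>`. Every straight apostrophe found in text content is emitted as `&#39;`, so the only literal apostrophes left in the output are the ones the converter itself writes as markup.

```python
import xml.etree.ElementTree as ET


def escape_apostrophes(text):
    """Replace content apostrophes with a numeric character reference."""
    # text may be None when an element has no text or tail.
    return text.replace("'", "&#39;") if text else ""


def jats_inline_to_wikitext(elem):
    """Convert the inline content of a JATS element to wikitext.

    Hypothetical helper: handles only <italic> and <bold>; any other
    element is flattened to its (escaped) text content.
    """
    parts = [escape_apostrophes(elem.text)]
    for child in elem:
        inner = jats_inline_to_wikitext(child)
        if child.tag == "italic":
            parts.append("''" + inner + "''")    # markup apostrophes
        elif child.tag == "bold":
            parts.append("'''" + inner + "'''")  # markup apostrophes
        else:
            parts.append(inner)
        # Text that follows the child element is content, so escape it too.
        parts.append(escape_apostrophes(child.tail))
    return "".join(parts)


def convert_fragment(xml_fragment):
    """Wrap a fragment in a dummy <p> so it parses, then convert it."""
    root = ET.fromstring("<p>" + xml_fragment + "</p>")
    return jats_inline_to_wikitext(root)


print(convert_fragment("<italic>Prud</italic>'<italic>homies</italic>"))
```

For the fragment from this issue, the sketch produces the same wikitext as the working example above. The cost is the one already noted: every content apostrophe in the article becomes `&#39;`, including ones that would have rendered fine unescaped.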

What do you think?