wpoa / JATS-to-Mediawiki

A PubMed Central to MediaWiki converter

what to do about equations relying on images #22

Open notconfusing opened 10 years ago

notconfusing commented 10 years ago

They don't need to go on Commons, but it also won't work to upload them locally, because they won't look right if they are not inline, as in https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Ranking_Candidate_Disease_Genes_from_Gene_Expression_and_Protein_Interaction_A_Katz-Centrality_Based_Approach

what to do @Daniel-Mietchen ?

Klortho commented 10 years ago

What's the PMCID? If the XML contains the source TeX or MathML, then it should be rendered with MathJax on the wiki.

notconfusing commented 10 years ago

PMCID: PMC3166320

https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Ranking_Candidate_Disease_Genes_from_Gene_Expression_and_Protein_Interaction_A_Katz-Centrality_Based_Approach

Max Klein ‽ http://notconfusing.com/

Daniel-Mietchen commented 10 years ago

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3166320 does not have anything other than the images, it seems. Example:

Then we drop the superscript and write Eq. (2) on matrix format as
<disp-formula><graphic xlink:href="pone.0024306.e003"/><label>(3)</label></disp-formula>
where <bold>d</bold>  =  (1,…,1)<italic><sup>T</sup></italic>. Which gives
<disp-formula><graphic xlink:href="pone.0024306.e004"/><label>(4)</label></disp-formula>

The same goes for the XML available from PLOS directly. Seems to be a clear case for JATS4R. Will open a ticket there and have it point here.

notconfusing commented 10 years ago

If this is detectable, which it seems to be from <disp-formula><graphic..>, then we can detect it and warn on upload.
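A minimal sketch of such a check, assuming the converter can run a Python step over the JATS XML before upload. The function name `image_only_formulas` is hypothetical, and the sample is the excerpt quoted above wrapped in a made-up `<body>` element so that the `xlink` prefix is declared. It flags every `<disp-formula>` that carries a `<graphic>` but no `<tex-math>` or MathML source.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
MATHML_MATH = "{http://www.w3.org/1998/Math/MathML}math"


def image_only_formulas(jats_xml):
    """Return (label, href) for each <disp-formula> that has a <graphic>
    but no TeX or MathML source anywhere inside it."""
    root = ET.fromstring(jats_xml)
    hits = []
    for formula in root.iter("disp-formula"):
        has_source = (
            formula.find(".//tex-math") is not None
            or formula.find(".//" + MATHML_MATH) is not None
        )
        graphic = formula.find(".//graphic")
        if graphic is not None and not has_source:
            hits.append((formula.findtext("label"), graphic.get(XLINK_HREF)))
    return hits


# Hypothetical wrapper element around the excerpt from PMC3166320.
sample = """<body xmlns:xlink="http://www.w3.org/1999/xlink">
<p>Then we drop the superscript and write Eq. (2) on matrix format as
<disp-formula><graphic xlink:href="pone.0024306.e003"/><label>(3)</label></disp-formula>
where <bold>d</bold> = (1,…,1)<italic><sup>T</sup></italic>. Which gives
<disp-formula><graphic xlink:href="pone.0024306.e004"/><label>(4)</label></disp-formula>
</p></body>"""

for label, href in image_only_formulas(sample):
    print(f"warning: equation {label} is image-only ({href})")
```

The warning text and where it would be surfaced during upload are left open here; the sketch only shows that the pattern is mechanically detectable.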

Daniel-Mietchen commented 10 years ago

Here is another example, in which <disp-formula><graphic..> is not used:

<p>Haplotype diversity was estimated as
</p>
<p><inline-graphic xlink:href="1471-2156-5-26-i1.gif"/>
</p>

(from https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Most_of_the_extant_mtDNA_boundaries_in_South_and_Southwest_Asia_were_likely_shaped_during_the_initial_settlement_of_Eurasia_by_anatomically_moder#Data_analysis ).
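Cases like this one carry no formula markup at all, so any detection can only be a heuristic. One possible sketch, under the assumption that a paragraph consisting solely of `<inline-graphic>` elements is probably an equation image: the function name `graphic_only_paragraphs` and the `<sec>` wrapper are hypothetical, and the rule will miss equation images that sit inside running text.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def graphic_only_paragraphs(jats_xml):
    """Return the hrefs of <inline-graphic> elements found in paragraphs
    that contain nothing else (no text, no other child elements)."""
    root = ET.fromstring(jats_xml)
    hrefs = []
    for p in root.iter("p"):
        children = list(p)
        # Text can sit before the first child (p.text) or after any child (tail).
        has_text = bool((p.text or "").strip()) or any(
            (child.tail or "").strip() for child in children
        )
        only_graphics = all(child.tag == "inline-graphic" for child in children)
        if children and only_graphics and not has_text:
            hrefs.extend(child.get(XLINK_HREF) for child in children)
    return hrefs


# Hypothetical wrapper element around the excerpt quoted above.
sample = """<sec xmlns:xlink="http://www.w3.org/1999/xlink">
<p>Haplotype diversity was estimated as
</p>
<p><inline-graphic xlink:href="1471-2156-5-26-i1.gif"/>
</p>
</sec>"""

print(graphic_only_paragraphs(sample))
```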

Daniel-Mietchen commented 10 years ago

An example with loads of formula images is at https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/In_Silico_Gene_Prioritization_by_Integrating_Multiple_Data_Sources .

wrought commented 10 years ago

I have not found any relevant open source tools for this.

Using OCR on the PDF of such articles (via peerlibrary, which is stable) is very inaccurate (e.g. an upper-case Sigma character is matched with an upper-case "X" character): https://peerlibrary.org/p/ycXY3dk2LFGHsDfE2

There is one extraction library, but it requires the image files to contain original TeX or LaTeX in the file metadata, otherwise it won't work. It doesn't work with images from the named PLOS article above. http://www-cdf.fnal.gov/~cplager/latex/#png

Converting this many equations into accurate TeX, MathML or equivalent source text can be achieved in two ways:

wrought commented 10 years ago

Hmm, just had a clever thought.

Can we upload the equation images to Wikisource? In effect, these raster images are non-source text that needs to be manually transcribed, because the academic record does not currently preserve these data in a "free as in freedom" (i.e. plain text, machine-readable, re-usable) format.

In a way, this brings our project also very close to the common use case for Wikisource -- transcription!

There is not a better place for manual (or assisted or programmatic) transcription of license-compatible academic text that is stored in a non-usable format.

You get the same thing from scanning the first issues of Nature, as you would from including bitmap (PNG) files along with a digital open access article.

wrought commented 10 years ago

I think it qualifies under the guidelines and we should move in this direction: https://en.wikisource.org/wiki/Wikisource:Image_guidelines

Daniel-Mietchen commented 10 years ago

Thanks for this one. I am so used to putting every media file up on Commons that I hadn't even considered the possibility of putting the equation images on Wikisource, but I agree that this sounds like a good solution.

wrought commented 10 years ago

Great!

And to address Max's initial concern, the Wikipedia guidelines show that vertical-align:middle; is the preferred display CSS for inline math, and the fact that equations may increase line-height (aka leading) is to be expected: https://en.wikipedia.org/wiki/Wikipedia:Math#Alignment_with_normal_text_flow

So I think displaying these images inline as-is and, if necessary, wrapping each one in <span style="vertical-align:middle;"> and </span> should be sufficient.
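As a rough illustration of what the converter could emit for each equation image, assuming the image has already been uploaded to Wikisource under the file name passed in (the function name `inline_equation_wikitext` and the `wrap` flag are hypothetical):

```python
def inline_equation_wikitext(filename, wrap=True):
    """Build wikitext that places an uploaded equation image inline,
    optionally wrapped in the vertical-align span proposed above."""
    image = f"[[File:{filename}]]"
    if wrap:
        return f'<span style="vertical-align:middle;">{image}</span>'
    return image


print(inline_equation_wikitext("1471-2156-5-26-i1.gif"))
print(inline_equation_wikitext("1471-2156-5-26-i1.gif", wrap=False))
```

Whether the wrapper is needed at all would have to be checked on a rendered page; the sketch only produces the markup.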

Does this path work then?