yoavram / markx

Markdown editor for scientific writing. Batteries included.

PDF to XML converter #4

Open karthik opened 11 years ago

karthik commented 11 years ago

One great way to enhance scholarly writing would be to convert the document to semantic markup. This tool, http://pdfx.cs.man.ac.uk/, might be super handy for us because we could first export to PDF, then programmatically convert to XML. I'll leave it here as a placeholder.

yoavram commented 11 years ago

I've written a working Python client for this web service (https://gist.github.com/4351598). It takes about 30-60 seconds to get a response from the website, so I'm not sure how to integrate it into markx.
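One possible way to live with a 30-60 second response is to run the request in a background thread and collect the result when it is ready. The sketch below is only an illustration under assumptions: `convert` is a hypothetical callable standing in for whatever function wraps the web service (it is not the API of the gist above), and `submit_conversion` is a name made up for this example.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# A small worker pool: a slow request ties up a worker thread, not the caller.
_executor = ThreadPoolExecutor(max_workers=2)


def submit_conversion(convert: Callable[[bytes], str], pdf_bytes: bytes) -> Future:
    """Start a slow PDF-to-XML conversion without blocking the caller.

    `convert` is a hypothetical stand-in for the real web-service client.
    It takes the PDF as bytes and returns the XML as a string.
    """
    return _executor.submit(convert, pdf_bytes)


if __name__ == "__main__":
    import time

    def slow_fake_convert(data: bytes) -> str:
        # Simulates a slow service; a real client would do an HTTP request here.
        time.sleep(0.2)
        return "<document bytes='%d'/>" % len(data)

    future = submit_conversion(slow_fake_convert, b"%PDF-1.4 ...")
    print("still responsive, done =", future.done())
    print(future.result(timeout=5))
```

The caller can poll `future.done()` or register `future.add_done_callback(...)` to be notified, so the editor would stay usable while the conversion runs.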

tolot27 commented 11 years ago

The right way would be to convert it to (X)HTML (or DocBook/OpenDocument XML) via Pandoc and then apply a stylesheet to get the desired XML. Converting from PDF will definitely lose information, especially on two-column layouts, even if the application from http://www.scfbm.org/content/7/1/7 is used.
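A rough sketch of that two-step route, under stated assumptions. The first helper only builds a Pandoc command line (`-f`, `-t`, `-s` and `-o` are standard Pandoc options). The second step swaps the stylesheet for a small Python tag mapping with `xml.etree.ElementTree`, purely to keep the example free of extra dependencies; a real setup would more likely use XSLT. The target vocabulary (`article`, `title`, `para`) and the function names are hypothetical, and the sample input is hand-written without namespaces, which real Pandoc XHTML output may include.

```python
import subprocess
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical mapping from (X)HTML tags to the desired XML vocabulary.
TAG_MAP = {"h1": "title", "p": "para"}


def pandoc_command(src: str, dest: str, to: str = "html") -> List[str]:
    """Build a Pandoc invocation that converts Markdown to a standalone file."""
    return ["pandoc", "-f", "markdown", "-t", to, "-s", src, "-o", dest]


def run_pandoc(src: str, dest: str, to: str = "html") -> None:
    """Run Pandoc; requires the pandoc executable to be installed."""
    subprocess.run(pandoc_command(src, dest, to), check=True)


def restyle(xhtml: str) -> str:
    """Map known tags to the target vocabulary, flattening inline markup."""
    source = ET.fromstring(xhtml)
    article = ET.Element("article")
    for node in source.iter():  # document order
        target = TAG_MAP.get(node.tag)
        if target is not None:
            child = ET.SubElement(article, target)
            child.text = "".join(node.itertext())
    return ET.tostring(article, encoding="unicode")


if __name__ == "__main__":
    sample = "<html><body><h1>Intro</h1><p>Some <em>text</em>.</p></body></html>"
    print(restyle(sample))
```

Because the structure comes from the Markdown source rather than from a rendered page, headings and paragraphs arrive already labelled, which is the information a PDF round trip tends to drop.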