wpoa / OA-signalling

A project to coordinate implementing a system to signal whether references cited on Wikipedia are free to reuse
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Open_Access/Signalling_OA-ness
GNU General Public License v3.0

Launching RfC on Wikisource #53

Closed Daniel-Mietchen closed 10 years ago

Daniel-Mietchen commented 10 years ago

About mass-importing full-text OA articles into en.ws

notconfusing commented 10 years ago

@Daniel-Mietchen the text is now up. https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central

issues on

https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Taxonomic_revision_of_the_olingos_%28Bassaricyon%29,_with_description_of_a_new_species,_the_Olinguito

@wrought jats-to-mediawiki didn't make a references section on this article. @Daniel-Mietchen can you run OAMI for images on this article?

Daniel-Mietchen commented 10 years ago

@wrought @Klortho The Olinguito article does not just miss a reference, but half of the article, apparently due to some issue with the header of table 3.

@notconfusing OAMI is not yet set up to import images, but most of the ones from this article are already on Commons, and I will do the rest manually once the text is complete.

notconfusing commented 10 years ago

@wrought @Klortho Just want to make sure that I am doing this right. The output is the result of these commands. Perhaps I am doing it wrong because I am feeding it the .nxml file? Am I supposed to give it all the numbered files? What's the difference? Also, a warning appears saying that an external entity could not be loaded; see:

notconfusing@eigenzorg:~/workspace/JATS-to-Mediawiki$ ls Zookeys_2013_Aug_15_\(324\)_1-83
license.txt               ZooKeys-324-001-g010.jpg  ZooKeys-324-001-g020.jpg
ZooKeys-324-001-g001.gif  ZooKeys-324-001-g011.gif  ZooKeys-324-001-g021.gif
ZooKeys-324-001-g001.jpg  ZooKeys-324-001-g011.jpg  ZooKeys-324-001-g021.jpg
ZooKeys-324-001-g002.gif  ZooKeys-324-001-g012.gif  ZooKeys-324-001-g022.gif
ZooKeys-324-001-g002.jpg  ZooKeys-324-001-g012.jpg  ZooKeys-324-001-g022.jpg
ZooKeys-324-001-g003.gif  ZooKeys-324-001-g013.gif  ZooKeys-324-001-g023.gif
ZooKeys-324-001-g003.jpg  ZooKeys-324-001-g013.jpg  ZooKeys-324-001-g023.jpg
ZooKeys-324-001-g004.gif  ZooKeys-324-001-g014.gif  ZooKeys-324-001-g024.gif
ZooKeys-324-001-g004.jpg  ZooKeys-324-001-g014.jpg  ZooKeys-324-001-g024.jpg
ZooKeys-324-001-g005.gif  ZooKeys-324-001-g015.gif  ZooKeys-324-001.nxml
ZooKeys-324-001-g005.jpg  ZooKeys-324-001-g015.jpg  ZooKeys-324-001.pdf
ZooKeys-324-001-g006.gif  ZooKeys-324-001-g016.gif  zookeys.324.5827-treatment1.xml
ZooKeys-324-001-g006.jpg  ZooKeys-324-001-g016.jpg  zookeys.324.5827-treatment2.xml
ZooKeys-324-001-g007.gif  ZooKeys-324-001-g017.gif  zookeys.324.5827-treatment3.xml
ZooKeys-324-001-g007.jpg  ZooKeys-324-001-g017.jpg  zookeys.324.5827-treatment4.xml
ZooKeys-324-001-g008.gif  ZooKeys-324-001-g018.gif  zookeys.324.5827-treatment5.xml
ZooKeys-324-001-g008.jpg  ZooKeys-324-001-g018.jpg  zookeys.324.5827-treatment6.xml
ZooKeys-324-001-g009.gif  ZooKeys-324-001-g019.gif  zookeys.324.5827-treatment7.xml
ZooKeys-324-001-g009.jpg  ZooKeys-324-001-g019.jpg  zookeys.324.5827-treatment8.xml
ZooKeys-324-001-g010.gif  ZooKeys-324-001-g020.gif
notconfusing@eigenzorg:~/workspace/JATS-to-Mediawiki$ xsltproc jats-to-mediawiki.xsl Zookeys_2013_Aug_15_\(324\)_1-83/ZooKeys-324-001.nxml > ZooKeys-324-001.mw.xml
Zookeys_2013_Aug_15_(324)_1-83/ZooKeys-324-001.nxml:1: warning: failed to load external entity "Zookeys_2013_Aug_15_(324)_1-83/JATS-archivearticle1.dtd"
rnal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd"

notconfusing commented 10 years ago

This is solved, and yes, that is the right way to convert the XML. The problem was Python's xml.etree.ElementTree not handling <br/>, but @wrought is now converting those into newlines.
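
For illustration only, converting <br/> elements into newlines with xml.etree.ElementTree could look roughly like the sketch below. This is not necessarily how @wrought implemented the fix; the function name br_to_newlines and the assumption that the element is literally named "br" (with no namespace) are hypothetical.

```python
import xml.etree.ElementTree as ET


def br_to_newlines(root, tag="br"):
    """Replace every <br/> element under root with a newline character.

    Hypothetical helper: the newline and the text that followed the <br/>
    (its tail) are appended to whatever text precedes it in the parent.
    """
    # Snapshot the traversal first, because the tree is modified below.
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag:
                replacement = "\n" + (child.tail or "")
                if prev is None:
                    parent.text = (parent.text or "") + replacement
                else:
                    prev.tail = (prev.tail or "") + replacement
                parent.remove(child)
            else:
                prev = child
    return root


if __name__ == "__main__":
    p = ET.fromstring("<p>line one<br/>line two<b>bold</b><br/>end</p>")
    br_to_newlines(p)
    print(repr("".join(p.itertext())))
```

Running this before handing the tree to later processing steps means no <br/> elements remain in the tree, only plain newlines in the text.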

So @Daniel-Mietchen, the problematic article is fixed. It is ready for you to upload the images and launch the RfC.

Daniel-Mietchen commented 10 years ago

We're making good progress here, but some details still remain to be addressed.

Please import the other articles from the test set in https://github.com/Daniel-Mietchen/OA-signalling/issues/37#issuecomment-42750689 as well.

I'll go through these too and launch the RfC once the majority of the bugs listed at https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central#Bugs are fixed.

I'll refrain from editing the articles' wiki pages manually, so as to avoid overwrites like https://en.wikisource.org/w/index.php?title=Wikisource%3AWikiProject_Open_Access%2FProgrammatic_import_from_PubMed_Central%2FThe_Vpr_protein_from_HIV-1%3A_distinct_roles_along_the_viral_life_cycle&diff=4899177&oldid=4894578 .

notconfusing commented 10 years ago

@Daniel-Mietchen the rest of the articles in #37 are up for your perusal. We spot-checked them and are reporting some of the bugs we found. For instance, 10.1371/journal.pbio.0020207 displays citations but does not get a Reflist. There are also some more JATS-to-MediaWiki problems, like breaking on complex elements in tables. Please report the rest.
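
A spot check for the missing-Reflist case could be automated along these lines. This is only a sketch: the helper name missing_reflist is hypothetical, and it assumes the converted wikitext marks citations with <ref> tags and renders them with either {{Reflist}} or <references/>.

```python
import re

# A <ref> tag in any of its forms: <ref>, <ref name="x">, <ref name="x"/>.
REF_TAG = re.compile(r"<ref[\s>/]", re.IGNORECASE)
# Either the {{Reflist}} template or a bare <references/> tag.
REF_LIST = re.compile(r"\{\{\s*reflist\b|<references\s*/?>", re.IGNORECASE)


def missing_reflist(wikitext: str) -> bool:
    """Return True if the text has citations but nothing to render them."""
    return bool(REF_TAG.search(wikitext)) and not REF_LIST.search(wikitext)


if __name__ == "__main__":
    print(missing_reflist("Text<ref>Smith 2004</ref>"))
    print(missing_reflist("Text<ref>Smith 2004</ref>\n{{Reflist}}"))
```

Pages for which the helper returns True would be candidates for the bug list.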

Daniel-Mietchen commented 10 years ago

Draft page: https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Draft_RfC

Policy at https://en.wikisource.org/wiki/Wikisource:Requests_for_comment is to ask at https://en.wikisource.org/wiki/Wikisource:Scriptorium first.

Daniel-Mietchen commented 10 years ago

Posted as "proposal" on the Scriptorium: https://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&oldid=4925187#Automated_import_of_openly_licensed_scholarly_articles .