wpoa / recitation-bot

MediaWiki bot to upload content to Wikimedia projects and update corresponding citations on Wikipedia.
GNU General Public License v3.0

Keep "forbidden characters" from paper titles in page titles at Wikisource #16

Open Daniel-Mietchen opened 10 years ago

Daniel-Mietchen commented 10 years ago

At OAMI, the file naming of the uploads to Commons gets rid of many special characters. At Wikisource, we should strive to keep the paper titles as intact as possible (see also https://github.com/wpoa/recitation-bot/issues/15 ), taking into account technical limitations of MediaWiki (e.g. colons or slashes in page names).

Daniel-Mietchen commented 10 years ago

The paper at http://dx.doi.org/10.1186/1742-4690-2-11 has a colon in the title that was not brought over to https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_Vpr_protein_from_HIV-1_distinct_roles_along_the_viral_life_cycle . I would be inclined to keep the colon (and did so in a manual move of a previously imported version to https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/The_Vpr_protein_from_HIV-1:_distinct_roles_along_the_viral_life_cycle ), but I am not entirely sure what is policy or practice at Wikisource on this.

wrought commented 10 years ago

Seems like a fringe case. Colons are considered a "forbidden character" by OAMI, and we copied the title-cleaning function from there. You can see it here.

wrought commented 10 years ago

Also, we should currently be keeping all Unicode characters, and simply eliminating a small number of "forbidden characters". I'll update the title to reflect this.

Daniel-Mietchen commented 9 years ago

The OAMI rules were set up with Commons in mind, and I think we should leave our Commons-facing naming rules like this until the time we can pull all this info from Wikidata.

For Wikisource, I'd agree that we should just exclude "forbidden characters" and keep everything else.
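
A minimal sketch of what "exclude forbidden characters and keep everything else" could look like for Wikisource page titles. This is not the bot's actual title-cleaning function. The helper name `wikisource_title` and the constant `FORBIDDEN_CHARS` are hypothetical, and the character set is an assumption based on the characters MediaWiki rejects in page titles (`# < > [ ] { } |`). Colons and all other Unicode characters are kept.

```python
import re

# Assumption: characters that MediaWiki does not allow in page titles.
# Colons are deliberately NOT in this set, so they survive the cleaning.
FORBIDDEN_CHARS = "#<>[]{}|"


def wikisource_title(paper_title: str) -> str:
    """Hypothetical helper: drop only forbidden characters, keep the rest."""
    kept = "".join(ch for ch in paper_title if ch not in FORBIDDEN_CHARS)
    # Collapse runs of whitespace into single underscores, matching the
    # form page names take in MediaWiki URLs.
    return re.sub(r"\s+", "_", kept.strip())


if __name__ == "__main__":
    title = "The Vpr protein from HIV-1: distinct roles along the viral life cycle"
    print(wikisource_title(title))
```

For the paper mentioned above, this keeps the colon and produces the same page name as the manually moved version. Slashes also pass through unchanged here; whether they should be handled separately (MediaWiki can treat them as subpage separators) is left open.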