wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Category blacklist #92

Open Daniel-Mietchen opened 11 years ago

Daniel-Mietchen commented 11 years ago

Some categories that the bot puts files in are not wanted over at Commons, e.g. "Middle aged" and "Young adult". I think the best way to handle that would be to have a blacklist of categories that can be adjusted by the user.

Perhaps we could also have a "grey list" of categories that do not exist on Commons yet, which could then either be created or blacklisted.
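A minimal Python sketch of how the blacklist and the "grey list" could fit together. The names `parse_category_list` and `triage_categories` are hypothetical and not part of the importer. The Commons existence check is passed in as a callback (`exists_on_commons`, also hypothetical), so the sketch does not assume any particular API client.

```python
from typing import Callable, Iterable


def parse_category_list(text: str) -> set[str]:
    """Parse a user-editable list: one category per line, '#' starts a comment."""
    entries = set()
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            entries.add(name)
    return entries


def triage_categories(
    categories: Iterable[str],
    blacklist: set[str],
    exists_on_commons: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Return (accepted, grey): blacklisted names are dropped, names that do
    not exist on Commons yet go to the grey list for manual review."""
    accepted, grey = [], []
    for category in categories:
        if category in blacklist:
            continue
        if exists_on_commons(category):
            accepted.append(category)
        else:
            grey.append(category)
    return accepted, grey


# Usage with an in-memory stand-in for the Commons lookup (illustrative data).
blacklist = parse_category_list("Middle aged\nYoung adult  # not wanted on Commons\n")
known_on_commons = {"Glycocalyx"}
accepted, grey = triage_categories(
    ["Middle aged", "Glycocalyx", "Endothelial glycocalyx"],
    blacklist,
    known_on_commons.__contains__,
)
print(accepted)  # ['Glycocalyx']
print(grey)      # ['Endothelial glycocalyx']
```

Entries on the grey list could then be reviewed by hand and either created on Commons or moved to the blacklist file.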

Daniel-Mietchen commented 11 years ago

OK, let's start the list here (ordered alphabetically, to facilitate maintenance):

Antarctic ocean
Article
Methods article
Middle aged
Monte carlo method
Motor neuron disease
North carolina
Original Articles
Original paper
Original research article
Research Article
Research article
Research articles
Review Article
Statistical methods
Swine
Young adult

Daniel-Mietchen commented 11 years ago

Once we have that in place, we can drop the workaround of deleting one-word categories, and put the unsuitable ones on the blacklist instead.
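To illustrate the difference, here is a small Python sketch comparing the two approaches. The function names are hypothetical. "Swine" and "Young adult" are taken from the list above, while "Zebrafish" is only an illustrative single-word category that one might want to keep.

```python
def drop_one_word(categories):
    """The workaround: discard every category that consists of a single word."""
    return [c for c in categories if len(c.split()) > 1]


def drop_blacklisted(categories, blacklist):
    """The proposed replacement: discard only categories named on the blacklist."""
    return [c for c in categories if c not in blacklist]


categories = ["Swine", "Young adult", "Zebrafish"]
blacklist = {"Swine", "Young adult"}

by_word_count = drop_one_word(categories)
by_blacklist = drop_blacklisted(categories, blacklist)
print(by_word_count)  # ['Young adult']
print(by_blacklist)   # ['Zebrafish']
```

The word-count heuristic lets the unwanted two-word category through and discards the one-word category that might be useful; the blacklist does the opposite.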

erlehmann commented 10 years ago

I thought we already filter single-word categories?

erlehmann commented 10 years ago

Sorry, I thought we wanted to filter all single-word categories. I did not realize it is a workaround.

erlehmann commented 10 years ago

Can you give me an article with unsuitable categories so I can write up a test case?

erlehmann commented 10 years ago

I think the blacklist belongs in the template construction, right before upload. Due to category name capitalization postprocessing, only one of “Research article” and “Research Article” should be filtered – I think it is the former.

Daniel-Mietchen commented 10 years ago

The blacklist should come after the automated capitalization postprocessing, since the rules used there also sometimes introduce errors (e.g. "North carolina" or "Monte carlo method"). So we need to blacklist both “Research article” and “Research Article”.
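The ordering can be sketched in Python as follows. The capitalization rule used here (uppercase the first letter, lowercase the rest) is only an assumption standing in for the importer's real postprocessing; it was chosen because it reproduces the "North carolina" kind of error. The function names are hypothetical as well.

```python
def normalize_capitalization(category: str) -> str:
    # ASSUMPTION: stand-in for the importer's actual capitalization rule.
    return category[:1].upper() + category[1:].lower()


# Both spellings of "Research article" are listed, as suggested above.
BLACKLIST = {
    "North carolina",
    "Monte carlo method",
    "Research article",
    "Research Article",
}


def final_categories(raw, blacklist=BLACKLIST):
    """Normalize capitalization first, then apply the blacklist to the result."""
    normalized = [normalize_capitalization(c) for c in raw]
    return [c for c in normalized if c not in blacklist]


raw = ["North Carolina", "Monte Carlo method", "Research Article", "Glycocalyx"]
result = final_categories(raw)
print(result)  # ['Glycocalyx']
```

Because the filter runs on the normalized names, it also catches the errors that normalization itself introduces. Listing both spellings guards against cases where the real rules leave the original capitalization untouched.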

Some files with (initial) categories listed under https://github.com/erlehmann/open-access-media-importer/issues/92#issuecomment-24354554 are:

https://commons.wikimedia.org/wiki/File:Exploring-the-acquisition-and-production-of-grammatical-constructions-through-human-robot-Movie1.ogv
https://commons.wikimedia.org/wiki/File:Distributed-organization-of-a-brain-microcircuit-analyzed-by-three-dimensional-modeling-the-Movie1.ogv
https://commons.wikimedia.org/wiki/File:Deeper-Penetration-of-Erythrocytes-into-the-Endothelial-Glycocalyx-Is-Associated-with-Impaired-pone.0096477.s002.ogv
https://commons.wikimedia.org/wiki/File:An-Internet--and-Mobile-Based-Tailored-Intervention-to-Enhance-Maintenance-of-Physical-Activity-jmir_v16i3e77_app2.ogv
https://commons.wikimedia.org/wiki/File:Animated-Randomness-Avatars-Movement-and-Personalization-in-Risk-Graphics-jmir_v16i3e80_app1.ogv
https://commons.wikimedia.org/wiki/File:Supporting-Patients-Treated-for-Prostate-Cancer-A-Video-Vignette-Study-With-an-Email-Based-jmir_v16i2e63_app1.ogv
https://commons.wikimedia.org/wiki/File:A-Demonstration-of-Nesting-in-Two-Antarctic-Icefish-%28Genus-Chionodraco%29-Using-a-Fin-Dimorphism-pone.0090512.s002.ogv
https://commons.wikimedia.org/wiki/File:Regulated-aggregative-multicellularity-in-a-close-unicellular-relative-of-metazoa-elife01287v001.ogv
https://commons.wikimedia.org/wiki/File:Subcellular-and-supracellular-mechanical-stress-prescribes-cytoskeleton-behavior-in-Arabidopsis-elife01967v002.ogv
https://commons.wikimedia.org/wiki/File:Deeper-Penetration-of-Erythrocytes-into-the-Endothelial-Glycocalyx-Is-Associated-with-Impaired-pone.0096477.s001.ogv
https://commons.wikimedia.org/wiki/File:Drama-based-education-to-motivate-participation-in-substance-abuse-prevention-1747-597X-2-11-S1.ogv