Avoid having no topic categories

wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons

http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot


Avoid having no topic categories #54

Open Daniel-Mietchen opened 11 years ago

Daniel-Mietchen commented 11 years ago

We currently strip off single-word categories. However, about 600 of the files uploaded so far had no content categories at all, which is much harder to manage than the overcategorization that we have with most of the other files.

So I would suggest that after stripping the single-word categories, we check whether any categories are left, and if there are none, we add the single-word ones back in.
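A minimal sketch of the proposed fallback, assuming the categories arrive as a plain list of strings. The function name `filter_categories` is hypothetical and not taken from the importer's code base:

```python
def filter_categories(categories):
    """Drop single-word categories, but keep them if nothing else would remain."""
    # A category counts as multi-word if it contains whitespace-separated parts.
    multi_word = [c for c in categories if len(c.split()) > 1]
    # Fallback: if stripping removed everything, return the original list instead.
    return multi_word if multi_word else list(categories)
```

With this, a file whose article only has single-word categories would still end up with some content categories, while all other files are treated as before.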

erlehmann commented 11 years ago

This is easy, but please list some test cases so I can verify my solution.

Daniel-Mietchen commented 11 years ago

The problem is that I don't have an easy way to tell whether an article had single-word categories that we stripped off.

What I could provide is an incomplete list of articles whose files ended up having no content categories upon upload.

Latest example: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3522013/ , giving rise to
http://commons.wikimedia.org/wiki/File:Vincristine-enhances-amoeboid-like-motility-via-GEF-H1RhoAROCKMyosin-light-chain-signaling-in-MKN45-1471-2407-12-469-S1.ogv
http://commons.wikimedia.org/wiki/File:Vincristine-enhances-amoeboid-like-motility-via-GEF-H1RhoAROCKMyosin-light-chain-signaling-in-MKN45-1471-2407-12-469-S2.ogv
http://commons.wikimedia.org/wiki/File:Vincristine-enhances-amoeboid-like-motility-via-GEF-H1RhoAROCKMyosin-light-chain-signaling-in-MKN45-1471-2407-12-469-S3.ogv

Daniel-Mietchen commented 11 years ago

Perhaps we can use this as an occasion to rethink the way categories are dealt with.

For many articles, the XML already contains a hierarchy of tags, of which we should perhaps just use the innermost ones.

For the example of http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3480456 , this would give: Biological Fluid Mechanics, Behavioral Ecology, Entomology, Ichthyology, Image Processing, Biomechanics.
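A rough sketch of the "innermost tags only" idea, assuming the hierarchy is expressed as nested `subj-group` elements with `subject` children, as in JATS-style article XML. The sample document and the function name `innermost_subjects` are made up for illustration and are not the actual efetch output:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real PMC XML may nest and label groups differently.
SAMPLE = """
<article-categories>
  <subj-group subj-group-type="Discipline">
    <subject>Biology</subject>
    <subj-group>
      <subject>Zoology</subject>
      <subj-group>
        <subject>Entomology</subject>
      </subj-group>
      <subj-group>
        <subject>Ichthyology</subject>
      </subj-group>
    </subj-group>
  </subj-group>
  <subj-group subj-group-type="Discipline">
    <subject>Engineering</subject>
    <subj-group>
      <subject>Image Processing</subject>
    </subj-group>
  </subj-group>
</article-categories>
"""

def innermost_subjects(root):
    """Return subjects from subj-groups that contain no nested subj-group."""
    leaves = []
    for group in root.iter("subj-group"):
        # A group with a direct subj-group child is not innermost; skip it.
        if group.find("subj-group") is not None:
            continue
        for subject in group.findall("subject"):
            text = (subject.text or "").strip()
            if text and text not in leaves:  # keep order, drop duplicates
                leaves.append(text)
    return leaves

print(innermost_subjects(ET.fromstring(SAMPLE)))
```

Taking only the leaves drops broad parents such as "Biology" without relying on a word count, so a single-word but specific category like "Entomology" survives.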