Open Daniel-Mietchen opened 11 years ago
This is easy, but please list some test cases so I can verify my solution.
The problem is that I don't have an easy way to tell whether an article had single-word categories that we stripped off.
What I could provide is an incomplete list of articles whose files ended up having no content categories upon upload.
Latest example: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3522013/, which gave rise to
http://commons.wikimedia.org/wiki/File:Vincristine-enhances-amoeboid-like-motility-via-GEF-H1RhoAROCKMyosin-light-chain-signaling-in-MKN45-1471-2407-12-469-S1.ogv
http://commons.wikimedia.org/wiki/File:Vincristine-enhances-amoeboid-like-motility-via-GEF-H1RhoAROCKMyosin-light-chain-signaling-in-MKN45-1471-2407-12-469-S2.ogv
http://commons.wikimedia.org/wiki/File:Vincristine-enhances-amoeboid-like-motility-via-GEF-H1RhoAROCKMyosin-light-chain-signaling-in-MKN45-1471-2407-12-469-S3.ogv
Perhaps we can use this as an occasion to rethink the way categories are dealt with.
For many articles, the XML already contains a hierarchy of
For the example of http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3480456 , this would give Biological Fluid Mechanics Behavioral Ecology Entomology Ichthyology Image Processing Biomechanics
We currently strip off single-word categories. However, about 600 of the files uploaded so far had no content categories at all, which is much harder to manage than the overcategorization that we have with most of the other files.
So I would suggest that after stripping the single-word categories, we check whether any categories are left, and if none are, we restore the single-word ones.
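
A minimal sketch of that fallback, in case it helps with the test cases asked for above. The function name `filter_categories` and the plain-list input are assumptions made for illustration, not the importer's actual code, and "single-word" is taken to mean "contains no whitespace after trimming".

```python
def filter_categories(categories):
    """Drop single-word categories, unless that would leave nothing.

    `categories` is assumed to be a list of category strings already
    extracted from the article metadata (hypothetical input shape).
    """
    # Normalise: trim whitespace and discard empty entries.
    cleaned = [c.strip() for c in categories if c and c.strip()]

    # Keep only categories made of more than one word.
    multi_word = [c for c in cleaned if len(c.split()) > 1]

    # Fallback: if stripping removed everything, restore the
    # single-word categories rather than uploading with none.
    return multi_word if multi_word else cleaned


if __name__ == "__main__":
    # Mixed input: single-word entries are stripped as before.
    print(filter_categories(["Entomology", "Image Processing"]))
    # Only single-word entries: they are kept instead of being dropped.
    print(filter_categories(["Entomology", "Ichthyology"]))
```

Three cases seem worth checking against any real implementation: a mixed list (single-word entries removed), a list with only single-word entries (all kept), and an empty list (stays empty).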