wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Set up a whitelist for oami_pmc_pmcid_import #57

Closed Daniel-Mietchen closed 11 years ago

Daniel-Mietchen commented 11 years ago

As per https://github.com/erlehmann/open-access-media-importer/issues/52#issuecomment-10643624

erlehmann commented 11 years ago

DOI_PREFIX_WHITELIST now has the following content:

10.1155
10.1186
10.1371
10.3389
10.3897
10.7554

I am assuming this means that, if a whitelist is given, everything whose DOI prefix is not on the list is discarded, and processing continues only for the prefixes that are on it.

Daniel-Mietchen commented 11 years ago

Yes, that would be the expected behaviour, but that's not what I see:

daniel@oami-host:~/open-access-media-importer$ git pull
remote: Counting objects: 8, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 6 (delta 3), reused 5 (delta 2)
Unpacking objects: 100% (6/6), done.
From git://github.com/erlehmann/open-access-media-importer
f6635ba..2ab38e5 master -> origin/master
Updating f6635ba..2ab38e5
Fast-forward
DOI_PREFIX_WHITELIST | 6 ++++++
plot-helper | 2 +-
2 files changed, 7 insertions(+), 1 deletion(-)
create mode 100644 DOI_PREFIX_WHITELIST
daniel@oami-host:~/open-access-media-importer$ for ((pmcid=3491706; pmcid>=17; pmcid--)) ; do echo $pmcid | ./oami_pmc_pmcid_import; done ;
Removing “/home/daniel/.local/share/open-access-media-importer/pmc_pmcid.sqlite” … done.
Input PMCIDs, delimited by whitespace:
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3491706”, saving into directory “/home/daniel/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
100% |#########################################################################|
Cellular Microbiology
2012
Trafficking and release of Leishmania metacyclic HASPB on macrophage invasion
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processorskey)
“Trafficking and release of Leishmania metacyclic HASPB on macrophage invasion”:
5 × video/mp4
5 × image/tiff
1 × application/msword

Checking MIME types …
100% |#########################################################################|
Skipping http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491706/bin/cmi0014-0740-SD7.mp4, already exists at http://commons.wikimedia.org/w/api.php.
Skipping http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491706/bin/cmi0014-0740-SD8.mp4, already exists at http://commons.wikimedia.org/w/api.php.
Skipping http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491706/bin/cmi0014-0740-SD9.mp4, already exists at http://commons.wikimedia.org/w/api.php.
Skipping http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491706/bin/cmi0014-0740-SD10.mp4, already exists at http://commons.wikimedia.org/w/api.php.
Skipping http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491706/bin/cmi0014-0740-SD11.mp4, already exists at http://commons.wikimedia.org/w/api.php.
Removing “/home/daniel/.local/share/open-access-media-importer/pmc_pmcid.sqlite” … done.
Input PMCIDs, delimited by whitespace:
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3491705”, saving into directory “/home/daniel/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
100% |#########################################################################|
Biophysical Journal
2012
Force Spectroscopy with Dual-Trap Optical Tweezers: Molecular Stiffness Measurements and Coupled Fluctuations Analysis
Unknown copyright statement: © 2012 by the Biophysical Society.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processorskey)
Checking MIME types …
No materials found.

Daniel-Mietchen commented 11 years ago

In the example above, I would expect both PMCID 3491706 and 3491705 to display something like "Publisher's DOI prefix (10.XXXX) is not on whitelist - skipping."
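
For illustration, the expected behaviour could be sketched roughly as follows. This is only a sketch, not code from open-access-media-importer: the function names `doi_prefix` and `skip_message` are hypothetical, the DOIs in the usage note are made up, and treating an empty whitelist as "no filtering" is an assumption based on the "in case a whitelist is given" wording earlier in the thread.

```python
# Hypothetical sketch; names and behaviour are assumptions, not the project's code.

# Prefixes as listed in DOI_PREFIX_WHITELIST earlier in this thread.
DOI_PREFIX_WHITELIST = {
    "10.1155", "10.1186", "10.1371", "10.3389", "10.3897", "10.7554",
}


def doi_prefix(doi):
    """Return the registrant prefix of a DOI (the part before the first slash)."""
    return doi.strip().split("/", 1)[0]


def skip_message(doi, whitelist):
    """Return a skip message if the DOI prefix is not whitelisted, else None.

    Assumption: an empty whitelist disables filtering entirely.
    """
    if not whitelist:
        return None
    prefix = doi_prefix(doi)
    if prefix in whitelist:
        return None
    return "Publisher's DOI prefix (%s) is not on whitelist - skipping." % prefix
```

With a made-up DOI such as `10.9999/example.1`, `skip_message` would return the message quoted above with `10.9999` filled in, while a made-up DOI under `10.1371` would return `None` and be processed as usual.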

erlehmann commented 11 years ago

Well, that is because the issue is not fixed yet.

erlehmann commented 11 years ago

In your configuration file, include a section like this:

[whitelist]
doi = 10.1155 10.1186 10.1371 10.3389 10.3897 10.7554
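
A section of this shape could be read along the following lines. This is a minimal sketch using Python 3's standard `configparser` module (the log above shows the tool running on Python 2.7, where the module is named `ConfigParser`); the function name `read_doi_whitelist` and the inline sample text are hypothetical and not taken from the project.

```python
import configparser

# Sample configuration text mirroring the section shown above.
SAMPLE_CONFIG = """\
[whitelist]
doi = 10.1155 10.1186 10.1371 10.3389 10.3897 10.7554
"""


def read_doi_whitelist(config_text):
    """Parse the whitespace-separated DOI prefixes from a [whitelist] section.

    Returns an empty set if the section or the 'doi' option is missing.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("whitelist", "doi", fallback="")
    return set(raw.split())


whitelist = read_doi_whitelist(SAMPLE_CONFIG)
```

Returning a set keeps membership checks cheap and makes the order of prefixes in the configuration file irrelevant.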

erlehmann commented 11 years ago

Fixed by c5a686a2999777c1e6741208e648d88dd2f5f59e.