wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Some DOIs have parentheses or other special characters... #50

Open Daniel-Mietchen opened 12 years ago

Daniel-Mietchen commented 12 years ago

... which causes the shell to bark.

Examples: 10.1044/1092-4388(2010/09-0106), 10.1044/1092-4388(2009/07-0280), 10.1044/1092-4388(2007/028). This could be circumvented by having the option to run the bot via PMCID, as per https://github.com/erlehmann/open-access-media-importer/issues/44 .
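For the shell side of this, a minimal sketch of quoting such a DOI before it is placed on a command line. It uses Python 3's `shlex.quote`; the helper name `quote_doi_for_shell` is hypothetical and not part of the importer.

```python
import shlex


def quote_doi_for_shell(doi):
    """Return the DOI wrapped so a POSIX shell treats it as one literal word.

    Hypothetical helper for illustration; not part of the importer.
    """
    # Parentheses are shell metacharacters, so shlex.quote wraps the
    # whole string in single quotes.
    return shlex.quote(doi)


if __name__ == "__main__":
    for doi in (
        "10.1044/1092-4388(2010/09-0106)",
        "10.1044/1092-4388(2009/07-0280)",
        "10.1044/1092-4388(2007/028)",
    ):
        print(quote_doi_for_shell(doi))
```

This only covers DOIs passed as command-line arguments. A DOI piped in on standard input never goes through shell word parsing once it has been produced by a quoted `echo`.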

erlehmann commented 12 years ago

Error is:

1005 open-access-media-importer:master? % echo "10.1044/1092-4388(2010/09-0106)" | ./oami_pmc_doi_import
Removing “/home/erlehmann/.local/share/open-access-media-importer/pmc_doi.sqlite” … done.
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … Traceback (most recent call last):
  File "./oa-get", line 118, in <module>
    for result in source_module.download_metadata(source_path):
  File "/home/erlehmann/src/open-access-media-importer/sources/pmc_doi.py", line 56, in download_metadata
    raise RuntimeError, 'No PubMed Central IDs for given DOIs found.'
RuntimeError: No PubMed Central IDs for given DOIs found.

erlehmann commented 12 years ago

The problem is not the shell, but esearch. For doi:10.1044/1092-4388(2010/09-0106) the esearch query URL is http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=10.1044/1092-4388(2010/09-0106)[doi] and the result contains nothing of value.

erlehmann commented 12 years ago

“Entrez processes all Boolean operators in a left-to-right sequence. Enclosing individual concepts in parentheses changes this priority.” – http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Using_Boolean_Operators
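If the parentheses are indeed being read as grouping operators, one thing that might be worth trying is wrapping the DOI in double quotes and percent-encoding the whole term before sending it to esearch. The sketch below (Python 3) only builds the URL. The function name `build_esearch_url` is hypothetical, and whether Entrez then treats the quoted parentheses literally is an assumption that has not been verified in this thread.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the URL quoted earlier in this thread.
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(doi, db="pmc"):
    """Build an esearch URL for one DOI (hypothetical helper).

    Assumption: a double-quoted term is searched as a literal phrase,
    so the parentheses are no longer parsed as grouping.
    """
    term = '"%s"[doi]' % doi
    # urlencode percent-encodes quotes, slashes, parentheses and brackets.
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    print(build_esearch_url("10.1044/1092-4388(2010/09-0106)"))
```

If the quoted form still returns nothing, the PMCID route from issue #44 avoids the DOI lookup altogether.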

erlehmann commented 12 years ago

Can we delegate this?

Daniel-Mietchen commented 12 years ago

I lowered the priority to "Nice to have", since none of the publishers currently foreseen for the whitelist (cf. https://github.com/erlehmann/open-access-media-importer/issues/57 ) use such DOI schemes.