wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Cron job to scan for newly published articles #94

Closed Daniel-Mietchen closed 10 years ago

Daniel-Mietchen commented 10 years ago

So far, I have still been searching for newly published articles manually, and it is about time to automate that.

PMC's OA Service at http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/ provides information about which articles have been indexed when.

So we could check - perhaps on an hourly basis - for new articles, e.g. http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?from=2013-09-02+06:00:00&until=2013-09-02+07:00:00&format=tgz .

From that, we can get the PMCIDs of the relevant articles, which can then be fed into a variant of oami_pmc_pmcid_import .
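As a rough sketch of that step, the following builds a query URL for a time window and pulls the IDs out of a response. It assumes the service answers with XML in which each article is a `<record id="PMC…">` element; the `SAMPLE` document and the helper names `oa_url` and `pmcids_from_response` are made up for illustration and are not part of the importer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OA_ENDPOINT = "http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi"


def oa_url(from_ts, until_ts, fmt="tgz"):
    """Build an OA Service query URL for the given time window."""
    query = urlencode({"from": from_ts, "until": until_ts, "format": fmt})
    return OA_ENDPOINT + "?" + query


def pmcids_from_response(xml_text):
    """Return the id attribute of every <record> element in a response."""
    root = ET.fromstring(xml_text)
    return [record.get("id") for record in root.iter("record")]


# Hand-written sample in the assumed response layout, not real service output.
SAMPLE = """<OA>
  <records>
    <record id="PMC0000001"><link format="tgz" href="ftp://example.invalid/a.tar.gz"/></record>
    <record id="PMC0000002"><link format="tgz" href="ftp://example.invalid/b.tar.gz"/></record>
  </records>
</OA>"""

print(oa_url("2013-09-02 06:00:00", "2013-09-02 07:00:00"))
print(pmcids_from_response(SAMPLE))
```

The resulting list could then be joined with whitespace and piped into oami_pmc_pmcid_import.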

I would like this variant (which could just be a command-line option, e.g. -new) to

  1. allow for another instance (or even several) of oami_pmc_pmcid_import to be run on "old items" (e.g. when a new DOI prefix is on the whitelist, or when a Gstreamer bug has been fixed)
  2. have a few predefined standard modes of operation for articles indexed over the last hour, day, week and month (e.g. oami_pmc_pmcid_import -new week).
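The standard modes in item 2 could be mapped to time windows roughly like this. It is only a sketch: the function name `window_for` is hypothetical, "month" is approximated as 30 days, and the timestamps use the local clock of the machine running it, which may differ from the time zone the OA Service expects.

```python
from datetime import datetime, timedelta

# Lengths of the standard modes; treating "month" as 30 days is an assumption.
WINDOWS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
}


def window_for(mode, now=None):
    """Return (from, until) timestamps covering the last hour/day/week/month."""
    if mode not in WINDOWS:
        raise ValueError("unknown mode: %r" % mode)
    until = now or datetime.now()
    start = until - WINDOWS[mode]
    fmt = "%Y-%m-%d %H:%M:%S"
    return start.strftime(fmt), until.strftime(fmt)


print(window_for("week", datetime(2013, 9, 9, 7, 0, 0)))
```

An hourly cron job would call it with "hour", while a catch-up run after downtime could use "week" or "month".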

I am also thinking of separating the crawling, converting and uploading steps now, for which I will open a separate issue as https://github.com/erlehmann/open-access-media-importer/issues/95 .

Daniel-Mietchen commented 10 years ago

In terms of a time frame, I would strongly prefer to have this working before the end of this month. Having it in the next two weeks or so is not necessary, though, so I have not labeled it as "Do it now" for the time being.

erlehmann commented 10 years ago

http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/ returns

You have requested a page which is not open to the public. Your request did not meet the criteria required to grant access to this page.

http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?from=2013-09-02+06:00:00&until=2013-09-02+07:00:00&format=tgz returns that as well.

erlehmann commented 10 years ago

Works with text browser elinks.

Daniel-Mietchen commented 10 years ago

Works fine for me on Chrome under Ubuntu.

erlehmann commented 10 years ago

I wrote a very thin wrapper that wraps the date functionality of the API and returns the PMC IDs of relevant items.

./oa-pmc-ids 2013-09-05 | ./oami_pmc_pmcid_import
should now work.

Daniel, is this something you can use? If so, I will refine it and introduce command line arguments for retrieving the PMC IDs of stuff from the last day/week/month and so on.

Daniel-Mietchen commented 10 years ago

Does not work:

danielmietchen@files:~/open-access-media-importer$ ./oa-pmc-ids 2013-09-05 | ./oami_pmc_pmcid_import
Traceback (most recent call last):
  File "./oa-pmc-ids", line 4, in <module>
    from isodate import parse_date
ImportError: No module named isodate
Input PMCIDs, delimited by whitespace: Traceback (most recent call last):
  File "./oa-get", line 161, in <module>
    for result in source_module.download_metadata(source_path):
  File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 23, in download_metadata
    raise RuntimeError, 'No PMCIDs found.'
RuntimeError: No PMCIDs found.

erlehmann commented 10 years ago

I used the isodate module as it was fastest to do so. Will rewrite the code so it uses the standard datetime facilities.
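For plain YYYY-MM-DD input, the standard library is probably enough. A minimal sketch of such a replacement (the function name simply mirrors the isodate import seen in the traceback; this is not the code from the repository):

```python
from datetime import datetime


def parse_date(text):
    """Parse a YYYY-MM-DD string into a datetime.date."""
    return datetime.strptime(text, "%Y-%m-%d").date()


print(parse_date("2013-09-05"))
```

Input in any other layout raises ValueError, which a command-line wrapper could turn into a usage message.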

erlehmann commented 10 years ago

Daniel, does it work as of commit affb797698373f474452fc3acd6b9e1a15e8b45d?

Daniel-Mietchen commented 10 years ago

Nope - "no module named requests".

RaphaelWimmer commented 10 years ago

I have installed python-requests now. Please try again :)

Daniel-Mietchen commented 10 years ago

Nope. Same error, even after logging in to the server anew. Somehow, requests is not visible to my instance.

danielmietchen@files:~$ cd open-access-media-importer/
danielmietchen@files:~/open-access-media-importer$ ./oa-pmc-ids 2013-09-01 | ./oami_pmc_pmcid_import
Traceback (most recent call last):
  File "./oa-pmc-ids", line 5, in <module>
    from requests import get
ImportError: No module named requests
Input PMCIDs, delimited by whitespace: Traceback (most recent call last):
  File "./oa-get", line 161, in <module>
    for result in source_module.download_metadata(source_path):
  File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 23, in download_metadata
    raise RuntimeError, 'No PMCIDs found.'
RuntimeError: No PMCIDs found.

RaphaelWimmer commented 10 years ago

Oh, sorry, wrong server ;) Should work now...

Daniel-Mietchen commented 10 years ago

OK,

./oa-pmc-ids 2013-09-01 | ./oami_pmc_pmcid_import

is running now, and I will run it for a few more days to watch out for problems.

Daniel-Mietchen commented 10 years ago

A few minutes later:

Checking MIME types …
84 of 1197   7% |#######

That does not look like an ideal solution. Will keep it running, though.

erlehmann commented 10 years ago

I have updated the tool to take “--from” and “--until” arguments and given it basic command line argument facilities as of commit f843bae3d09d8c6657f27851a02548c54dea9506:

1067 open-access-media-importer:master? % ./oa-pmc-ids --help
usage: oa-pmc-ids [-h] [--from FROM] [--until UNTIL]

List PMC IDs for articles in the PubMed Central Open Access subset.

optional arguments:
  -h, --help     show this help message and exit
  --from FROM    Only list articles updated on or after the specified date
                 (YYYY-MM-DD).
  --until UNTIL  Only list articles updated before the specified date (YYYY-
                 MM-DD).

Caveat: All dates are given in local time in Bethesda, Maryland: either EST
(-05:00) or EDT (-04:00), depending on the time of year.

erlehmann commented 10 years ago

What is missing right now is a continuation if more than 1000 PMCIDs are returned.
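Such a continuation could look roughly like the loop below. It is a sketch under two assumptions: that a truncated response carries a `<resumption><link href="…"/></resumption>` element pointing at the next page, and that `fetch` is whatever function retrieves a URL as text (a dictionary lookup stands in for it here, so nothing touches the network).

```python
import xml.etree.ElementTree as ET


def collect_pmcids(first_url, fetch):
    """Follow resumption links until none is left, gathering record IDs.

    `fetch` is any callable that takes a URL and returns the response text.
    """
    pmcids = []
    url = first_url
    while url:
        root = ET.fromstring(fetch(url))
        pmcids.extend(record.get("id") for record in root.iter("record"))
        # Assumed layout: <resumption><link href="..."/></resumption>
        link = root.find(".//resumption/link")
        url = link.get("href") if link is not None else None
    return pmcids


# Two fake pages standing in for real responses.
PAGES = {
    "page1": "<OA><records><record id='PMC1'/>"
             "<resumption><link href='page2'/></resumption></records></OA>",
    "page2": "<OA><records><record id='PMC2'/></records></OA>",
}

print(collect_pmcids("page1", PAGES.__getitem__))
```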

erlehmann commented 10 years ago

As of commit dd3bb656af5867714bcea3e4b27743856b931ebc, oa-pmc-ids can fetch more than 1000 records via resumption URLs:

1089 open-access-media-importer:master+? % ./oa-pmc-ids --from 2013-08-27 --until 2013-09-05 | wc -w
8550

Daniel-Mietchen commented 10 years ago

OK, the run mentioned at https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-23900660 is over, and it has brought files from one article, e.g. https://commons.wikimedia.org/wiki/File:In-Vivo-Imaging-of-Trypanosome-Brain-Interactions-and-Development-of-a-Rapid-Screening-Test-for-pntd.0002384.s008.ogv .

Tried

./oa-pmc-ids --from 2013-08-27 --until 2013-09-05 | ./oami_pmc_pmcid_import

just now, which loaded pages and pages of PMCIDs and then stopped with

, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
Traceback (most recent call last):
  File "./oa-get", line 161, in <module>
    for result in source_module.download_metadata(source_path):
  File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 46, in download_metadata
    content = _get_file_from_pmcids(chunk)
  File "/home/danielmietchen/open-access-media-importer/sources/pmc_doi.py", line 41, in _get_file_from_pmcids
    xml_file = _get_file_from_url(url)
  File "/home/danielmietchen/open-access-media-importer/sources/pmc_doi.py", line 17, in _get_file_from_url
    remote_file = urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 401, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 419, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

Daniel-Mietchen commented 10 years ago

Just did

./oa-pmc-ids --from 2013-09-02 --until 2013-09-03 | ./oami_pmc_pmcid_import

and

./oa-pmc-ids --from 2013-09-04 --until 2013-09-05 | ./oami_pmc_pmcid_import

Both seem to have worked fine, even though all relevant files had already been uploaded before.

Daniel-Mietchen commented 10 years ago

Played around a little more, and it seems to work OK, with two caveats:

  1. I do not see a way to do just one day now:
./oa-pmc-ids --from 2013-08-04 --until 2013-08-04 | ./oami_pmc_pmcid_import

gives

Input PMCIDs, delimited by whitespace: Traceback (most recent call last):
  File "./oa-get", line 161, in <module>
    for result in source_module.download_metadata(source_path):
  File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 23, in download_metadata
    raise RuntimeError, 'No PMCIDs found.'
RuntimeError: No PMCIDs found.

  2. Problems with the MediaWiki API (which have been around for a long time but occurred only occasionally) are now very frequent: almost all runs of tasks like

./oa-pmc-ids --from 2013-08-15 --until 2013-08-16 | ./oami_pmc_pmcid_import

end with

Mediawiki API request failed, retrying.
Traceback (most recent call last):
  File "./oa-get", line 187, in <module>
    if mediawiki.is_uploaded(material):
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 60, in is_uploaded
    result = query(params)
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 17, in query
    return query(request)
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 12, in query
    request = wikitools.api.APIRequest(wiki, params)
  File "/home/danielmietchen/open-access-media-importer/helpers/wikitools/api.py", line 61, in __init__
    self.data = data.copy()
AttributeError: APIRequest instance has no attribute 'copy'
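Reading the traceback, the retry path in helpers/mediawiki.py (line 17) appears to call query() again with the already-built request object instead of the original parameter dictionary, so the second APIRequest tries to copy something that is not a dict. A sketch of a retry that keeps passing the parameters and gives up after a bounded number of attempts; `FakeAPIRequest` is a made-up stand-in for wikitools.api.APIRequest, and the attempt count is arbitrary:

```python
class FakeAPIRequest:
    """Stand-in for wikitools.api.APIRequest; it copies its params too."""
    failures_left = 2  # simulate two transient failures

    def __init__(self, params):
        # Fails if handed a request object instead of a dict.
        self.data = params.copy()

    def query(self):
        if FakeAPIRequest.failures_left > 0:
            FakeAPIRequest.failures_left -= 1
            raise IOError("transient API failure")
        return {"ok": True, "params": self.data}


def query(params, attempts=5):
    """Retry with the original params dict, never with the request object."""
    for attempt in range(attempts):
        request = FakeAPIRequest(params)
        try:
            return request.query()
        except IOError:
            if attempt == attempts - 1:
                raise
            print("Mediawiki API request failed, retrying.")


print(query({"action": "query", "titles": "File:Example.ogv"}))
```

Using a loop instead of recursion also avoids unbounded nesting when the API stays down for a while.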
erlehmann commented 10 years ago

I think you can get PMC IDs for exactly one day by just specifying the next day for --until, like

./oa-pmc-ids --from 2013-08-04 --until 2013-08-05

If I understand http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/ correctly, the interval does not include the second date. I will test my assumptions after sleep.

Daniel-Mietchen commented 10 years ago

OK, just tested it, and your assumption seems correct.
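Since the interval is half-open, a single day is simply the pair (day, day + 1). A small sketch of computing the two arguments; the helper name `single_day_range` is made up:

```python
from datetime import date, timedelta


def single_day_range(day):
    """Return (--from, --until) values that cover exactly one calendar day."""
    return day.isoformat(), (day + timedelta(days=1)).isoformat()


start, end = single_day_range(date(2013, 8, 4))
print("./oa-pmc-ids --from %s --until %s" % (start, end))
```

Using timedelta rather than incrementing the day number by hand keeps month and year boundaries correct.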

Daniel-Mietchen commented 10 years ago

I am now getting 502, 503 or 404 errors when accessing links like

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3729697&id=PMC3729697&id=PMC3729698&id=PMC3729698&id=PMC3729701&id=PMC3729701&id=PMC3729709&id=PMC3729709&id=PMC3729710&id=PMC3729710&id=PMC3729711&id=PMC3729711&id=PMC3651006&id=PMC3717781&id=PMC3729712&id=PMC3729712&id=PMC3729713&id=PMC3729713&id=PMC3729714&id=PMC3729714&id=PMC3511636&id=PMC3402253&id=PMC3465820&id=PMC3402270&id=PMC3402256&id=PMC3402273&id=PMC3713381&id=PMC3465828&id=PMC3713384&id=PMC3465819&id=PMC3465822&id=PMC3402310&id=PMC3402304&id=PMC3402307&id=PMC3688188&id=PMC3726937&id=PMC3478411&id=PMC3530630&id=PMC3713389&id=PMC3402312&id=PMC3361001&id=PMC3402332&id=PMC3402315&id=PMC3402306&id=PMC3610836&id=PMC3402323&id=PMC3552073&id=PMC3469749&id=PMC3557305&id=PMC3402298&id=PMC3557313&id=PMC3355245&id=PMC3402318&id=PMC3402275&id=PMC3402261&id=PMC3688102&id=PMC3709642&id=PMC3672996&id=PMC3722482&id=PMC3402309&id=PMC3713380&id=PMC3402704&id=PMC3709092&id=PMC3374053&id=PMC3614090&id=PMC3570738&id=PMC3610864&id=PMC3402308&id=PMC3713385&id=PMC3693823&id=PMC3402328&id=PMC3402271&id=PMC3713388&id=PMC3560304&id=PMC3547199&id=PMC3519975&id=PMC3551985&id=PMC3552031&id=PMC3565055&id=PMC3557753&id=PMC3557309&id=PMC3570671&id=PMC3694309&id=PMC3605914&id=PMC3689263&id=PMC3573219&id=PMC3402302&id=PMC3580713&id=PMC3402325&id=PMC3557312&id=PMC3402244&id=PMC3402242&id=PMC3402267&id=PMC3579950&id=PMC3557306&id=PMC3465830&id=PMC3567215&id=PMC3374051&id=PMC3402327&id=PMC3465821&id=PMC3402263&id=PMC3402246&id=PMC3387511&id=PMC3402329&id=PMC3711237&id=PMC3465829&id=PMC3402320&id=PMC3721121&id=PMC3568452&id=PMC3682191&id=PMC3721115&id=PMC3402331&id=PMC3402252&id=PMC3589714&id=PMC3565048&id=PMC3402266&id=PMC3402243&id=PMC3560316&id=PMC3402260&id=PMC3719096&id=PMC3690523&id=PMC3713386&id=PMC3402317&id=PMC3529982&id=PMC3557310&id=PMC3402268&id=PMC3565030&id=PMC3557311&id=PMC3514565&id=PMC3415469&id=PMC3402250&id=PMC3402279&id=PMC3402262&id=PMC3402316&id=PMC3402305&id=PMC3402322&id=PMC3402276&id=PMC3725421&id=P
MC3532566&id=PMC3402330&id=PMC3402247&id=PMC3402264&id=PMC3565066&id=PMC3709640&id=PMC3682192&id=PMC3711238&id=PMC3717206&id=PMC3552148&id=PMC3713390&id=PMC3530628&id=PMC3714437&id=PMC3402255&id=PMC3565103&id=PMC3647680&id=PMC3565060&id=PMC3381866&id=PMC3511638&id=PMC3570742&id=PMC3708128&id=PMC3561505&id=PMC3713387&id=PMC3383881&id=PMC3684770&id=PMC3713383&id=PMC3724054&id=PMC3719504&id=PMC3568245&id=PMC3546159&id=PMC3679924&id=PMC3552084&id=PMC3402297&id=PMC3519961&id=PMC3402314&id=PMC3402254&id=PMC3565016&id=PMC3402257&id=PMC3402245&id=PMC3402265&id=PMC3402248&id=PMC3565010&id=PMC3402258&id=PMC3402278&id=PMC3714436&id=PMC3465824&id=PMC3557308&id=PMC3566319&id=PMC3566333&id=PMC3610816&id=PMC3540863&id=PMC3596117&id=PMC3402274&id=PMC3557617&id=PMC3717539&id=PMC3546144&id=PMC3402303&id=PMC3402326&id=PMC3522572&id=PMC3688101&id=PMC3715700&id=PMC3402311&id=PMC3710969&id=PMC3380161&id=PMC3721122&id=PMC3349796&id=PMC3387354&id=PMC3685218&id=PMC3402296&id=PMC3402313&id=PMC3521079&id=PMC3713379&id=PMC3560312&id=PMC3402259&id=PMC3598632&id=PMC3677091&id=PMC3713382&id=PMC3402299&id=PMC3465825&id=PMC3557314&id=PMC3402321&id=PMC3664012&id=PMC3402324&id=PMC3610845&id=PMC3402272&id=PMC3561482&id=PMC3509239&id=PMC3565028&id=PMC3540847&id=PMC3556371&id=PMC3402301&id=PMC3465823&id=PMC3557304&id=PMC3557307&id=PMC3709093&id=PMC3552087&id=PMC3721032&id=PMC3370147&id=PMC3721114&id=PMC3661409&id=PMC3323740&id=PMC3416017&id=PMC3465827&id=PMC3715702&id=PMC3345295&id=PMC3469751&id=PMC3402300&id=PMC3556932&id=PMC3402249&id=PMC3567218&id=PMC3402251&id=PMC3402269&id=PMC3465826&id=PMC3529859&id=PMC3721108&id=PMC3582403&id=PMC3725829&id=PMC3726041&id=PMC3726292&id=PMC3726340&id=PMC3726358&id=PMC3726357&id=PMC3726356&id=PMC3726355&id=PMC3726326&id=PMC3726323&id=PMC3726329&id=PMC3726335&id=PMC3726330&id=PMC3726341&id=PMC3726336&id=PMC3726354&id=PMC3726351&id=PMC3726369&id=PMC3726375&id=PMC3726404&id=PMC3726422&id=PMC3726395&id=PMC3726394&id=PMC3726400&id=PMC3726398&id=PMC3726416&id=PMC3726413&id
=PMC3726428&id=PMC3726454&id=PMC3726427&id=PMC3726469&id=PMC3726473&id=PMC3726483&id=PMC3726484&id=PMC3728583&id=PMC3728574&id=PMC3728572&id=PMC3728578&id=PMC3728581&id=PMC3728576&id=PMC3728584&id=PMC3728585&id=PMC3728577&id=PMC3728579&id=PMC3728575&id=PMC3728580&id=PMC3728573&id=PMC3728582&id=PMC3728571&id=PMC3725666&id=PMC3725695&id=PMC3724562&id=PMC3727950&id=PMC3727988&id=PMC3728002&id=PMC3728007&id=PMC3728011&id=PMC3728229&id=PMC3720986&id=PMC3720988&id=PMC3720989&id=PMC3720990&id=PMC3720994&id=PMC3720992&id=PMC3721000&id=PMC3720987&id=PMC3720995&id=PMC3578983&id=PMC3720993&id=PMC3722367&id=PMC3722371&id=PMC3722368&id=PMC3722366&id=PMC3722369&id=PMC3722374&id=PMC3722377&id=PMC3722378&id=PMC3722388&id=PMC3722380&id=PMC3722382&id=PMC3722379&id=PMC3722375&id=PMC3722372&id=PMC3722383&id=PMC3722435&id=PMC3722373&id=PMC3722437&id=PMC3722439&id=PMC3722444&id=PMC3722447&id=PMC3722450&id=PMC3722454&id=PMC3722453&id=PMC3722457&id=PMC3722458&id=PMC3722456&id=PMC3722376&id=PMC3722384&id=PMC3722452&id=PMC3722445&id=PMC3722370&id=PMC3722385&id=PMC3722462&id=PMC3722387&id=PMC3722455&id=PMC3722436&id=PMC3722449&id=PMC3722381&id=PMC3722460&id=PMC3725221&id=PMC3725661

that are used in calls like

./oa-pmc-ids --from 2013-08-01 --until 2013-08-02 | ./oami_pmc_pmcid_import

Same result for single PMCIDs: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3729697 http://www.webcitation.org/6JQ2dSSXG

I have asked them what the matter might be.
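Independently of the server-side errors, the long URL above repeats several IDs (PMC3729697, for example, appears twice). A sketch of dropping duplicates and splitting the rest into smaller efetch requests follows; the chunk size of 100 is an arbitrary assumption, the helper names are made up, and the IDs are passed as one comma-separated `id` value, which E-utilities accepts.

```python
from urllib.parse import urlencode

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def unique(ids):
    """Drop repeated IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))


def efetch_urls(ids, chunk_size=100):
    """Yield one efetch URL per chunk of at most chunk_size distinct IDs."""
    ids = unique(ids)
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        yield EFETCH + "?" + urlencode({"db": "pmc", "id": ",".join(chunk)})


for url in efetch_urls(["PMC3729697", "PMC3729697", "PMC3729698"], chunk_size=2):
    print(url)
```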

Klortho commented 10 years ago

All of NCBI was having problems earlier today, but I thought they were resolved. I see that this eutils call is still failing, let me report it.

Daniel-Mietchen commented 10 years ago

It works normally again - thanks!

Daniel-Mietchen commented 10 years ago

The bundling of large numbers of articles by this approach exposes the existing problems more prominently.

For instance,

 ./oa-pmc-ids --from 2013-09-06 --until 2013-09-07 | ./oami_pmc_pmcid_import

and similar calls for many other days end with an assertion error as in https://github.com/erlehmann/open-access-media-importer/issues/84 , while others end with conversion errors due to Gstreamer.

These problems would be easier to handle if oa-pmc-ids produced a list of PMCIDs and oami_pmc_pmcid_import were then run in a loop over its elements.

Daniel-Mietchen commented 10 years ago

I am irregularly getting 503s at eutils again.

Klortho commented 10 years ago

My understanding is that they updated the OS on several core services machines, and are now having intermittent problems with the memory allocation routine when under heavy load. I know they are working on it, but not sure what the latest status is. Hopefully you are not still having problems. If you are, let me know.

Daniel-Mietchen commented 10 years ago

It has worked fine again since about Thursday.

Daniel-Mietchen commented 10 years ago

What still needs addressing is https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-24017794 - oami_pmc_pmcid_import should not be run for all PMCIDs found by oa-pmc-ids in one go, but loop over them, so that an issue with one PMCID does not block the processing of all the others in that batch.
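The idea, sketched in Python with made-up names: run the import once per PMCID and record failures instead of aborting. `import_one` stands for whatever performs the import of a single article; `demo_import` below is a hypothetical stand-in that fails for one ID only.

```python
def import_each(pmcids, import_one):
    """Run import_one on every PMCID; record failures instead of aborting."""
    failed = {}
    for pmcid in pmcids:
        try:
            import_one(pmcid)
        except Exception as error:  # deliberately broad: any failure is per-item
            failed[pmcid] = error
    return failed


def demo_import(pmcid):
    # Hypothetical importer that fails for one article only.
    if pmcid == "PMC0000002":
        raise AssertionError("conversion failed")


failed = import_each(["PMC0000001", "PMC0000002", "PMC0000003"], demo_import)
print(sorted(failed))
```

The returned mapping doubles as a to-do list for a later rerun once the underlying bug has been fixed.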

Daniel-Mietchen commented 10 years ago

Changing to "Do it now" - I plan to switch to this as the primary mode of operation by the end of the month.

Daniel-Mietchen commented 10 years ago

eutils is intermittent again - I am sometimes getting valid responses, sometimes 503 or other errors.
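Until the service is stable again, the importer could tolerate such outages by retrying with growing pauses. A sketch, assuming that 502 and 503 are worth retrying while other errors such as 404 are not; the function name, attempt count and delays are arbitrary choices, and a fake opener replaces the network in the demonstration.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen

RETRYABLE = {502, 503}  # other codes, such as 404, are treated as permanent


def fetch_with_backoff(url, opener=urlopen, attempts=4, delay=1.0,
                       sleep=time.sleep):
    """Fetch url, retrying 502/503 responses with exponentially growing pauses."""
    for attempt in range(attempts):
        try:
            return opener(url).read()
        except HTTPError as error:
            if error.code not in RETRYABLE or attempt == attempts - 1:
                raise
            sleep(delay * 2 ** attempt)


class FakeResponse:
    def read(self):
        return b"<pmc-articleset/>"


def flaky_opener(url, _state={"calls": 0}):
    # Fake opener: answers 503 twice, then succeeds (no network involved).
    _state["calls"] += 1
    if _state["calls"] <= 2:
        raise HTTPError(url, 503, "Service Unavailable", None, None)
    return FakeResponse()


print(fetch_with_backoff("http://example.invalid/efetch", opener=flaky_opener,
                         sleep=lambda seconds: None))
```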

Daniel-Mietchen commented 10 years ago

Moved the part from https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-24017794 to https://github.com/erlehmann/open-access-media-importer/issues/111 and closing.