Closed Daniel-Mietchen closed 11 years ago
In terms of a time frame, I would strongly prefer to have this working before the end of this month. Having it in the next two weeks or so is not necessary, though, so I have not labeled it as "Do it now" for the time being.
http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/ returns
You have requested a page which is not open to the public. Your request did not meet the criteria required to grant access to this page.
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?from=2013-09-02+06:00:00&until=2013-09-02+07:00:00&format=tgz returns that as well.
Works with text browser elinks.
Works fine for me on Chrome under Ubuntu.
I wrote a very thin wrapper that wraps the date functionality of the API and returns the PMC IDs of relevant items.
./oa-pmc-ids 2013-09-05 | ./oami_pmc_pmcid_import should now work.
Daniel, is this something you can use? If so, I will refine it and introduce command line arguments for retrieving the PMC IDs of stuff from the last day/week/month and so on.
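For readers following along, the core of such a wrapper can be sketched as below. This is not the code from the repository: the XML layout (`record` elements carrying the PMC ID in an `id` attribute) is an assumption about the OA service's response format, the sample document is hand-written, `PMC0000001` is a placeholder ID, and `extract_pmcids` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT a real reply from the OA service. The element and
# attribute names are assumptions about the response layout.
SAMPLE_RESPONSE = """\
<OA>
  <records>
    <record id="PMC3729697">
      <link format="tgz" href="ftp://example.invalid/PMC3729697.tar.gz"/>
    </record>
    <record id="PMC0000001">
      <link format="tgz" href="ftp://example.invalid/PMC0000001.tar.gz"/>
    </record>
  </records>
</OA>
"""


def extract_pmcids(xml_text):
    """Return the id attribute of every <record> element, in document order."""
    root = ET.fromstring(xml_text)
    return [record.get("id") for record in root.iter("record") if record.get("id")]


if __name__ == "__main__":
    # Whitespace-delimited output, which is what oami_pmc_pmcid_import prompts for.
    print(" ".join(extract_pmcids(SAMPLE_RESPONSE)))
```

Printing the IDs separated by whitespace keeps the output usable as input for the "Input PMCIDs, delimited by whitespace" prompt seen in the tracebacks.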
Does not work:
danielmietchen@files:~/open-access-media-importer$ ./oa-pmc-ids 2013-09-05 | ./oami_pmc_pmcid_import
Traceback (most recent call last):
File "./oa-pmc-ids", line 4, in <module>
from isodate import parse_date
ImportError: No module named isodate
Input PMCIDs, delimited by whitespace: Traceback (most recent call last):
File "./oa-get", line 161, in <module>
for result in source_module.download_metadata(source_path):
File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 23, in download_metadata
raise RuntimeError, 'No PMCIDs found.'
RuntimeError: No PMCIDs found.
I used the isodate module as it was fastest to do so. Will rewrite the code so it uses the standard datetime facilities.
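A minimal sketch of what parsing the date with the standard library alone might look like; `parse_date` here is a hypothetical stand-in for the isodate function of the same name, and it only accepts the plain YYYY-MM-DD form used in the commands above.

```python
from datetime import datetime


def parse_date(text):
    """Parse a YYYY-MM-DD string into a datetime.date; raises ValueError otherwise."""
    return datetime.strptime(text, "%Y-%m-%d").date()


if __name__ == "__main__":
    print(parse_date("2013-09-05"))
```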
Daniel, does it work as of commit affb797698373f474452fc3acd6b9e1a15e8b45d?
Nope - "no module named requests".
I have installed python-requests now. Please try again :)
Nope. Same error, even after logging in to the server anew. Somehow, requests is not visible to my instance.
danielmietchen@files:~$ cd open-access-media-importer/
danielmietchen@files:~/open-access-media-importer$ ./oa-pmc-ids 2013-09-01 | ./oami_pmc_pmcid_import
Traceback (most recent call last):
File "./oa-pmc-ids", line 5, in <module>
from requests import get
ImportError: No module named requests
Input PMCIDs, delimited by whitespace: Traceback (most recent call last):
File "./oa-get", line 161, in <module>
for result in source_module.download_metadata(source_path):
File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 23, in download_metadata
raise RuntimeError, 'No PMCIDs found.'
RuntimeError: No PMCIDs found.
Oh, sorry, wrong server ;) Should work now...
OK,
./oa-pmc-ids 2013-09-01 | ./oami_pmc_pmcid_import
is running now, and I will do a few more days to watch out for problems.
A few minutes later:
Checking MIME types …
84 of 1197 7% |#######
That does not look like an ideal solution. Will keep it running, though.
I have updated the tool to take “--from” and “--until” arguments and given it basic command line argument facilities as of commit f843bae3d09d8c6657f27851a02548c54dea9506:
1067 open-access-media-importer:master? % ./oa-pmc-ids --help
usage: oa-pmc-ids [-h] [--from FROM] [--until UNTIL]

List PMC IDs for articles in the PubMed Central Open Access subset.

optional arguments:
  -h, --help     show this help message and exit
  --from FROM    Only list articles updated on or after the specified date (YYYY-MM-DD).
  --until UNTIL  Only list articles updated before the specified date (YYYY-MM-DD).

Caveat: All dates are given in local time in Bethesda, Maryland: either EST (-05:00) or EDT (-04:00), depending on the time of year.
What is missing right now is a continuation if more than 1000 PMCIDs are returned.
As of commit dd3bb656af5867714bcea3e4b27743856b931ebc, oa-pmc-ids can fetch more than 1000 records via resumption URLs:
1089 open-access-media-importer:master+? % ./oa-pmc-ids --from 2013-08-27 --until 2013-09-05 | wc -w
8550
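The general shape of following resumption URLs can be sketched like this. It is an illustration, not the code from that commit: the `resumption`/`link` element names and the `href` attribute are assumptions about the service's XML, `collect_pmcids` is a hypothetical name, and the network call is replaced by a `fetch` callable so that the loop can be tried offline against two hand-written pages.

```python
import xml.etree.ElementTree as ET


def collect_pmcids(first_url, fetch):
    """Gather record IDs page by page until no resumption link is left.

    fetch(url) must return the XML text found at that URL.
    """
    pmcids = []
    url = first_url
    while url is not None:
        root = ET.fromstring(fetch(url))
        pmcids.extend(record.get("id") for record in root.iter("record"))
        # Assumed layout: <resumption><link href="next page"/></resumption>
        link = root.find(".//resumption/link")
        url = link.get("href") if link is not None else None
    return pmcids


# Two hand-written pages standing in for the real service.
FAKE_PAGES = {
    "page-1": '<OA><records><record id="PMC0000001"/>'
              '<resumption><link href="page-2"/></resumption></records></OA>',
    "page-2": '<OA><records><record id="PMC0000002"/></records></OA>',
}

if __name__ == "__main__":
    print(collect_pmcids("page-1", FAKE_PAGES.__getitem__))
```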
OK, the run mentioned at https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-23900660 is over, and it has brought files from one article, e.g. https://commons.wikimedia.org/wiki/File:In-Vivo-Imaging-of-Trypanosome-Brain-Interactions-and-Development-of-a-Rapid-Screening-Test-for-pntd.0002384.s008.ogv .
Tried
./oa-pmc-ids --from 2013-08-27 --until 2013-09-05 | ./oami_pmc_pmcid_import
just now, which loaded pages and pages of PMCIDs and then stopped with
, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
Traceback (most recent call last):
File "./oa-get", line 161, in <module>
for result in source_module.download_metadata(source_path):
File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 46, in download_metadata
content = _get_file_from_pmcids(chunk)
File "/home/danielmietchen/open-access-media-importer/sources/pmc_doi.py", line 41, in _get_file_from_pmcids
xml_file = _get_file_from_url(url)
File "/home/danielmietchen/open-access-media-importer/sources/pmc_doi.py", line 17, in _get_file_from_url
remote_file = urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 401, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 419, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
Just did
./oa-pmc-ids --from 2013-09-02 --until 2013-09-03 | ./oami_pmc_pmcid_import
and
./oa-pmc-ids --from 2013-09-04 --until 2013-09-05 | ./oami_pmc_pmcid_import
Both seem to have worked fine, even though all relevant files had already been uploaded before.
Played around a little more, and it seems to work OK, with two caveats:
1: using the same date for --from and --until, as in
./oa-pmc-ids --from 2013-08-04 --until 2013-08-04 | ./oami_pmc_pmcid_import
gives
Input PMCIDs, delimited by whitespace: Traceback (most recent call last):
File "./oa-get", line 161, in <module>
for result in source_module.download_metadata(source_path):
File "/home/danielmietchen/open-access-media-importer/sources/pmc_pmcid.py", line 23, in download_metadata
raise RuntimeError, 'No PMCIDs found.'
RuntimeError: No PMCIDs found.
2: problems with the MediaWiki API (which have been around for a long time but only occurred occasionally) are now very frequent - almost all runs of tasks like
./oa-pmc-ids --from 2013-08-15 --until 2013-08-16 | ./oami_pmc_pmcid_import
end with
Mediawiki API request failed, retrying.
Traceback (most recent call last):
File "./oa-get", line 187, in <module>
if mediawiki.is_uploaded(material):
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 60, in is_uploaded
result = query(params)
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 17, in query
return query(request)
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 12, in query
request = wikitools.api.APIRequest(wiki, params)
File "/home/danielmietchen/open-access-media-importer/helpers/wikitools/api.py", line 61, in __init__
self.data = data.copy()
AttributeError: APIRequest instance has no attribute 'copy'
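Reading the traceback, one plausible cause (not confirmed anywhere in this thread) is that the retry at mediawiki.py line 17 passes the already-built APIRequest object back into query(), which then hands it to APIRequest again where a dict with a .copy() method is expected. A sketch of a retry that always starts from the original parameters follows; `query_with_retry`, `send`, and the choice of `ConnectionError` as the transient failure are all hypothetical and do not come from helpers/mediawiki.py or wikitools.

```python
import time


def query_with_retry(params, send, attempts=3, delay=1.0):
    """Call send(params) up to `attempts` times and return its result.

    send is a hypothetical callable that builds and executes one API request.
    """
    for attempt in range(1, attempts + 1):
        try:
            # Build each attempt from the original params dict, never from a
            # request object created by an earlier attempt.
            return send(dict(params))
        except ConnectionError:
            if attempt == attempts:
                raise
            print("Mediawiki API request failed, retrying.")
            time.sleep(delay)
```

Capping the number of attempts also avoids unbounded recursion when the API stays unavailable.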
I think you can get PMC IDs for exactly one day by just specifying the next day for --until, like
./oa-pmc-ids --from 2013-08-04 --until 2013-08-05
If I understand http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/ correctly, the interval does not include the second date. I will test my assumptions after sleep.
OK, just tested it, and your assumption seems correct.
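Given that the interval excludes the --until date, the arguments for a single day can be derived mechanically. A small sketch, with `single_day_range` as a hypothetical helper name (it needs Python 3.7 or later for `date.fromisoformat`):

```python
from datetime import date, timedelta


def single_day_range(day):
    """Return (--from, --until) values covering exactly one calendar day.

    Assumes the interval is half-open, i.e. the --until date is excluded.
    """
    start = date.fromisoformat(day)
    return start.isoformat(), (start + timedelta(days=1)).isoformat()


if __name__ == "__main__":
    start, end = single_day_range("2013-08-04")
    print("./oa-pmc-ids --from %s --until %s" % (start, end))
```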
I am now getting 502, 503 or 404 errors when accessing links like
that are used in calls like
./oa-pmc-ids --from 2013-08-01 --until 2013-08-02 | ./oami_pmc_pmcid_import
Same result for single PMCIDs: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3729697 http://www.webcitation.org/6JQ2dSSXG
I have asked them what the matter might be.
All of NCBI was having problems earlier today, but I thought they were resolved. I see that this eutils call is still failing, let me report it.
It works normally again - thanks!
The bundling of large numbers of articles by this approach exposes the existing problems more prominently.
For instance,
./oa-pmc-ids --from 2013-09-06 --until 2013-09-07 | ./oami_pmc_pmcid_import
as well as similar calls for many other days end with an assertion error as in https://github.com/erlehmann/open-access-media-importer/issues/84 , others with conversion errors due to GStreamer.
These problems would be easier to handle if oa-pmc-ids produced an array or list of PMCIDs, over whose elements oami_pmc_pmcid_import would then loop.
I am irregularly getting 503s at eutils again.
My understanding is that they updated the OS on several core services machines, and are now having intermittent problems with the memory allocation routine when under heavy load. I know they are working on it, but not sure what the latest status is. Hopefully you are not still having problems. If you are, let me know.
Worked fine again since about Thursday.
What still needs addressing is https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-24017794 - oami_pmc_pmcid_import should not be run for all PMCIDs found by oa-pmc-ids in one go, but loop over them, so that an issue with one PMCID does not block the processing of all the others in that batch.
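The requested behaviour could look roughly like the sketch below. It is only an outline under assumptions: `import_each` is a hypothetical name, and `import_one` stands in for whatever oami_pmc_pmcid_import does for a single PMCID (download, convert, upload).

```python
def import_each(pmcids, import_one):
    """Run import_one for every PMCID and return {pmcid: error} for the failures."""
    failures = {}
    for pmcid in pmcids:
        try:
            import_one(pmcid)
        except Exception as error:
            # Record the problem and carry on with the remaining PMCIDs.
            failures[pmcid] = error
    return failures


if __name__ == "__main__":
    def fake_import(pmcid):
        # Stand-in importer that fails for one placeholder ID.
        if pmcid == "PMC0000002":
            raise AssertionError("simulated failure")

    failed = import_each(["PMC0000001", "PMC0000002", "PMC0000003"], fake_import)
    print(sorted(failed))
```

Returning the failures instead of raising lets a batch finish and leaves a list of PMCIDs to retry or inspect later.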
Changing to "Do it now" - I plan to switch to this as the primary mode of operation by the end of the month.
eutils is intermittent again - I am sometimes getting valid responses, sometimes 503 or other errors.
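One common way to ride out intermittent 502/503 responses is to retry with growing pauses. The sketch below is generic and not taken from the importer: `fetch_with_backoff` is a hypothetical name, `fetch` is any callable returning a `(status_code, body)` pair, and the set of retryable codes and the delays are arbitrary choices.

```python
import time

RETRYABLE = {502, 503}  # assumed to be transient server-side errors


def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying retryable statuses with exponential backoff.

    Returns the last (status_code, body) pair seen.
    """
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < retries:
            # Wait base_delay, then twice that, then four times that, and so on.
            sleep(base_delay * 2 ** attempt)
    return status, body
```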
So far, the search for newly published articles has still been performed manually by me, and it's now about time to automate that.
PMC's OA Service at http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/ provides information about which articles have been indexed when.
So we could check - perhaps on an hourly basis - for new articles, e.g. http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?from=2013-09-02+06:00:00&until=2013-09-02+07:00:00&format=tgz .
From that, we can get the PMCIDs of the relevant articles, which can then be fed into a variant of oami_pmc_pmcid_import .
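As an illustration of the hourly check, the query URL for the most recent complete hour could be built as follows. The endpoint and the `from`/`until`/`format` parameters are copied from the example URL above; `hourly_query_url` is a hypothetical name, and the caller is assumed to pass `now` in whatever time zone the service expects.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

# Endpoint as it appears in the example URL in this issue.
OA_ENDPOINT = "http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi"


def hourly_query_url(now):
    """Build the OA query URL for the last complete hour before `now`."""
    until = now.replace(minute=0, second=0, microsecond=0)
    start = until - timedelta(hours=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    query = urlencode(
        {"from": start.strftime(fmt), "until": until.strftime(fmt), "format": "tgz"},
        safe=":",  # keep colons readable; spaces are encoded as "+"
    )
    return OA_ENDPOINT + "?" + query


if __name__ == "__main__":
    print(hourly_query_url(datetime(2013, 9, 2, 7, 25)))
```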
I would like this variant (which could just be a command-line option, e.g. -new) to
I am also thinking of separating the crawling, converting and uploading steps now, for which I will open a separate issue as https://github.com/erlehmann/open-access-media-importer/issues/95 .