Daniel-Mietchen closed this issue 10 years ago
Just ran
./oa-pmc-ids --from 2013-10-08 --until 2013-10-13 | ./oami_pmc_pmcid_import
which gave
...
Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786892/bin/pone.0076065.s001.avi>, already exists at <https://commons.wikimedia.org/w/api.php>.
Unknown, possibly non-free license: <None>
Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786914/bin/pone.0075952.s001.mpg>, already exists at <https://commons.wikimedia.org/w/api.php>.
Traceback (most recent call last):
File "./oa-get", line 187, in <module>
if mediawiki.is_uploaded(material):
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
assert len(first_sentence_of_caption) > 0
AssertionError
This error is the subject of https://github.com/erlehmann/open-access-media-importer/issues/84 . The point here, in https://github.com/erlehmann/open-access-media-importer/issues/111 , is that such an individual error should not stop the processing of other articles, but it currently does.
Moreover, the current handling of the loop does not let me find out which article triggered the "Unknown, possibly non-free license" message.
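A minimal sketch of the behaviour asked for here, not the importer's actual code: each ID is handled on its own, its name is printed before anything else, and a failure is recorded instead of ending the run. The function import_one is a hypothetical stand-in for `echo "$pmcid" | ./oami_pmc_pmcid_import`; it fails for one ID to imitate the AssertionError.

```shell
# Hypothetical stand-in for: echo "$pmcid" | ./oami_pmc_pmcid_import
# It fails for one ID to imitate the AssertionError in the traceback.
import_one() {
    if [ "$1" = "PMC3786914" ]; then
        echo "simulated failure" >&2
        return 1
    fi
    echo "imported $1"
}

failed=""
processed=0
for pmcid in PMC3786892 PMC3786914 PMC3783376; do
    # Print the ID first so every later message can be attributed to it.
    echo "=== $pmcid ==="
    # Record the failure and carry on with the next ID.
    if ! import_one "$pmcid"; then
        failed="$failed $pmcid"
    fi
    processed=$((processed + 1))
done
echo "processed: $processed, failed:$failed"
```

With the ID printed as a header, a warning such as the license message can be traced back to the article that produced it.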
I assume we can use xargs.
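A rough sketch of the xargs idea, under stated assumptions: the ID list is hard-coded here instead of coming from ./oa-pmc-ids, and the `sh -c` snippet is a hypothetical stand-in for the real per-ID import. With `-n 1`, xargs starts one process per ID, so an ordinary non-zero exit from one of them does not prevent the others from running.

```shell
# Stand-in for the output of ./oa-pmc-ids (IDs taken from the logs above).
ids="PMC3786892 PMC3786914 PMC3783376"

status=0
# -n 1: run the command once per ID. The inner script is a placeholder
# for the real import and fails for one ID on purpose.
output=$(printf '%s\n' $ids | xargs -n 1 sh -c '
    # $1 is the single ID handed over by xargs.
    if [ "$1" = "PMC3786914" ]; then
        echo "simulated failure for $1" >&2
        exit 1
    fi
    echo "imported $1"
' _) || status=$?

printf '%s\n' "$output"
echo "xargs exit status: $status"
```

The overall exit status is non-zero when any invocation failed, so a calling script can still notice that something went wrong. Note that xargs stops early if a command exits with status 255.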
Does using a read loop fix the problem? If so, I'll add it to the shell script.
./oa-pmc-ids --from 2013-10-14 --until 2013-10-15 | tr ' ' '\n' | while read -r; do echo "$REPLY" | ./oami_pmc_pmcid_import; done
Running it now. Looks good so far.
No problem so far. Running
./oa-cache clear-database pmc_pmcid | ./oa-pmc-ids --from 2013-10-04 --until 2013-10-15 | tr ' ' '\n' | while read -r; do echo $REPLY | ./oami_pmc_pmcid_import; ./oa-cache clear-database pmc_pmcid; done;
now.
Got stuck with https://github.com/erlehmann/open-access-media-importer/issues/18
Removing “/home/danielmietchen/.local/share/open-access-media-importer/pmc_pmcid.sqlite” … done.
Input PMCIDs, delimited by whitespace: Removing “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid/efetch.fcgi0” … done.
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3783376”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
100% |#########################################################################################################################################|
PLOS ONE 2013
Vocal Recruitment for Joint Travel in Wild Chimpanzees
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
“Vocal Recruitment for Joint Travel in Wild Chimpanzees”:
2 × /
Checking MIME types …
DOI 10.1371/journal.pone.0076073, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s001.wmv, source claimed / but is video/x-ms-asf.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
2 of 2 100% |###################################################################################################################| Time: 00:00:02
DOI 10.1371/journal.pone.0076073, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s002.pdf, source claimed / but is application/pdf.
Skipping download of <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s001.wmv>.
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3783376%2Fbin%2Fpone.0076073.s001.wmv”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3783376%2Fbin%2Fpone.0076073.s001.wmv.ogg” … 99% |##
So we will perhaps have to use timeout.
Just set up a cron job
22 7 * * * sh oami_pmc_pmcid_import.sh
with oami_pmc_pmcid_import.sh
consisting of
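#!/bin/bash
TIMEOUTFILE=timeout_pmcid.txt
# clear cache
./oa-cache clear-database pmc_pmcid
# import each PMCID separately so that one failure or hang cannot block the rest
for pmcid in $(./oa-pmc-ids --from $(date +"%F" -d '2 days ago') --until $(date +"%F")); do
    date
    # give up on a single article after 6 hours
    timeout 6h sh -c "echo $pmcid | ./oami_pmc_pmcid_import"
    # timeout exits with 124 when the limit was hit; [ ] also works when the script is run via sh
    if [ "$?" -eq 124 ]; then
        echo "------------------ Timed out! --------------------"
        echo "$pmcid" >> "$TIMEOUTFILE"
    fi
    ./oa-cache clear-database pmc_pmcid
done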
Daniel, if I put the above code into the repository, can the bug be closed?
I am now using a cron job
12 6 * * * cd ~/open-access-media-importer; sh oami_pmc_pmcid_import.sh | tee -a oami_pmc_pmcid_import.tee
where oami_pmc_pmcid_import.sh is
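#!/bin/bash
TIMEOUTFILE=timeout_pmcid.txt
# clear cache
./oa-cache clear-database pmc_pmcid
# import each PMCID separately so that one failure or hang cannot block the rest
for pmcid in $(./oa-pmc-ids --from $(date +"%F" -d '3 days ago') --until $(date +"%F")); do
    date
    # give up on a single article after 6 hours
    timeout 6h sh -c "echo $pmcid | ./oami_pmc_pmcid_import"
    # timeout exits with 124 when the limit was hit; [ ] also works when the script is run via sh
    if [ "$?" -eq 124 ]; then
        echo "------------------ Timed out! --------------------"
        echo "$pmcid" >> "$TIMEOUTFILE"
    fi
    ./oa-cache clear-database pmc_pmcid
done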
This works fine, so there is no need for your workaround in https://github.com/erlehmann/open-access-media-importer/issues/111#issuecomment-26286139 , and I am closing this issue.
For proper logging:
12 6 * * * cd ~/open-access-media-importer; sh oami_pmc_pmcid_import.sh 2>&1 | tee -a oami_pmc_pmcid_import.tee
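A small sketch of why the `2>&1` matters, using a hypothetical function noisy in place of the import script. Python tracebacks and warnings go to stderr, so without the redirection they would bypass tee and never reach the log file; `tee -a` appends, so earlier runs are kept.

```shell
logfile=$(mktemp)

# Hypothetical stand-in for the import script: one line on each stream.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# 2>&1 merges stderr into stdout before the pipe, so tee sees both.
# Two runs imitate two cron invocations appending to the same file.
noisy 2>&1 | tee -a "$logfile" >/dev/null
noisy 2>&1 | tee -a "$logfile" >/dev/null

cat "$logfile"
```

The log ends up with both streams from both runs, which is what is needed to trace an error back to a particular day's import.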
Moved here from https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-24017794 .