wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons

oami_pmc_pmcid_import should loop over the PMCIDs found by oa-pmc-ids #111

Daniel-Mietchen closed this issue 10 years ago

Daniel-Mietchen commented 10 years ago

Moved here from https://github.com/erlehmann/open-access-media-importer/issues/94#issuecomment-24017794 .

Daniel-Mietchen commented 10 years ago

Just ran

./oa-pmc-ids --from 2013-10-08 --until 2013-10-13 | ./oami_pmc_pmcid_import

which gave

Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786892/bin/pone.0076065.s001.avi>, already exists at <https://commons.wikimedia.org/w/api.php>.
Unknown, possibly non-free license: <None>
Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786914/bin/pone.0075952.s001.mpg>, already exists at <https://commons.wikimedia.org/w/api.php>.
Traceback (most recent call last):
  File "./oa-get", line 187, in <module>
    if mediawiki.is_uploaded(material):
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
    assert len(first_sentence_of_caption) > 0

This error is the subject of https://github.com/erlehmann/open-access-media-importer/issues/84. The point of this issue (https://github.com/erlehmann/open-access-media-importer/issues/111) is that a single failing article should not stop the processing of the remaining ones, which it currently does.

Moreover, the current handling of the looping does not allow me to find out which article had the "Unknown, possibly non-free license: " and whether that statement is correct.
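
One way to attribute such messages is to run the importer once per PMCID and prefix every output line with that ID. The following is only a sketch under that assumption: `import_tagged` is a hypothetical helper, not part of the repository, and it relies only on the importer reading whitespace-delimited PMCIDs from standard input.

```shell
# Hypothetical helper (not in the repository): tag each importer output
# line with the PMCID that produced it.
import_tagged() {
  # $1: path to the importer; defaults to the script discussed here
  importer="${1:-./oami_pmc_pmcid_import}"
  # Put one PMCID per line, then handle them one at a time.
  tr -s '[:space:]' '\n' | while read -r pmcid; do
    [ -n "$pmcid" ] || continue
    # Merge stderr into stdout so warnings get the prefix as well.
    printf '%s\n' "$pmcid" | "$importer" 2>&1 | sed "s|^|[$pmcid] |"
  done
}

# Intended usage:
#   ./oa-pmc-ids --from 2013-10-08 --until 2013-10-13 | import_tagged
```

With this, a line such as "Unknown, possibly non-free license" would carry the PMCID of the article it belongs to, and a crash in one importer run would not end the loop.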

erlehmann commented 10 years ago

I assume we can use xargs.
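
A rough sketch of what the xargs variant could look like, assuming the importer keeps reading its PMCIDs from standard input (so each ID has to be piped in rather than passed as an argument). `import_each` is a hypothetical name, and the failure message is illustrative only.

```shell
# Hypothetical helper: start one importer process per PMCID via xargs.
import_each() {
  # $1: path to the importer; defaults to the script discussed here
  importer="${1:-./oami_pmc_pmcid_import}"
  # xargs -n 1 calls: sh -c '<script>' _ <importer> <pmcid>
  # Inside the script, $1 is the importer and $2 is a single PMCID.
  xargs -n 1 sh -c '
    printf "%s\n" "$2" | "$1" || echo "import failed for $2" >&2
  ' _ "$importer"
}

# Intended usage:
#   ./oa-pmc-ids --from 2013-10-08 --until 2013-10-13 | import_each
```

Because the inner shell reports the failure and then exits with status 0, xargs carries on with the next PMCID instead of stopping.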

erlehmann commented 10 years ago

Does using a read loop fix the problem? If so, i'll add it to the shell script.

./oa-pmc-ids --from 2013-10-14 --until 2013-10-15 | tr ' ' '\n' | while read -r pmcid; do echo "$pmcid" | ./oami_pmc_pmcid_import; done

Daniel-Mietchen commented 10 years ago

Running it now. Looks good so far.

Daniel-Mietchen commented 10 years ago

No problem so far. Running

./oa-cache clear-database pmc_pmcid | ./oa-pmc-ids --from 2013-10-04 --until 2013-10-15 | tr ' ' '\n' | while read -r; do echo $REPLY | ./oami_pmc_pmcid_import; ./oa-cache clear-database pmc_pmcid; done;


Daniel-Mietchen commented 10 years ago

Got stuck with https://github.com/erlehmann/open-access-media-importer/issues/18

Removing “/home/danielmietchen/.local/share/open-access-media-importer/pmc_pmcid.sqlite” … done.
Input PMCIDs, delimited by whitespace: Removing “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid/efetch.fcgi0” … done.
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3783376”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_pmcid” …
100% |#########################################################################################################################################|
        Vocal Recruitment for Joint Travel in Wild Chimpanzees
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
“Vocal Recruitment for Joint Travel in Wild Chimpanzees”:
        2 × /

Checking MIME types …
DOI 10.1371/journal.pone.0076073, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s001.wmv, source claimed / but is video/x-ms-asf.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
2 of 2 100% |###################################################################################################################| Time: 00:00:02
DOI 10.1371/journal.pone.0076073, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s002.pdf, source claimed / but is application/pdf.
Skipping download of <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3783376/bin/pone.0076073.s001.wmv>.
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3783376%2Fbin%2Fpone.0076073.s001.wmv”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3783376%2Fbin%2Fpone.0076073.s001.wmv.ogg” …   99% |##                    

so we will have to use timeout perhaps.
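
A small sketch of how such a limit could be applied, assuming the `timeout` command from GNU coreutils, which exits with status 124 when the limit is reached. `run_with_limit` is a hypothetical wrapper, not something from the repository.

```shell
# Hypothetical wrapper: run a command under a time limit and report
# when the limit was hit.
run_with_limit() {
  limit="$1"
  shift
  timeout "$limit" "$@"
  status=$?
  # GNU timeout returns 124 if the command was still running at the limit.
  if [ "$status" -eq 124 ]; then
    echo "timed out after $limit: $*" >&2
  fi
  return "$status"
}

# Intended usage, one PMCID at a time:
#   run_with_limit 6h sh -c 'echo PMC3783376 | ./oami_pmc_pmcid_import'
```

A conversion that hangs at 99% would then be killed after the limit, and the caller can check for status 124 to record which article was affected.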

Daniel-Mietchen commented 10 years ago

Just set up a cron job

22 7 * * * sh oami_pmc_pmcid_import.sh

with oami_pmc_pmcid_import.sh consisting of



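# clear cache
./oa-cache clear-database pmc_pmcid

# TIMEOUTFILE is not defined in this excerpt; it has to be set before the loop runs.
for pmcid in $(./oa-pmc-ids --from "$(date +"%F" -d '2 days ago')" --until "$(date +"%F")"); do
  timeout 6h sh -c "echo $pmcid | ./oami_pmc_pmcid_import"
  # timeout exits with status 124 when the time limit was hit
  if [ $? -eq 124 ]; then
    echo "------------------ Timed out! --------------------"
    echo "$pmcid" >> "$TIMEOUTFILE"
  fi
  # clear the cache again before the next article
  ./oa-cache clear-database pmc_pmcid
done

erlehmann commented 10 years ago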

Daniel, if i put the above code into the repository, can the bug be closed?

Daniel-Mietchen commented 10 years ago

I am now using a cronjob

12 6 * * * cd ~/open-access-media-importer; sh oami_pmc_pmcid_import.sh | tee -a oami_pmc_pmcid_import.tee

where oami_pmc_pmcid_import.sh is



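# clear cache
./oa-cache clear-database pmc_pmcid

# TIMEOUTFILE is not defined in this excerpt; it has to be set before the loop runs.
for pmcid in $(./oa-pmc-ids --from "$(date +"%F" -d '3 days ago')" --until "$(date +"%F")"); do
  timeout 6h sh -c "echo $pmcid | ./oami_pmc_pmcid_import"
  # timeout exits with status 124 when the time limit was hit
  if [ $? -eq 124 ]; then
    echo "------------------ Timed out! --------------------"
    echo "$pmcid" >> "$TIMEOUTFILE"
  fi
  # clear the cache again before the next article
  ./oa-cache clear-database pmc_pmcid
done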

This works fine, so there is no need for your workaround in https://github.com/erlehmann/open-access-media-importer/issues/111#issuecomment-26286139 , and I am closing this issue.

erlehmann commented 10 years ago

For proper logging:

12 6 * * * cd ~/open-access-media-importer; sh oami_pmc_pmcid_import.sh 2>&1 | tee -a oami_pmc_pmcid_import.tee
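
To illustrate why the `2>&1` matters, here is a small self-contained sketch. `noisy` is a made-up stand-in for the import script, and the temporary files stand in for the `.tee` log.

```shell
# Stand-in for the import script: writes one line to each stream.
noisy() {
  echo "progress message"
  echo "error message" >&2
}

with_stderr=$(mktemp)
without_stderr=$(mktemp)

# stderr is merged into stdout before the pipe, so tee records both lines.
noisy 2>&1 | tee -a "$with_stderr" >/dev/null

# Without the redirection only stdout reaches tee; the error line bypasses
# the pipe and is missing from the log file.
noisy | tee -a "$without_stderr" >/dev/null
```

After running this, only the first file contains "error message", which is the difference between the two cron lines in this thread.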