wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

AssertionError in is_uploaded #84

Open Daniel-Mietchen opened 11 years ago

Daniel-Mietchen commented 11 years ago

This one affects files that have been uploaded without a description (currently 110 - most recent numbers via http://tools.wmflabs.org/catscan2/quick_intersection.php?lang=commons&project=wikimedia&cats=Media+lacking+a+description%0D%0AUploaded+with+Open+Access+Media+Importer&ns=6&depth=12&max=30000&start=0&format=html ):

danielmietchen@files:~/open-access-media-importer$ echo 10.3389/fonc.2013.00097 | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3656359
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3656359”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################################################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
“Ovarian Tumor Attachment, Invasion, and Vascularization Reflect Unique Microenvironments in the Peritoneum: Insights from Xenograft and Mathematical Models”:
    5 × video/x-msvideo

Checking MIME types …
5 of 5 100% |###################################################################################################################################| Time: 00:00:10
Traceback (most recent call last):
  File "./oa-get", line 187, in <module>
    if mediawiki.is_uploaded(material):
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
    assert len(first_sentence_of_caption) > 0
AssertionError
danielmietchen@files:~/open-access-media-importer$


The files from this paper are up at http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie1.ogv

http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie2.ogv

http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie3.ogv

http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie4.ogv

http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie5.ogv

erlehmann commented 10 years ago

What do you suggest we do?

Daniel-Mietchen commented 10 years ago

Instead of assert len(first_sentence_of_caption) > 0, we should probably use a conditional that handles both cases (> 0 and == 0).

Here is another affected DOI: 10.3389/fpsyg.2013.00372 .

erlehmann commented 10 years ago

An assertion that a string has length zero or greater than zero is trivially true. Currently trying to reproduce.
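
To illustrate the point, here is a small sketch. The helper name is made up for this example and is not part of the importer. A check that accepts both a zero and a non-zero length passes for every possible string, so it cannot guard against anything:

```python
def trivially_true_check(first_sentence_of_caption):
    # Hypothetical helper, not part of the importer: it accepts a length
    # greater than zero *or* equal to zero, i.e. every possible string.
    n = len(first_sentence_of_caption)
    assert n > 0 or n == 0
    return True


# Both an empty and a non-empty caption pass the check.
for sample in ["", "Click here for additional data file."]:
    print(repr(sample), trivially_true_check(sample))
```

What is presumably meant instead is a branch that treats the two cases differently.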

erlehmann commented 10 years ago

desudesudesu ~/src/open-access-media-importer on master(2013.1-29-g1d6b0a3) tracking origin/master
1004 open-access-media-importer:master? % echo '10.3389/fpsyg.2013.00372' | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3693090
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3693090”, saving into directory “/home/erlehmann/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
“Speech vs. singing: infants choose happier sounds”:
    7 × audio/basic
Checking MIME types …
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio1.WAV, source claimed audio/basic but is audio/x-wav.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio2.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio4.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio5.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio6.WAV, source claimed audio/basic but is audio/x-wav.
7 of 7 100% |###################################################| Time: 00:01:06
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio7.WAV, source claimed audio/basic but is audio/x-wav.
Traceback (most recent call last):
  File "./oa-get", line 187, in <module>
    if mediawiki.is_uploaded(material):
  File "/home/erlehmann/src/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
    assert len(first_sentence_of_caption) > 0
AssertionError

erlehmann commented 10 years ago

As the XML file at http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3693090 actually shows a caption (albeit a nonsensical one: “Click here for additional data file.”), the assertion is entirely appropriate. I assume this to be a bug in the caption extraction. Working on it.
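
For readers following along, here is a minimal sketch of what extracting such a caption could look like. It is not the importer's actual extraction code. The XML fragment is a hypothetical one modeled on the JATS-style markup that PubMed Central serves, and the function name extract_caption is an assumption made for this example:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on JATS-style PMC XML; the real record
# for PMC3693090 may be structured differently.
SAMPLE = """
<supplementary-material id="SM1">
  <caption>
    <p>Click here for additional data file.</p>
  </caption>
  <media xlink:href="Audio1.WAV" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</supplementary-material>
"""


def extract_caption(element):
    """Return the whitespace-normalized text of the <caption> child, or ''."""
    caption = element.find("caption")
    if caption is None:
        return ""
    # itertext() also collects text nested inside <p>, <bold>, and so on,
    # which a plain caption.text lookup would miss.
    return " ".join("".join(caption.itertext()).split())


material = ET.fromstring(SAMPLE)
print(extract_caption(material))
```

Reading only the direct text of the caption element would yield nothing here, because the text sits inside a nested paragraph. That is one way an extraction step could report an empty caption even though the XML contains one.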

erlehmann commented 10 years ago

Empty captions fixed by commit 904a84f3b2f9fb45b877d150d26323a057673e0d.

erlehmann commented 10 years ago

Reopened as of 7c66e370ea192a4c643adeeb287e2badc8987190, see https://github.com/erlehmann/open-access-media-importer/issues/90#issuecomment-21675480.

Daniel-Mietchen commented 10 years ago

Sometimes, len(first_sentence_of_caption) really is 0. An assert is the wrong tool for handling that case, so we should use an if statement instead and avoid posting anything in the description field on the Commons file page when the caption is empty.

Not sure, though, whether the assumption of len(first_sentence_of_caption) being 0 is used somewhere downstream.
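
A rough sketch of the proposed conditional, under stated assumptions: caption_search_key is a hypothetical helper, and the sentence splitting shown here is a simplification. The real is_uploaded() in helpers/mediawiki.py may split sentences and query Commons differently.

```python
def caption_search_key(caption):
    """Return the first sentence of the caption, or None if there is none.

    Hypothetical helper for illustration only; not the importer's code.
    """
    # Naive sentence split (an assumption): everything before the first ". ".
    first_sentence_of_caption = caption.split(". ")[0].strip()
    if len(first_sentence_of_caption) > 0:
        return first_sentence_of_caption
    # Empty caption: report "no key" instead of raising AssertionError.
    return None


for caption in ["Click here for additional data file.", ""]:
    key = caption_search_key(caption)
    if key is None:
        print("no caption: skip the caption-based check")
    else:
        print("caption-based check would use:", key)
```

Returning None rather than an empty string forces the caller to decide explicitly what to do with a caption-less file, which is where the downstream question above would have to be answered.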

erlehmann commented 10 years ago

I assumed we should always have a description, so I introduced the assertion. Your proposal seems reasonable, but I am worried about it tripping up duplicate detection.

Daniel-Mietchen commented 10 years ago

Right now, in such cases, the description just reads

{{en|1=}}

This is not helpful to anyone and messes up Commons workflows for detecting files without proper description, since the description field then is technically not empty any more.
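
One possible way to avoid the empty template, sketched with a hypothetical helper (description_wikitext is not a function from the importer): emit the language template only when there is caption text, and otherwise leave the description field truly empty.

```python
def description_wikitext(caption, lang="en"):
    """Return '{{lang|1=caption}}', or '' when there is no caption text.

    Hypothetical helper for illustration only; not the importer's code.
    """
    text = caption.strip()
    if not text:
        # Leave the field empty so Commons can still flag the file
        # as lacking a description.
        return ""
    return "{{%s|1=%s}}" % (lang, text)


print(repr(description_wikitext("")))
print(repr(description_wikitext("Click here for additional data file.")))
```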

In terms of spoiling the current duplicate detection, that would probably apply only to cases when len(first_sentence_of_caption) is indeed 0.

Daniel-Mietchen commented 10 years ago

Another DOI causing this error: 10.3389/fncel.2013.00183 .

echo 10.3389/fncel.2013.00183 | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3801083
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3801083”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################################################################################################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
“Imaging neuron-glia interactions in the enteric nervous system”:
        4 × /

Checking MIME types …
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_1.AVI, source claimed / but is video/x-msvideo.                                 | ETA:  00:00:00
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_2.AVI, source claimed / but is video/x-msvideo.                                 | ETA:  00:00:02
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_3.AVI, source claimed / but is video/x-msvideo.                                 | ETA:  00:00:01
4 of 4 100% |###################################################################################################################################################################################| Time: 00:00:05
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_4.AVI, source claimed / but is video/x-msvideo.
Traceback (most recent call last):
  File "./oa-get", line 187, in <module>
    if mediawiki.is_uploaded(material):
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
    assert len(first_sentence_of_caption) > 0
AssertionError