Open Daniel-Mietchen opened 11 years ago
What do you suggest we do?
Instead of assert len(first_sentence_of_caption) > 0 , we should probably go for a conditional (>0 or =0).
Here is another affected DOI: 10.3389/fpsyg.2013.00372 .
An assertion that a string has a length of zero or greater than zero is trivially true. Currently trying to reproduce.
desudesudesu ~/src/open-access-media-importer on master(2013.1-29-g1d6b0a3) tracking origin/master
1004 open-access-media-importer:master? % echo '10.3389/fpsyg.2013.00372' | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3693090
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3693090”, saving into directory “/home/erlehmann/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
“Speech vs. singing: infants choose happier sounds”:
7 × audio/basic
Checking MIME types …
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio1.WAV, source claimed audio/basic but is audio/x-wav.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio2.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio4.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio5.WAV, source claimed audio/basic but is audio/x-wav.
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio6.WAV, source claimed audio/basic but is audio/x-wav.
7 of 7 100% |###################################################| Time: 00:01:06
DOI 10.3389/fpsyg.2013.00372, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio7.WAV, source claimed audio/basic but is audio/x-wav.
Traceback (most recent call last):
File "./oa-get", line 187, in <module>
if mediawiki.is_uploaded(material):
File "/home/erlehmann/src/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
assert len(first_sentence_of_caption) > 0
AssertionError
As the XML file at http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3693090 actually shows a caption (albeit a nonsensical one, “Click here for additional data file.”), the assertion is entirely appropriate. I assume this is a bug in the caption extraction. Working on it.
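For anyone trying to reproduce this, here is a rough sketch of how caption text could be pulled out of a supplementary-material element, including text nested in child elements. The XML fragment is made up for illustration and is not copied from the PMC record, and `caption_text` is a hypothetical name, not a function in the importer. It does not claim to show the actual cause of the bug.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only; the real PMC record may be
# structured differently.
SAMPLE = """<supplementary-material>
  <media><caption><p>Click here for additional data file.</p></caption></media>
</supplementary-material>"""


def caption_text(element):
    """Return the whitespace-normalized caption text, or '' if none exists.

    Hypothetical helper, not part of open-access-media-importer.
    """
    caption = element.find(".//caption")
    if caption is None:
        return ""
    # itertext() also collects text that sits inside nested tags such as <p>.
    return " ".join("".join(caption.itertext()).split())


print(caption_text(ET.fromstring(SAMPLE)))
```

Reading only the `text` attribute of the caption element would give nothing here, because the text sits inside the nested `<p>`. That is one way an extraction could come back empty even though the XML has a caption.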
Empty captions fixed by commit 904a84f3b2f9fb45b877d150d26323a057673e0d.
Reopened as of 7c66e370ea192a4c643adeeb287e2badc8987190, see https://github.com/erlehmann/open-access-media-importer/issues/90#issuecomment-21675480.
Sometimes len(first_sentence_of_caption) really is 0, and assert is the wrong tool for handling that case. We should use an if statement instead, so that nothing is posted in the description field on the Commons file page.
Not sure, though, whether the assumption of len(first_sentence_of_caption) being 0 is used somewhere downstream.
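A minimal sketch of the conditional approach, not the actual code in helpers/mediawiki.py. The function name `caption_search_term` and the naive split on the first period are assumptions for illustration. It returns None for an empty caption, so the caller can decide what to do instead of dying with an AssertionError.

```python
def caption_search_term(caption):
    """Return the first sentence of a caption, or None if there is none.

    Hypothetical helper; the real is_uploaded() may extract the sentence
    differently.
    """
    # Naive split on the first period, then strip surrounding whitespace.
    first_sentence_of_caption = (caption or "").split(".")[0].strip()
    if len(first_sentence_of_caption) > 0:
        return first_sentence_of_caption
    # Empty caption: signal "nothing to search for" instead of asserting.
    return None


for caption in ["Movie 1. Calcium imaging.", "", None]:
    print(repr(caption), "->", repr(caption_search_term(caption)))
```

Any downstream code that relies on the sentence being non-empty would then have to check for None explicitly, which is where the duplicate detection would need a decision.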
I assumed we should always have a description, so I introduced the assertion. Your proposal seems reasonable, but I am worried about it tripping up duplicate detection.
Right now, in such cases, the description just reads
{{en|1=}}
This is not helpful to anyone and messes up Commons workflows for detecting files without proper description, since the description field then is technically not empty any more.
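One possible way to avoid the empty template, as a sketch only: `description_field` is a hypothetical helper, and how the importer actually assembles the file page is not shown in this thread. The idea is to emit nothing at all when there is no caption, so that the description field stays empty and Commons can still flag the file.

```python
def description_field(caption, lang="en"):
    """Build the wikitext for the description field.

    Hypothetical helper, not the importer's real template code.
    """
    text = (caption or "").strip()
    if not text:
        # No caption: leave the field truly empty rather than "{{en|1=}}".
        return ""
    return "{{%s|1=%s}}" % (lang, text)


print(repr(description_field("A short caption.")))
print(repr(description_field("")))
```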
In terms of spoiling the current duplicate detection, that would probably apply only to cases when len(first_sentence_of_caption) is indeed 0.
Another DOI causing this error: 10.3389/fncel.2013.00183 .
echo 10.3389/fncel.2013.00183 | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3801083
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3801083”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################################################################################################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
“Imaging neuron-glia interactions in the enteric nervous system”:
4 × /
Checking MIME types …
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_1.AVI, source claimed / but is video/x-msvideo. | ETA: 00:00:00
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_2.AVI, source claimed / but is video/x-msvideo. | ETA: 00:00:02
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_3.AVI, source claimed / but is video/x-msvideo. | ETA: 00:00:01
4 of 4 100% |###################################################################################################################################################################################| Time: 00:00:05
DOI 10.3389/fncel.2013.00183, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3801083/bin/67045_Vanden_Berghe_Movie_4.AVI, source claimed / but is video/x-msvideo.
Traceback (most recent call last):
File "./oa-get", line 187, in <module>
if mediawiki.is_uploaded(material):
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
assert len(first_sentence_of_caption) > 0
AssertionError
This one affects files that have been uploaded without a description (currently 110 - most recent numbers via http://tools.wmflabs.org/catscan2/quick_intersection.php?lang=commons&project=wikimedia&cats=Media+lacking+a+description%0D%0AUploaded+with+Open+Access+Media+Importer&ns=6&depth=12&max=30000&start=0&format=html ):
danielmietchen@files:~/open-access-media-importer$ echo 10.3389/fonc.2013.00097 | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3656359
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3656359”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################################################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
param.append(processors[key](compiled_params[key]))
“Ovarian Tumor Attachment, Invasion, and Vascularization Reflect Unique Microenvironments in the Peritoneum: Insights from Xenograft and Mathematical Models”:
5 × video/x-msvideo
Checking MIME types …
5 of 5 100% |###################################################################################################################################| Time: 00:00:10
Traceback (most recent call last):
File "./oa-get", line 187, in <module>
if mediawiki.is_uploaded(material):
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 80, in is_uploaded
assert len(first_sentence_of_caption) > 0
AssertionError
danielmietchen@files:~/open-access-media-importer$
The files from this paper are up at http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie1.ogv
http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie2.ogv
http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie3.ogv
http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie4.ogv
http://commons.wikimedia.org/wiki/File:Ovarian-Tumor-Attachment-Invasion-and-Vascularization-Reflect-Unique-Microenvironments-in-the-43049_Jiang_Movie5.ogv