wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Duplication detected when file is actually not on Commons #128

Closed by Daniel-Mietchen 9 years ago

Daniel-Mietchen commented 10 years ago

I found no indication that the video actually exists on Commons, despite the duplicate detection telling me so.

danielmietchen@files:~/open-access-media-importer$ echo 10.3897/BDJ.1.e1013 | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3964625
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3964625”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
“Eupolybothrus cavernicolus Komerički & Stoev sp. n. (Chilopoda: Lithobiomorpha: Lithobiidae): the first eukaryotic species description combining transcriptomic, DNA barcoding and micro-CT imaging data”:
    1 × RAR Archive/rar
    2 × text/xml
    2 × /

Checking MIME types …
DOI 10.3897/BDJ.1.e1013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3964625/bin/BDJ.1.e1013-treatment1.xml, source claimed text/xml but is text/plain.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
DOI 10.3897/BDJ.1.e1013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3964625/bin/BDJ.1.e1013-treatment2.xml, source claimed text/xml but is text/plain.
DOI 10.3897/BDJ.1.e1013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3964625/bin/biodiversity_data_journal-1-e1013-s001.rar, source claimed RAR Archive/rar but is application/x-rar.
4 of 4 100% |###################################################| Time: 00:00:02
DOI 10.3897/BDJ.1.e1013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3964625/bin/biodiversity_data_journal-1-e1013-g021.mp4, source claimed / but is video/mp4.
Skipping <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3964625/bin/biodiversity_data_journal-1-e1013-g021.mp4>, already exists at Wikimedia Commons.

erlehmann commented 10 years ago

Would it help if the upload detection mechanism printed where on Commons it thinks the file is located?
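
A minimal sketch of what such a message could look like, assuming the duplicate check already yields the titles of the Commons pages it considers matches. The names `commons_page_url` and `duplicate_message` are hypothetical and not taken from the importer's code; the sketch is written for Python 3, whereas the log above shows Python 2.7.

```python
from urllib.parse import quote

COMMONS_WIKI = "https://commons.wikimedia.org/wiki/"


def commons_page_url(title):
    """Turn a page title such as 'File:Some video.ogv' into a Commons URL."""
    # MediaWiki page URLs use underscores in place of spaces.
    return COMMONS_WIKI + quote(title.replace(" ", "_"), safe=":/")


def duplicate_message(source_url, matching_titles):
    """Return a skip message naming the suspected duplicates.

    `matching_titles` is assumed to come from whatever duplicate check
    is in use. Returns None when nothing matched.
    """
    if not matching_titles:
        return None
    locations = ", ".join(commons_page_url(t) for t in matching_titles)
    return "Skipping <%s>, already exists at Wikimedia Commons: %s" % (
        source_url,
        locations,
    )
```

With output like this, a false positive such as the one reported here could be checked by opening the printed page directly.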

Daniel-Mietchen commented 10 years ago

Yes. But that wouldn't fix this one. Perhaps we could go for an option to manually override duplicate detection? That could also help with fixing conversion or metadata issues for files that we have already uploaded.

erlehmann commented 9 years ago

In the wmde-review branch there are now options for overriding everything:
• oa-cache convert-media can take “--force-conversion”
• oa-get download-media can take “--force-download”
• oa-put upload-media can take “--force-upload”
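
For illustration only, here is one way a flag like “--force-upload” could bypass the duplicate check. This is a sketch under assumptions, not the code in the wmde-review branch: the parser layout and the helper `should_upload` are hypothetical, and the real oa-put command line may take further arguments.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the flag name is taken from the comment above.
    parser = argparse.ArgumentParser(prog="oa-put")
    parser.add_argument("action", choices=["upload-media"])
    parser.add_argument(
        "--force-upload",
        action="store_true",
        help="upload even if duplicate detection reports a match",
    )
    return parser


def should_upload(detected_as_duplicate, force_upload):
    """Upload when forced, or when no duplicate was detected."""
    return force_upload or not detected_as_duplicate


if __name__ == "__main__":
    args = build_parser().parse_args()
    # In a real run, the first argument would come from the duplicate check.
    print(should_upload(True, args.force_upload))
```

The same pattern would apply to “--force-conversion” and “--force-download”: the flag short-circuits the "already done" test so the step runs again.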

erlehmann commented 9 years ago

Fixed in wmde-review branch.