wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Duplication test failed #35

Closed Daniel-Mietchen closed 11 years ago

Daniel-Mietchen commented 11 years ago

10.1371/journal.pone.0047867:

The video was already on Commons at http://commons.wikimedia.org/wiki/File:Archer_fish_shooting_at_prey.ogv (archived at http://www.webcitation.org/6C3frhrTq ), but the bot uploaded it again at http://commons.wikimedia.org/wiki/File:How-Archer-Fish-Achieve-a-Powerful-Impact-Hydrodynamic-Instability-of-a-Pulsed-Jet-in-Toxotes-pone.0047867.s001.ogv (archived at http://www.webcitation.org/6C3fiQsip ). The duplicate therefore had to be nominated for deletion (cf. http://www.webcitation.org/6C3gDOkGW ), which is to be avoided.

Daniel-Mietchen commented 11 years ago

Some more examples: http://commons.wikimedia.org/wiki/File:Overview-on-the-Diversity-of-Sounds-Produced-by-Clownfishes-(Pomacentridae)-Importance-of-Acoustic-pone.0049179.s001.ogv and http://commons.wikimedia.org/wiki/File:Overview-on-the-Diversity-of-Sounds-Produced-by-Clownfishes-(Pomacentridae)-Importance-of-Acoustic-pone.0049179.s002.ogv were already there under http://commons.wikimedia.org/wiki/File:Amphiprion_frenatus_aggressive_sounds_-_journal.pone.0049179.s001.ogv and http://commons.wikimedia.org/wiki/File:Amphiprion_frenatus_submissive_sounds_-_journal.pone.0049179.s002.ogv

Daniel-Mietchen commented 11 years ago

Some more, each bot upload followed by the existing file it replicates:

http://commons.wikimedia.org/wiki/File:Rapid-Inversion-Running-Animals-and-Robots-Swing-like-a-Pendulum-under-Ledges-pone.0038003.s001.ogv replicates http://commons.wikimedia.org/wiki/File:Periplaneta_americana_performing_a_high-speed_inversion_on_a_ramp_(top_view)_-_Journal.pone.0038003.s001.ogv

http://commons.wikimedia.org/wiki/File:Rapid-Inversion-Running-Animals-and-Robots-Swing-like-a-Pendulum-under-Ledges-pone.0038003.s002.ogv replicates http://commons.wikimedia.org/wiki/File:Periplaneta_americana_performing_a_high-speed_inversion_on_a_ramp_(side_view)_-_Journal.pone.0038003.s002.ogv

http://commons.wikimedia.org/wiki/File:Rapid-Inversion-Running-Animals-and-Robots-Swing-like-a-Pendulum-under-Ledges-pone.0038003.s003.ogv replicates http://commons.wikimedia.org/wiki/File:Periplaneta_americana_attempting_to_perform_an_inversion_after_claw_ablation,_but_failing_-_Journal.pone.0038003.s003.ogv

http://commons.wikimedia.org/wiki/File:Rapid-Inversion-Running-Animals-and-Robots-Swing-like-a-Pendulum-under-Ledges-pone.0038003.s004.ogv replicates http://commons.wikimedia.org/wiki/File:Hemidactylus_platyurus_performing_a_high-speed_inversion_on_a_ramp_-_Journal.pone.0038003.s004.ogv

http://commons.wikimedia.org/wiki/File:Rapid-Inversion-Running-Animals-and-Robots-Swing-like-a-Pendulum-under-Ledges-pone.0038003.s005.ogv replicates http://commons.wikimedia.org/wiki/File:Hemidactylus_platyurus_performing_a_high-speed_inversion_on_a_leaf_-_Journal.pone.0038003.s005.ogv

http://commons.wikimedia.org/wiki/File:Rapid-Inversion-Running-Animals-and-Robots-Swing-like-a-Pendulum-under-Ledges-pone.0038003.s006.ogv replicates http://commons.wikimedia.org/wiki/File:Robot_running_at_high-speed_performing_rapid_inversion_-_Journal.pone.0038003.s006.ogv

Daniel-Mietchen commented 11 years ago

For duplicate detection, please keep in mind that there is no guarantee that a file uploaded by our bot under a given file name will keep that name - it is quite common for files to be renamed, so that name and content fit better.

Daniel-Mietchen commented 11 years ago

Another example where duplicate detection fails: 10.1371/journal.pone.0050188

All of its files had already been uploaded: http://commons.wikimedia.org/w/index.php?title=Special%3ASearch&profile=default&search=p114RhoGEF&fulltext=Search .

Daniel-Mietchen commented 11 years ago

Here's another case where duplicate detection failed: http://commons.wikimedia.org/wiki/File:What-the-hyenas-laugh-tells-Sex-age-dominance-and-individual-signature-in-the-giggling-call-of-1472-6785-10-9-S1.ogv , which is a duplicate of http://commons.wikimedia.org/wiki/File:Giggling_call_of_a_spotted_hyena_(Crocuta_crocuta)_-_1472-6785-10-9-S1.oga , which in turn has been renamed from http://commons.wikimedia.org/w/index.php?title=File:What-the-hyena%27s-laugh-tells-Sex-age-dominance-and-individual-signature-in-the-giggling-call-of-1472-6785-10-9-S1.ogv&redirect=no .

This quickly gets annoying if, as here, it affects not just -S1 but the whole sequence up to -S9.ogv.

Note that the title contains an apostrophe, which should have remained there.

erlehmann commented 11 years ago

I think that we get those false negatives because we search for duplicates using the supplementary material label. For 10.1371/journal.pone.0047867, for example, the original upload did not include “Video S1”.

Daniel-Mietchen commented 11 years ago

I am now thinking of a more robust system for duplicate detection, one that allows for file descriptions to be updated.

Two main avenues come to mind: (1) We could have a list somewhere (DB, GitHub, wiki) that, for each PMCID, contains a list of all the associated supplementary audio and video files as named on PMC and on Commons. Perhaps a flag could be set if all are on Commons. (2) We could think of adding a comment line (that contains, e.g., the original file name from PMC) to the respective pages on Commons.
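Avenue (1) could be sketched roughly as follows. This is only an illustration, not OAMI code: the table layout and the function names (`open_registry`, `register`, `all_on_commons`) are assumptions made up for this example, and SQLite is just one of the storage options mentioned above.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) a hypothetical registry of supplementary media files."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS media (
               pmcid         TEXT NOT NULL,  -- e.g. 'PMC3631234'
               pmc_name      TEXT NOT NULL,  -- file name as served by PMC
               commons_title TEXT,           -- NULL until the file is on Commons
               PRIMARY KEY (pmcid, pmc_name)
           )"""
    )
    return conn


def register(conn, pmcid, pmc_name, commons_title=None):
    """Record a PMC file; pass commons_title once it has been uploaded."""
    conn.execute(
        "INSERT OR REPLACE INTO media VALUES (?, ?, ?)",
        (pmcid, pmc_name, commons_title),
    )


def all_on_commons(conn, pmcid):
    """The 'flag': True only if every known file of the article has a Commons title."""
    total, uploaded = conn.execute(
        "SELECT COUNT(*), COUNT(commons_title) FROM media WHERE pmcid = ?",
        (pmcid,),
    ).fetchone()
    return total > 0 and total == uploaded
```

Because the Commons title is stored as data and not derived from the upload name, a rename on Commons would only require updating one row; the lookup by PMCID and PMC file name stays stable.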

Daniel-Mietchen commented 11 years ago

While re-reading this thread, I noticed that it does not contain any hint at an obvious way to check for duplicates: the file name that we use for uploading always contains a portion of the DOI plus the identifier of the file within the article (e.g. "pone.0061541" and "s005" in http://commons.wikimedia.org/wiki/File:Dual-Action-of-BPC194-A-Membrane-Active-Peptide-Killing-Bacterial-Cells-pone.0061541.s005.ogv ), linked by a dot and followed by the dot before the final file extension. Shouldn't this be robust enough to detect duplicates?

Just this week, http://commons.wikimedia.org/wiki/File:A-Simple-Sign-for-Recognizing-Off-Axis-OCT-Measurement-Beam-Placement-in-the-Context-of-Multicentre-pone.0048222.s003.ogv was uploaded, even though http://commons.wikimedia.org/wiki/File:A-Simple-Sign-for-Recognizing-Off%E2%80%93Axis-OCT-Measurement-Beam-Placement-in-the-Context-of-Multicentre-pone.0048222.s003.ogv was already there.
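A minimal sketch of that name-based check, assuming PLOS-style suffixes such as "pone.0061541.s005". The regular expression and the helper name `upload_key` are hypothetical and not taken from the OAMI code base; BMC-style names such as "1472-6785-10-9-S1" would need a separate pattern.

```python
import re
from urllib.parse import unquote

# <journal code>.<article number>.s<file number>, directly before the extension.
SUFFIX_RE = re.compile(r"([A-Za-z0-9]+\.\d+\.s\d+)\.[A-Za-z0-9]+$")


def upload_key(title_or_url):
    """Return the DOI-derived suffix of a Commons file name, or None."""
    name = unquote(title_or_url)  # decode %E2%80%93 and similar escapes
    match = SUFFIX_RE.search(name)
    return match.group(1).lower() if match else None


a = "http://commons.wikimedia.org/wiki/File:A-Simple-Sign-for-Recognizing-Off-Axis-OCT-Measurement-Beam-Placement-in-the-Context-of-Multicentre-pone.0048222.s003.ogv"
b = "http://commons.wikimedia.org/wiki/File:A-Simple-Sign-for-Recognizing-Off%E2%80%93Axis-OCT-Measurement-Beam-Placement-in-the-Context-of-Multicentre-pone.0048222.s003.ogv"

# Both names reduce to the same key even though the titles differ in one dash.
print(upload_key(a) == upload_key(b))  # True
```

Comparing keys instead of full titles sidesteps differences in how the article title was transliterated. A file whose name carries no such suffix (for example File:Archer_fish_shooting_at_prey.ogv) yields `None` and would still need another detection route.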

erlehmann commented 11 years ago

With that solution OAMI would neither detect files uploaded by other entities nor detect its own uploads if their names are changed.

Daniel-Mietchen commented 11 years ago

Name changes on Commons will always leave a redirect, so the page with the DOI-based ending would exist and be detectable. Example: http://commons.wikimedia.org/w/index.php?title=File:Localized-Brain-Activation-Related-to-the-Strength-of-Auditory-Learning-in-a-Parrot-pone.0038803.s002.ogv&redirect=no .
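The redirect argument could be checked along these lines. This is a sketch under assumptions: it uses the MediaWiki Action API's `action=query` with the default JSON format, where a nonexistent page carries a `missing` key; the helper names are made up for this example, and the network request itself is left out so the logic can be read in isolation.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def existence_query_url(title):
    """Build a query URL asking whether a page with this exact title exists.

    No 'redirects' parameter is passed, so a redirect left behind by a
    rename is reported as an ordinary existing page.
    """
    params = {"action": "query", "titles": title, "format": "json"}
    return API + "?" + urlencode(params)


def title_exists(response):
    """Interpret a decoded JSON response (default format, pages keyed by id)."""
    pages = response.get("query", {}).get("pages", {})
    return any(
        "missing" not in page and "invalid" not in page
        for page in pages.values()
    )
```

With this check the original upload name keeps answering "exists" after a rename, which is the property the comment above relies on. It would stop holding if a redirect were deleted later.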

There are on the order of 10 multimedia files that are available via PMC and that have been uploaded to Commons neither through the OAMI nor by me. All of them got there before OAMI came around, and their uploaders are likely aware of the tool by now.

The improper duplicate detection that we have now affects thousands of files.

Another option to improve duplicate detection would be to record the source file's URL somewhere in the metadata (e.g. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3631234/bin/pone.0062199.s002.mov in http://commons.wikimedia.org/wiki/File:Role-of-Sensory-Experience-in-Functional-Development-of-Drosophila-Motor-Circuits-pone.0062199.s002.ogv ). But that would only work reasonably well if we were to edit all existing OAMI-created pages accordingly.
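One way to record the source URL would be a machine-readable comment in the description page's wikitext. The marker format `<!-- oami-source: ... -->` and both helper names below are purely hypothetical; nothing in this thread says OAMI writes such a marker.

```python
import re

# Hypothetical marker; HTML comments are not rendered on the file page.
MARKER_RE = re.compile(r"<!--\s*oami-source:\s*(\S+)\s*-->")


def extract_source(wikitext):
    """Return the recorded source URL, or None if the page has no marker."""
    match = MARKER_RE.search(wikitext)
    return match.group(1) if match else None


def add_source_marker(wikitext, source_url):
    """Append the marker unless the page already carries one."""
    if extract_source(wikitext) is not None:
        return wikitext
    return wikitext.rstrip("\n") + "\n<!-- oami-source: %s -->\n" % source_url
```

The marker survives renames because it lives in the page content, not in the title. As noted above, it only helps for pages that actually contain it, so existing OAMI-created pages would have to be edited once to add it.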

Daniel-Mietchen commented 11 years ago

Changed priority to "Do it now", as described in https://github.com/erlehmann/open-access-media-importer/issues/35#issuecomment-17107447 .

Daniel-Mietchen commented 11 years ago

Much better now. Still doing some more tests.

erlehmann commented 11 years ago

There is now a script:

echo '10.1371/journal.pone.0047867' | ./oami_pmc_doi_detect_duplicates

erlehmann commented 11 years ago

Seems the new duplicate detection has everything covered:

desudesudesu ~/src/open-access-media-importer on master(2013.1-26-g6071d68) tracking origin/master
1032 open-access-media-importer:master? % ./oami_pmc_doi_detect_duplicates_test
Removing “/home/erlehmann/.local/share/open-access-media-importer/pmc_doi.sqlite” … done.
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3631234, 3631201, 3501466, 3493550, 3480456, 3372503, 2859383
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3631234&id=3631201&id=3501466&id=3493550&id=3480456&id=3372503&id=2859383”, saving into directory “/home/erlehmann/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |########################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
“Role of Sensory Experience in Functional Development of Drosophila Motor Circuits”:
    1 × video/quicktime
    1 × image/tiff
“Dual Action of BPC194: A Membrane Active Peptide Killing Bacterial Cells”:
    5 × video/x-msvideo
    1 × application/msword
“Stimulation of Cortical Myosin Phosphorylation by p114RhoGEF Drives Cell Migration and Tumor Cell Invasion”:
    6 × video/quicktime
    2 × image/tiff
“A Simple Sign for Recognizing Off-Axis OCT Measurement Beam Placement in the Context of Multicentre Studies”:
    1 × video/x-ms-wmv
    2 × application/msword
“How Archer Fish Achieve a Powerful Impact: Hydrodynamic Instability of a Pulsed Jet in Toxotes jaculatrix”:
    1 × video/quicktime
    1 × image/tiff
“Localized Brain Activation Related to the Strength of Auditory Learning in a Parrot”:
    2 × text/plain
“What the hyena's laugh tells: Sex, age, dominance and individual signature in the giggling call of Crocuta crocuta”:
    8 × audio/wav
Checking MIME types …
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
DOI 10.1371/journal.pone.0048222, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3493550/bin/pone.0048222.s003.wmv, source claimed video/x-ms-wmv but is video/x-ms-asf.
DOI 10.1371/journal.pone.0038803, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3372503/bin/pone.0038803.s001.m4a, source claimed text/plain but is audio/mp4.
DOI 10.1371/journal.pone.0038803, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3372503/bin/pone.0038803.s002.m4a, source claimed text/plain but is audio/mp4.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S1.WAV, source claimed audio/wav but is audio/x-wav.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S2.WAV, source claimed audio/wav but is audio/x-wav.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S3.WAV, source claimed audio/wav but is audio/x-wav.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S4.WAV, source claimed audio/wav but is audio/x-wav.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S5.WAV, source claimed audio/wav but is audio/x-wav.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S6.WAV, source claimed audio/wav but is audio/x-wav.
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S7.WAV, source claimed audio/wav but is audio/x-wav.
31 of 31 100% |################################################| Time: 00:02:06
DOI 10.1186/1472-6785-10-9, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859383/bin/1472-6785-10-9-S8.WAV, source claimed audio/wav but is audio/x-wav.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processors[key](compiled_params[key]))
[X] 10.1371/journal.pone.0062199  Movie S1
[ ] 10.1371/journal.pone.0061541  Movie S1
[ ] 10.1371/journal.pone.0061541  Movie S2
[ ] 10.1371/journal.pone.0061541  Movie S3
[X] 10.1371/journal.pone.0061541  Movie S4
[X] 10.1371/journal.pone.0061541  Movie S5
[X] 10.1371/journal.pone.0050188  Movie S1
[X] 10.1371/journal.pone.0050188  Movie S2
[X] 10.1371/journal.pone.0050188  Movie S3
[X] 10.1371/journal.pone.0050188  Movie S4
[X] 10.1371/journal.pone.0050188  Movie S5
[X] 10.1371/journal.pone.0050188  Movie S6
[X] 10.1371/journal.pone.0048222  Video S1
[X] 10.1371/journal.pone.0047867  Video S1
[X] 10.1186/1472-6785-10-9 Additional file 1 
[X] 10.1186/1472-6785-10-9 Additional file 2 
[X] 10.1186/1472-6785-10-9 Additional file 3 
[X] 10.1186/1472-6785-10-9 Additional file 4 
[X] 10.1186/1472-6785-10-9 Additional file 5 
[X] 10.1186/1472-6785-10-9 Additional file 6 
[X] 10.1186/1472-6785-10-9 Additional file 7 
[X] 10.1186/1472-6785-10-9 Additional file 8 
desudesudesu ~/src/open-access-media-importer on master(2013.1-26-g6071d68) tracking origin/master
1033 open-access-media-importer:master? %  2013-05-25 16:05:55 erlehmann pts/5

Daniel-Mietchen commented 11 years ago

10.1371/journal.pbio.1001566 just failed - all files already up (via oami_pmc_pmcid_import) at http://commons.wikimedia.org/wiki/File:Elimination-of-Self-Reactive-T-Cells-in-the-Thymus-A-Timeline-for-Negative-Selection-pbio.1001566.s005.ogv but oami_pmc_doi_import still pretended to "upload" them anew.

Daniel-Mietchen commented 11 years ago

Same situation with 10.1371/journal.pcbi.1003069.

Daniel-Mietchen commented 11 years ago

Another case: 10.1371/journal.pone.0011506