wpoa / recitation-bot

MediaWiki bot to upload content to Wikimedia projects and update corresponding citations on Wikipedia.
GNU General Public License v3.0

Detect duplicates #34

Open Daniel-Mietchen opened 9 years ago

Daniel-Mietchen commented 9 years ago

Fig. 1 of https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Modelling_the_Species_Distribution_of_Flat-Headed_Cats_%28Prionailurus_planiceps%29_an_Endangered_South-East_Asian_Small_Felid&oldid=5032599 was imported into https://commons.wikimedia.org/wiki/File:Modelling-the-Species-Distribution-of-Flat-Headed-Cats-%28Prionailurus-planiceps%29-an-Endangered-South-pone.0009612.g001.jpg but the image there already existed (in higher resolution) as https://commons.wikimedia.org/wiki/File:Plionailurus_planiceps.png . According to Commons policies, our upload should thus be deleted.

In such cases, it would be best if we could (a) detect such a duplicate before upload and (b) post a message on that file's talk page with the proper metadata.

notconfusing commented 9 years ago

It's because the sizes are different. We have been over this problem before (though I can't find an issue for it). Without implementing computer vision algorithms it'll be difficult to detect. The other avenue we tried was to get PubMed to give us the maximum-resolution images they had, but after some time they responded that their API will not support this. So we need some fresh ideas.
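To illustrate why differing sizes defeat a simple check, here is a minimal sketch of an exact-match lookup. It is not the bot's code. It hashes the file bytes with SHA-1 and builds a query for the MediaWiki API's `list=allimages` module using its `aisha1` parameter. The helper names (`sha1_hex`, `build_duplicate_query`) and the `COMMONS_API` constant are assumptions made for illustration, and the two byte strings merely stand in for real image files.

```python
import hashlib
from urllib.parse import urlencode

# Assumed endpoint for Wikimedia Commons; other wikis use a different host.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of the raw file bytes."""
    return hashlib.sha1(data).hexdigest()


def build_duplicate_query(data: bytes) -> str:
    """Build a URL asking the wiki for files whose SHA-1 matches `data`."""
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1_hex(data),
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)


# Placeholder byte strings standing in for the same picture at two resolutions.
small = b"pixels of the low-resolution copy"
large = b"pixels of the high-resolution copy"

# Different bytes give different digests, so an exact lookup finds no match.
print(sha1_hex(small) == sha1_hex(large))  # False
```

A lookup like this only catches byte-identical files. A rescaled copy, such as the flat-headed cat figure, has a different digest and would slip through.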

Max Klein ‽ http://notconfusing.com/

On Sun, Sep 7, 2014 at 2:50 PM, Daniel Mietchen notifications@github.com wrote:

Assigned #34 https://github.com/wpoa/recitation-bot/issues/34 to @notconfusing https://github.com/notconfusing.


jure commented 9 years ago

I think you're right, and this problem won't be easily solved without some image similarity magic. There's a good list of (and discussion about) applicable open source solutions here: http://ejohn.org/blog/image-similarity-search-wanted

For Python specifically, this looks pretty useful: http://www.guguncube.com/1656/python-image-similarity-comparison-using-several-techniques
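As a rough sketch of one such technique, an "average hash" shrinks each image to a small grid, thresholds every cell against the grid mean, and compares the resulting bit strings by Hamming distance. This is only one of several possible approaches and is not taken from the linked posts. The version below is pure Python and works on grayscale pixel grids; decoding real image files into such grids (for example with an imaging library) is left out, and the function names and the synthetic test images are assumptions for illustration.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def shrink(img: Gray, size: int = 8) -> List[List[float]]:
    """Downscale to size x size by averaging the pixels in each cell."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)  # at least one source row
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)  # at least one source column
            cell = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def average_hash(img: Gray, size: int = 8) -> int:
    """Return a size*size-bit hash: 1 where a cell is brighter than the mean."""
    cells = [v for row in shrink(img, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | int(v > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits in which two hashes differ."""
    return bin(a ^ b).count("1")


def horizontal_gradient(n: int) -> Gray:
    """Synthetic n x n test image that brightens from left to right."""
    return [[255 * x // (n - 1) for x in range(n)] for _ in range(n)]


def vertical_gradient(n: int) -> Gray:
    """Synthetic n x n test image that brightens from top to bottom."""
    return [[255 * y // (n - 1) for _ in range(n)] for y in range(n)]


# The same pattern at two resolutions, plus an unrelated pattern.
big = average_hash(horizontal_gradient(64))
small = average_hash(horizontal_gradient(16))
other = average_hash(vertical_gradient(64))

print(hamming(big, small))  # small distance: likely the same picture
print(hamming(big, other))  # large distance: different pictures
```

In practice one would pick a distance threshold below which two files are flagged as probable duplicates for human review. The right threshold is an open question and would need tuning on real uploads.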

notconfusing commented 9 years ago

Thanks @jure, I've never looked into Python image similarity before. That seems to be a good starting point.