yarl / pattypan

Upload files to Wikimedia Commons. The Spreadsheet Way.
https://commons.wikimedia.org/wiki/Commons:Pattypan
MIT License

Pattypan no longer checks for duplicate files using hashes before an upload #151

Open Abbe98 opened 2 years ago

Abbe98 commented 2 years ago

After migrating to a newer version of Wiki.java, we lost the feature that checked for duplicate files using hashes. We should bring it back, either by adding support for such a feature to Wiki.java or by implementing it on our end.

Keep in mind that the check also needs to work with the upload-by-URL feature.
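
For illustration, a minimal sketch of what an "on our end" check could look like. It is not the old Pattypan code and not a Wiki.java API. It assumes the MediaWiki `list=allimages` query with its `aisha1` parameter on the Commons endpoint; the class and method names are hypothetical. Hashing from an `InputStream` means the same routine could be fed a local file or a stream opened from a remote URL, although for upload by URL that implies fetching the bytes first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Pattypan or Wiki.java.
final class DuplicateCheckSketch {

    // Assumption: the public Wikimedia Commons API endpoint.
    static final String API = "https://commons.wikimedia.org/w/api.php";

    /** Streams the input through SHA-1 and returns the lowercase hex digest. */
    static String sha1Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Builds a query URL asking the wiki for files with this SHA-1.
     * A non-empty "allimages" list in the response would indicate
     * an exact duplicate.
     */
    static String duplicateQueryUrl(String sha1Hex) {
        return API + "?action=query&list=allimages&aisha1=" + sha1Hex + "&format=json";
    }

    public static void main(String[] args) {
        // Demo input; a real caller would pass Files.newInputStream(path)
        // for a local file, or a stream opened from the source URL.
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(duplicateQueryUrl(sha1Hex(in)));
    }
}
```

The request itself and the JSON parsing are left out on purpose, since how they should be wired in depends on whether the check ends up in Wiki.java or in Pattypan.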

don-vip commented 2 years ago

Hi! In case it helps, this is a feature I implemented in my tool with good results:

https://github.com/toolforge/tool-spacemedia/blob/master/sm-apps/sm-cronjobs/sm-downloader/src/main/java/org/wikimedia/commons/donvip/spacemedia/downloader/HashHelper.java

https://github.com/toolforge/tool-spacemedia/blob/master/sm-legacyapp/src/main/java/org/wikimedia/commons/donvip/spacemedia/service/MediaService.java#L116

It's based on https://github.com/KilianB/JImageHash, so it searches not only for an exact SHA-1 match but also by "perceptual" hash, which detects other duplicates as well.
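
To show why a perceptual hash catches duplicates that SHA-1 misses, here is a rough, self-contained sketch of a simple average hash. This is a swapped-in technique for illustration only: it is not JImageHash's API and not necessarily the algorithm used in the linked code, and all names are hypothetical. The image is reduced to an 8x8 grid of brightness values, each cell becomes one bit depending on whether it is brighter than the mean, and two images are compared by the Hamming distance between their 64-bit hashes.

```java
import java.awt.image.BufferedImage;

// Hypothetical illustration of an average hash (aHash); not JImageHash.
final class AverageHashSketch {

    /** Returns a 64-bit hash: bit i is set if grid cell i is brighter than the mean. */
    static long averageHash(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        double[] cells = new double[64];
        double total = 0;
        for (int cy = 0; cy < 8; cy++) {
            for (int cx = 0; cx < 8; cx++) {
                // Pixel bounds of this grid cell (at least one pixel wide and high).
                int x0 = cx * w / 8, x1 = Math.max(x0 + 1, (cx + 1) * w / 8);
                int y0 = cy * h / 8, y1 = Math.max(y0 + 1, (cy + 1) * h / 8);
                double sum = 0;
                for (int y = y0; y < y1; y++) {
                    for (int x = x0; x < x1; x++) {
                        int rgb = img.getRGB(x, y);
                        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                        sum += (r * 299 + g * 587 + b * 114) / 1000.0; // luminance
                    }
                }
                double mean = sum / ((x1 - x0) * (y1 - y0));
                cells[cy * 8 + cx] = mean;
                total += mean;
            }
        }
        double overallMean = total / 64;
        long hash = 0L;
        for (int i = 0; i < 64; i++) {
            if (cells[i] > overallMean) {
                hash |= 1L << i;
            }
        }
        return hash;
    }

    /** Number of differing bits; small values suggest visually similar images. */
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /** Demo helper: a 16x16 image whose left and right halves are two gray levels. */
    static BufferedImage split(int leftGray, int rightGray) {
        BufferedImage img = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                int v = x < 8 ? leftGray : rightGray;
                img.setRGB(x, y, (v << 16) | (v << 8) | v);
            }
        }
        return img;
    }
}
```

In this sketch, `split(255, 0)` and a slightly re-toned copy such as `split(250, 5)` have different bytes, and therefore different SHA-1 values, but hash to the same 64 bits. What distance threshold counts as a "duplicate" for real photos is a tuning question that this example does not answer.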

Abbe98 commented 2 years ago

I started a discussion upstream around more generic support for warnings: https://github.com/MER-C/wiki-java/issues/154#issuecomment-1035071621