wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

Naming of cached files #83

Closed: Daniel-Mietchen closed this issue 11 years ago

Daniel-Mietchen commented 11 years ago

Most files have a name that is unique enough to be used for duplicate detection in the respective cache directory. Some, however, do not.

I am pasting in an example below. It features a "Movie1.MP4", and since the cache directory already has a "Movie1.MP4.ogv", the duplicate detection assumes the file has already been converted. Alas, it hasn't, and the script then attempts to upload the old file (which is over 300 MB and thus not uploaded, as per https://github.com/erlehmann/open-access-media-importer/issues/22 ). I have no idea how many such false uploads have already happened, but I would suspect on the order of 10.

I have provisionally renamed the original one to Movie1.MP4-306693161.ogv (the number is simply its size in bytes; I did not see an easy way to determine the DOI), but we should check and modify the workflows here to ensure that the files in the cache directories always have unique file names, preferably ones that correspond closely to the DOI.
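
The collision can be reproduced in a few lines. This is only a sketch: `refined_name` is a hypothetical stand-in for however the importer derives cache names (assumed here to be the last URL path segment plus ".ogv"), and the second URL is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def refined_name(url):
    """Hypothetical cache naming: last URL path segment plus '.ogv'."""
    return posixpath.basename(urlsplit(url).path) + ".ogv"


# The first URL is the one from this report; the second is invented.
first = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4"
second = "http://example.org/another-article/Movie1.MP4"

# An older conversion of the second file already sits in the cache.
cache = {refined_name(second)}

# A name-based duplicate check now wrongly reports the first file as converted.
already_converted = refined_name(first) in cache
print(refined_name(first), already_converted)
```

Both URLs map to the same cache name, so any check based on the name alone cannot tell the two files apart.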


So 30. Jun 00:34:34 CEST 2013
doi: 10.3389/fnsys.2013.00023
Removing “/home/danielmietchen/.local/share/open-access-media-importer/pmc_doi.sqlite” … done.
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3691547
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3691547”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” … 100%
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processorskey)
“Laminar firing and membrane dynamics in four visual areas exposed to two objects moving to occlusion”: 3 × video/quicktime

Checking MIME types …
DOI 10.3389/fnsys.2013.00023, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4, source claimed video/quicktime but is video/mp4.
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processorskey)
3 of 3 100% Time: 00:00:04
DOI 10.3389/fnsys.2013.00023, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie3.MPG, source claimed video/quicktime but is video/mpeg.
Downloading http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi” … 100%
Downloading http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie2.MOV, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi” … 100%
Downloading http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie3.MPG, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi” … 100%
Skipping conversion of “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi/Movie1.MP4”, exists at “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv”.
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi/Movie2.MOV”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie2.MOV.ogv” … done.
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi/Movie3.MPG”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie3.MPG.ogv” …
report (00:00:05): 10 / 19 seconds (52,6 %)
report (00:00:08): 18 / 19 seconds (94,7 %)
done.
Authenticating with http://commons.wikimedia.org/w/api.php.
^CTraceback (most recent call last):
  File "./oa-put", line 111, in <module>
    mediawiki.upload(media_refined_path, wiki_filename, page_template)
  File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 116, in upload
    comment = 'Automatically uploaded media file from [[:en:Open access|Open Access]] source. Please report problems or suggestions [[User talk:Open Access Media Importer Bot|here]].'
  File "/home/danielmietchen/open-access-media-importer/helpers/wikitools/wikifile.py", line 228, in upload
    req = api.APIRequest(self.site, params, write=True, multipart=bool(fileobj))
  File "/home/danielmietchen/open-access-media-importer/helpers/wikitools/api.py", line 71, in __init__
    self.encodeddata = self.encodeddata + singledata
KeyboardInterrupt

danielmietchen@files:~/open-access-media-importer$ ls -l /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv
-rw-r--r-- 1 danielmietchen danielmietchen 306693161 Jun 6 04:51 /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv
danielmietchen@files:~/open-access-media-importer$ mv /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4-306693161.ogv

Daniel-Mietchen commented 11 years ago

The files from that example article are now up: http://commons.wikimedia.org/wiki/File:Laminar-firing-and-membrane-dynamics-in-four-visual-areas-exposed-to-two-objects-moving-to-occlusion-Movie1.ogv

http://commons.wikimedia.org/wiki/File:Laminar-firing-and-membrane-dynamics-in-four-visual-areas-exposed-to-two-objects-moving-to-occlusion-Movie2.ogv

http://commons.wikimedia.org/wiki/File:Laminar-firing-and-membrane-dynamics-in-four-visual-areas-exposed-to-two-objects-moving-to-occlusion-Movie3.ogv

Daniel-Mietchen commented 11 years ago

Perhaps it's best to always take the filename as it is and then move the file (in the cache directory) to a new name that contains the part of the DOI that comes after the slash. If we do this after the upload, the current workflow would not even have to be modified, except for cases where the conversion (cf. https://github.com/erlehmann/open-access-media-importer/issues?labels=GStreamer&page=1&state=open ) or the upload (cf. https://github.com/erlehmann/open-access-media-importer/issues/22 ) fails.

For the upload, we can stick to the file name we are using now, since the article title is probably good enough for disambiguation.
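
The DOI-based renaming suggested above might look roughly like the following sketch. The function name `doi_cache_name`, the "-" separator, and the replacement of any further slashes with "_" are assumptions made for illustration, not part of the importer.

```python
import posixpath
from urllib.parse import urlsplit


def doi_cache_name(doi, url):
    """Prefix the original file name with the DOI part after the first slash."""
    _, _, suffix = doi.partition("/")
    if not suffix:
        raise ValueError("DOI has no suffix after the slash: %r" % doi)
    # Assumption: any further slashes in the suffix are replaced,
    # since they cannot appear inside a file name.
    suffix = suffix.replace("/", "_")
    return suffix + "-" + posixpath.basename(urlsplit(url).path)


name = doi_cache_name(
    "10.3389/fnsys.2013.00023",
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4",
)
print(name)  # fnsys.2013.00023-Movie1.MP4
```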

erlehmann commented 11 years ago

For temporary storage, I am going to Base64-encode the URL of the file and append the file name extension. This should be unique and useful enough at the same time.

erlehmann commented 11 years ago

This proposal means that the local filename for http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV becomes “aHR0cDovL3d3dy5uY2JpLm5sbS5uaWguZ292L3BtYy9hcnRpY2xlcy9QTUMzNjkzMDkwL2Jpbi9BdWRpbzMuV0FW.WAV”. The mapping is unique and reversible – the source URL can be found by

echo 'aHR0cDovL3d3dy5uY2JpLm5sbS5uaWguZ292L3BtYy9hcnRpY2xlcy9QTUMzNjkzMDkwL2Jpbi9BdWRpbzMuV0FW' | base64 -d

which yields

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV
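
The same round trip can be sketched in Python 3. The helper names below are hypothetical and this is not necessarily how the importer implements it; it only illustrates "Base64 of the URL plus the file name extension".

```python
import base64
import posixpath
from urllib.parse import urlsplit


def base64_cache_name(url):
    """Base64-encode the whole URL and append the original file extension."""
    extension = posixpath.splitext(urlsplit(url).path)[1]
    return base64.b64encode(url.encode("utf-8")).decode("ascii") + extension


def url_from_cache_name(name):
    """Strip the extension again and decode the Base64 part back to the URL."""
    encoded = posixpath.splitext(name)[0]
    return base64.b64decode(encoded).decode("utf-8")


url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV"
name = base64_cache_name(url)
restored = url_from_cache_name(name)
print(name)
print(restored)
```

One caveat worth keeping in mind: the standard Base64 alphabet contains "/" and "+", so for some URLs the encoded name could include a path separator. `base64.urlsafe_b64encode` uses "-" and "_" instead and may be the safer choice for file names.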

erlehmann commented 11 years ago

Fixed as of 1da13d84fb60853bc52d0a31a75147ef60ac4cd9.

Daniel-Mietchen commented 11 years ago

Technically, this bug is solved in that the downloaded videos now get unique file names on our server, but the system you have gone for is not human-readable, which makes it hard to debug, or even to notice or report errors:

Downloading <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov>, saving as “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/aHR0cDovL3d3dy5uY2JpLm5sbS5uaWguZ292L3BtYy9hcnRpY2xlcy9QTUMzNzUxODI2L2Jpbi9wb25lLjAwNzI5MjQuczAwMi5tb3Y=.mov” …

I think it would be more straightforward to go for something like

Downloading <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov>, saving as “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/PMC3751826/pone.0072924.s002.mov” …

If several files from a particular paper have identical names, we could just number them.
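
A rough sketch of this readable scheme follows, assuming the PMCID is the URL path segment that starts with "PMC" and that repeated names get a "-2", "-3", … suffix before the extension. The function name, the numbering format, and the second URL are all made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def readable_cache_paths(urls):
    """Return '<PMCID>/<file name>' for each URL, numbering repeated names."""
    seen = {}
    paths = []
    for url in urls:
        segments = urlsplit(url).path.split("/")
        # Assumption: the PMCID is the path segment that starts with "PMC".
        pmcid = next(s for s in segments if s.startswith("PMC"))
        name = segments[-1]
        seen[(pmcid, name)] = seen.get((pmcid, name), 0) + 1
        count = seen[(pmcid, name)]
        if count > 1:
            # Second and later files with the same name get a numeric suffix.
            stem, extension = posixpath.splitext(name)
            name = "%s-%d%s" % (stem, count, extension)
        paths.append(posixpath.join(pmcid, name))
    return paths


urls = [
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov",
    # Hypothetical second file with the same name in another directory.
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/extra/pone.0072924.s002.mov",
]
paths = readable_cache_paths(urls)
print(paths)
```

Unlike the Base64 names, these paths cannot be mapped back to the source URL without a separate lookup, which is the trade-off for readability.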

erlehmann commented 11 years ago

I propose to use percent-encoded URLs as file names. Percent encoding replaces special characters, but the resulting string is still quite readable: http://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_reserved_characters

Python example:

>>> from urllib import quote, unquote
>>> quote('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov', safe='')
'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3751826%2Fbin%2Fpone.0072924.s002.mov'
>>> unquote('http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3751826%2Fbin%2Fpone.0072924.s002.mov')
'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov'
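
The session above uses the Python 2 import path. In Python 3 the same functions live in `urllib.parse`; a minimal equivalent, with the variable names chosen here only for illustration:

```python
from urllib.parse import quote, unquote

url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov"

# safe="" makes quote() escape "/" as well, so the result is a single file name.
name = quote(url, safe="")
restored = unquote(name)

print(name)
print(restored)
```

Passing `safe=""` matters: by default `quote` leaves "/" unescaped, which would turn the name into a nested path.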

Daniel-Mietchen commented 11 years ago

Fine with me.

erlehmann commented 11 years ago

Using percent encoding as of 3dbf2355e1592b9df7889b5d4640a7cb2fd5fbab.

Daniel-Mietchen commented 11 years ago

This works fine, thanks.