Closed Daniel-Mietchen closed 11 years ago
The files from that example article are now up: http://commons.wikimedia.org/wiki/File:Laminar-firing-and-membrane-dynamics-in-four-visual-areas-exposed-to-two-objects-moving-to-occlusion-Movie1.ogv
Perhaps it's best to always take the filename as it is and then rename it (in the cache directory) to a name that contains the part of the DOI that comes after the slash. If we do this after the upload, the current workflow would not even have to be modified, except in cases where the conversion (cf. https://github.com/erlehmann/open-access-media-importer/issues?labels=GStreamer&page=1&state=open ) or upload (cf. https://github.com/erlehmann/open-access-media-importer/issues/22 ) fails.
For the upload, we can stick to the file name we are using now, since the article title is probably good enough for disambiguation.
For temporary storage, I am going to Base64-encode the URL of the file and append the file name extension. This should be unique and useful enough at the same time.
This proposal means that the local filename for http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV becomes “aHR0cDovL3d3dy5uY2JpLm5sbS5uaWguZ292L3BtYy9hcnRpY2xlcy9QTUMzNjkzMDkwL2Jpbi9BdWRpbzMuV0FW.WAV”. The mapping is unique and reversible – the source URL can be found by
echo 'aHR0cDovL3d3dy5uY2JpLm5sbS5uaWguZ292L3BtYy9hcnRpY2xlcy9QTUMzNjkzMDkwL2Jpbi9BdWRpbzMuV0FW' | base64 -d
which yields
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV.
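The Base64 scheme described above can be sketched in Python 3 roughly as follows. This is only an illustration, not the importer's actual code: the helper names `cache_name_from_url` and `url_from_cache_name` are hypothetical, and the use of the URL-safe Base64 alphabet is an assumption made here because the standard alphabet can produce "/" characters, which cannot appear in a file name.

```python
import base64
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def cache_name_from_url(url: str) -> str:
    """Hypothetical helper: Base64-encode the URL and append its extension."""
    # Assumption: the URL-safe alphabet is used so the result never contains "/".
    encoded = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    # Keep the original extension so the media type stays recognisable.
    extension = PurePosixPath(urlsplit(url).path).suffix
    return encoded + extension


def url_from_cache_name(name: str) -> str:
    """Hypothetical helper: recover the source URL from such a file name."""
    # "." is not part of the Base64 alphabet, so the last dot (if any)
    # separates the encoded URL from the appended extension.
    encoded, dot, _extension = name.rpartition(".")
    if not dot:
        encoded = name
    return base64.urlsafe_b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    source = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693090/bin/Audio3.WAV"
    local_name = cache_name_from_url(source)
    print(local_name)
    print(url_from_cache_name(local_name))
```

For URLs whose encoding contains neither "+" nor "/", the URL-safe and standard alphabets give the same string, so the example URL above yields the name quoted in this comment.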
Fixed as of 1da13d84fb60853bc52d0a31a75147ef60ac4cd9.
Technically, this bug is solved in that the downloaded videos now get unique file names on our server, but the scheme you have gone for is not human-readable, which makes it hard to debug, or even to notice and report errors:
Downloading <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov>, saving as “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/aHR0cDovL3d3dy5uY2JpLm5sbS5uaWguZ292L3BtYy9hcnRpY2xlcy9QTUMzNzUxODI2L2Jpbi9wb25lLjAwNzI5MjQuczAwMi5tb3Y=.mov” …
I think it would be more straightforward to go for something like
Downloading <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov>, saving as “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/PMC3751826/pone.0072924.s002.mov”
And in case several files from a particular paper have identical names, then we could just number them.
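A minimal sketch of this readable layout, assuming the PMCID can always be read from the URL path and that duplicates are numbered by appending "-1", "-2", and so on before the extension. Neither detail is specified in this thread, and the function name `readable_cache_path` is hypothetical.

```python
import re
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit

# Assumption: PMC media URLs contain the article ID as a path segment "/PMC<digits>/".
PMCID_PATTERN = re.compile(r"/(PMC\d+)/")


def readable_cache_path(url: str, cache_dir: Path) -> Path:
    """Map a media URL to <cache_dir>/<PMCID>/<file name>, numbering duplicates."""
    path = urlsplit(url).path
    match = PMCID_PATTERN.search(path)
    # Fall back to a catch-all directory when the URL carries no PMCID.
    pmcid = match.group(1) if match else "unknown"
    original = PurePosixPath(path)
    candidate = cache_dir / pmcid / original.name
    counter = 1
    # Append -1, -2, ... before the extension until the name is unused.
    while candidate.exists():
        candidate = cache_dir / pmcid / f"{original.stem}-{counter}{original.suffix}"
        counter += 1
    return candidate
```

With an empty cache directory, the URL from the log line above would map to `PMC3751826/pone.0072924.s002.mov` below the cache root. Note that this mapping is readable but, unlike the Base64 scheme, not reversible to the full source URL.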
I propose to use percent encoded URLs as file names. Percent encoding replaces special characters, but the resulting string is still quite readable: http://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_reserved_characters
Python example:
>>> from urllib import quote, unquote
>>> quote('http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov', safe='')
'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3751826%2Fbin%2Fpone.0072924.s002.mov'
>>> unquote('http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3751826%2Fbin%2Fpone.0072924.s002.mov')
'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751826/bin/pone.0072924.s002.mov'
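The session above uses the Python 2 module layout. In Python 3 the same functions live in `urllib.parse`, so an equivalent sketch that also builds the cache path might look like this. The helper names and the `cache_dir` parameter are hypothetical and not taken from the importer.

```python
from pathlib import Path
from urllib.parse import quote, unquote


def percent_encoded_cache_path(url: str, cache_dir: Path) -> Path:
    """Hypothetical helper: use the percent-encoded URL as the local file name."""
    # safe="" also encodes "/", so the whole URL becomes a single path component.
    return cache_dir / quote(url, safe="")


def url_from_cache_path(path: Path) -> str:
    """Hypothetical helper: recover the source URL from the cached file name."""
    return unquote(path.name)
```

One caveat with this sketch: very long URLs could exceed the maximum file name length of the file system, which is not handled here.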
Fine with me.
Using percent encoding as of 3dbf2355e1592b9df7889b5d4640a7cb2fd5fbab.
This works fine, thanks.
Most files have a name that is unique enough to be used for duplicate detection in the respective cache directory. Some, however, do not.
I am pasting in an example below. It features a "Movie1.MP4", and since the cache directory already has a "Movie1.MP4.ogv", the duplicate detection assumes the file has already been converted. Alas, it hasn't, and the script then attempts to upload the old file (which is over 300 MB and thus not uploaded, as per https://github.com/erlehmann/open-access-media-importer/issues/22 ). I have no idea how many such false uploads have already happened, but I would suspect on the order of 10.
I have provisionally renamed the original one to Movie1.MP4-306693161.ogv (the number is simply its size in bytes; I did not see an easy way to determine the DOI), but we should check and modify the workflows here, so as to ensure that the files in the cache directories always have unique file names, preferably with a good overlap with the DOI.
So 30. Jun 00:34:34 CEST 2013
doi: 10.3389/fnsys.2013.00023
Removing “/home/danielmietchen/.local/share/open-access-media-importer/pmc_doi.sqlite” … done.
Input DOIs, delimited by whitespace:
Getting PubMed Central IDs for given DOIs … found: 3691547
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3691547”, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################################################################################################|
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processorskey)
“Laminar firing and membrane dynamics in four visual areas exposed to two objects moving to occlusion”: 3 × video/quicktime
Checking MIME types …
DOI 10.3389/fnsys.2013.00023, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4, source claimed video/quicktime but is video/mp4.
ETA: 00:00:00
/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py:463: SAWarning: Unicode type received non-unicode bind param value.
  param.append(processorskey)
3 of 3 100% |###################################################################################################################################| Time: 00:00:04
DOI 10.3389/fnsys.2013.00023, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie3.MPG, source claimed video/quicktime but is video/mpeg.
Downloading http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi” …
100% |#########################################################################################################################################################|
Downloading http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie2.MOV, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi” …
100% |#########################################################################################################################################################|
Downloading http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie3.MPG, saving into directory “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi” …
100% |#########################################################################################################################################################|
Skipping conversion of “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi/Movie1.MP4”, exists at “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv”.
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi/Movie2.MOV”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie2.MOV.ogv” … 9% |############### done.|####################################################################################################################################################### |
Converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_doi/Movie3.MPG”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie3.MPG.ogv” …
report (00:00:05): 10 / 19 seconds (52,6 %)
report (00:00:08): 18 / 19 seconds (94,7 %)
done.
Authenticating with http://commons.wikimedia.org/w/api.php.
^CTraceback (most recent call last):
File "./oa-put", line 111, in
mediawiki.upload(media_refined_path, wiki_filename, page_template)
File "/home/danielmietchen/open-access-media-importer/helpers/mediawiki.py", line 116, in upload
comment = 'Automatically uploaded media file from [[:en:Open access|Open Access]] source. Please report problems or suggestions [[User talk:Open Access Media Importer Bot|here]].'
File "/home/danielmietchen/open-access-media-importer/helpers/wikitools/wikifile.py", line 228, in upload
req = api.APIRequest(self.site, params, write=True, multipart=bool(fileobj))
File "/home/danielmietchen/open-access-media-importer/helpers/wikitools/api.py", line 71, in __init__
self.encodeddata = self.encodeddata + singledata
KeyboardInterrupt
danielmietchen@files:~/open-access-media-importer$ ls -l /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv
-rw-r--r-- 1 danielmietchen danielmietchen 306693161 Jun 6 04:51 /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv
danielmietchen@files:~/open-access-media-importer$ mv /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4.ogv /home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_doi/Movie1.MP4-306693161.ogv
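The collision reported here can be reproduced in isolation. The sketch below compares two naming rules for converted files: one that mirrors the names visible in the log (last path segment plus ".ogv"), and one that keys the name on the whole percent-encoded URL. Whether the importer derives its names exactly this way is an assumption, and the second URL is a made-up article ID used only to provoke the clash.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, urlsplit


def refined_name_by_basename(url: str) -> str:
    """Name the converted file after the last path segment of the URL."""
    return PurePosixPath(urlsplit(url).path).name + ".ogv"


def refined_name_by_url(url: str) -> str:
    """Name the converted file after the whole percent-encoded URL."""
    return quote(url, safe="") + ".ogv"


urls = [
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691547/bin/Movie1.MP4",
    # Hypothetical second article, not a real PMCID from this thread.
    "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/bin/Movie1.MP4",
]

by_basename = {refined_name_by_basename(u) for u in urls}
by_url = {refined_name_by_url(u) for u in urls}

# Two distinct sources collapse to one name under the first rule only.
print(len(by_basename), len(by_url))
```

Under the basename rule both sources map to "Movie1.MP4.ogv", so a check for an existing converted file would skip the second one, which matches the "Skipping conversion" line in the log.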