wpoa / open-access-media-importer

A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons
http://commons.wikimedia.org/wiki/User:Open_Access_Media_Importer_Bot

"Invalid page at offset X" error on Commons #112

Open Daniel-Mietchen opened 10 years ago

Daniel-Mietchen commented 10 years ago

Not sure whether this is a problem on our end, but I'll post it here to be safe: a number of videos have always failed in thumbnail creation, but I have never seen six consecutive files from four different papers affected, as is the case here: https://commons.wikimedia.org/wiki/File:How-Do-Ants-Make-Sense-of-Gravity-A-Boltzmann-Walker-Analysis-of-Lasius-niger-Trajectories-on-pone.0076531.s002.ogv

https://commons.wikimedia.org/wiki/File:Microcomputed-Tomography-with-Diffraction-Enhanced-Imaging-for-Morphologic-Characterization-and-pone.0078176.s002.ogv

https://commons.wikimedia.org/wiki/File:Microcomputed-Tomography-with-Diffraction-Enhanced-Imaging-for-Morphologic-Characterization-and-pone.0078176.s003.ogv

https://commons.wikimedia.org/wiki/File:Microcomputed-Tomography-with-Diffraction-Enhanced-Imaging-for-Morphologic-Characterization-and-pone.0078176.s004.ogv

https://commons.wikimedia.org/wiki/File:Perception-of-Elasticity-in-the-Kinetic-Illusory-Object-with-Phase-Differences-in-Inducer-Motion-pone.0078621.s001.ogv

https://commons.wikimedia.org/wiki/File:Cyclic-Tensile-Strain-Controls-Cell-Shape-and-Directs-Actin-Stress-Fiber-Formation-and-Focal-pone.0077328.s007.ogv

All of them were uploaded on the morning of Nov 13.

Daniel-Mietchen commented 10 years ago

Bug filed at https://bugzilla.wikimedia.org/show_bug.cgi?id=57048 .

Daniel-Mietchen commented 10 years ago

Further details at https://commons.wikimedia.org/w/index.php?title=User_talk:Open_Access_Media_Importer_Bot&diff=next&oldid=109056863 - it seems to be due to frame dimensions not being multiples of 4.

Daniel-Mietchen commented 10 years ago

I stopped the bot for the time being, so that we can avoid uploading more of those incorrectly encoded files before we find out what the problem is.

Daniel-Mietchen commented 10 years ago

Further affected files: https://commons.wikimedia.org/wiki/File:Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv

https://commons.wikimedia.org/wiki/File:Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s003.ogv

https://commons.wikimedia.org/wiki/File:Correlated-Spontaneous-Activity-Persists-in-Adult-Retina-and-Is-Suppressed-by-Inhibitory-Inputs-pone.0077658.s001.ogv

https://commons.wikimedia.org/wiki/File:Generation-of-BAC-Transgenic-Epithelial-Organoids-pone.0076871.s001.ogv

https://commons.wikimedia.org/wiki/File:Generation-of-BAC-Transgenic-Epithelial-Organoids-pone.0076871.s003.ogv

https://commons.wikimedia.org/wiki/File:Perception-of-Elasticity-in-the-Kinetic-Illusory-Object-with-Phase-Differences-in-Inducer-Motion-pone.0078621.s004.ogv

https://commons.wikimedia.org/wiki/File:In-Vivo-Efficacy-of-Compliant-3D-Nano-Composite-in-Critical-Size-Bone-Defect-Repair-a-Six-Month-pone.0077578.s003.ogv

https://commons.wikimedia.org/wiki/File:Emergence-of-Metastable-State-Dynamics-in-Interconnected-Cortical-Networks-with-Propagation-Delays-pcbi.1003304.s011.ogv

Interestingly, a number of other files - sometimes even from the same papers - seem to be unaffected: https://commons.wikimedia.org/wiki/File:Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s001.ogv

https://commons.wikimedia.org/wiki/File:Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s004.ogv

https://commons.wikimedia.org/wiki/File:Generation-of-BAC-Transgenic-Epithelial-Organoids-pone.0076871.s002.ogv

https://commons.wikimedia.org/wiki/File:Ligation-of-Signal-Inhibitory-Receptor-on-Leukocytes-1-Suppresses-the-Release-of-Neutrophil-pone.0078459.s001.ogv

https://commons.wikimedia.org/wiki/File:A-High-Content-Small-Molecule-Screen-Identifies-Sensitivity-of-Glioblastoma-Stem-Cells-to-pone.0077053.s010.ogv

erlehmann commented 10 years ago

I'll try padding with black pixels. Failing that, I'll try scaling.
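
To make the padding idea concrete, here is a minimal sketch that rounds each dimension up to the next multiple of 4 and splits the extra pixels between opposite borders. The helper name `pad_to_multiple` and the choice to centre the picture are assumptions for illustration, not what the importer currently does.

```python
def pad_to_multiple(width, height, multiple=4):
    """Return the padded frame size and the black borders needed to reach it.

    Hypothetical helper: rounds each dimension up to the next multiple and
    centres the original picture inside the padded frame.
    """
    def round_up(value):
        return -(-value // multiple) * multiple  # ceiling division

    padded_w, padded_h = round_up(width), round_up(height)
    extra_w, extra_h = padded_w - width, padded_h - height
    return {
        "width": padded_w,
        "height": padded_h,
        "left": extra_w // 2,
        "right": extra_w - extra_w // 2,
        "top": extra_h // 2,
        "bottom": extra_h - extra_h // 2,
    }
```

A frame whose dimensions are already multiples of 4 comes back unchanged with all borders at zero; a 350 by 285 frame would become 352 by 288.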

erlehmann commented 10 years ago

Data corruption confirmed.

# 08 /tmp 
; wget 'https://upload.wikimedia.org/wikipedia/commons/2/2b/Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv'
--2013-11-18 00:41:35--  https://upload.wikimedia.org/wikipedia/commons/2/2b/Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv
p11-kit: invalid config filename, will be ignored in the future: /etc/pkcs11/modules/gnome-keyring-module
WARNING: gnome-keyring:: couldn't connect to: /home/erlehmann/.cache/keyring-FN4lmc/pkcs11: No such file or directory
p11-kit: failed to initialize module: gnome-keyring-module: An error occurred on the device.
Resolving upload.wikimedia.org (upload.wikimedia.org)... 91.198.174.208, 2620:0:862:ed1a::2:b
Connecting to upload.wikimedia.org (upload.wikimedia.org)|91.198.174.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 182501 (178K) [application/ogg]
Saving to: ‘Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv’
100%[======================================>] 182,501     98.5KB/s   in 1.8s   
2013-11-18 00:41:42 (98.5 KB/s) - ‘Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv’ saved [182501/182501]
# 09 /tmp 
; ogginfo Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv 
Processing file "Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv"...
New logical stream (#1, serial: 3edfb961): type theora
Theora headers parsed for stream 1, information follows...
Version: 3.2.1
Vendor: Xiph.Org libtheora 1.1 20090822 (Thusnelda)
Width: 352
Height: 288
Total image: 352 by 288, crop offset (0, 0)
Framerate 30/1 (30,00 fps)
Pixel aspect ratio 1:1 (1,000000:1)
Frame aspect 1,222222:1
Colourspace unspecified
Pixel format 4:2:0
Target bitrate: 0 kbps
Nominal quality setting (0-63): 48
User comments section follows...
    title=
    album=Expansion of the Gateway MultiSite Recombination Cloning Toolkit
    artist=Shearin H, Dvarishkis A, Kozeluh C, Stowers R
    copyrights=Shearin et al
    license=http://creativecommons.org/licenses/by/3.0/
    description=Accordion behavior is elicited by larva expressing Chr2 T159C-HA under 13XLexAop2 control in class III sensory neurons and chordotonal organs upon blue light stimulation. Genotype: yw; nompC-LexAp65/13XLexAop2-Chr2 T159C-HA. The presence of the blue rectangle indicates larvae are being stimulated with blue light.
    date=2013
WARNING: Hole in data (69 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (24 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (68 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (140 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (19 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (3 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (147 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (214 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (150 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (111 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (359 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (75 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (28 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (401 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (26 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (242 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (12 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (57 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (221 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (73 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (5 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (519 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (455 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (362 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: sequence number gap in stream 1. Got page 8 when expecting page 7. Indicates missing data.
WARNING: discontinuity in stream (1)
WARNING: Expected frame 41, got 54
Theora stream 1:
    Total data length: 175373 bytes
    Playback length: 0m:08.833s
    Average bitrate: 158,828377 kb/s
Logical stream 1 ended
# exited 1
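
For checking many uploads at once, the warnings in a report like the one above could be tallied automatically. The following is only a sketch: `summarize_ogginfo` is a hypothetical helper that works on the text ogginfo prints, and the regular expression assumes the exact English warning wording shown in this transcript.

```python
import re

# Matches lines such as:
# WARNING: Hole in data (69 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
HOLE_RE = re.compile(r"WARNING: Hole in data \((\d+) bytes\)")


def summarize_ogginfo(report):
    """Summarise the warnings found in the text output of ogginfo."""
    holes = [int(m.group(1)) for m in HOLE_RE.finditer(report)]
    other = [
        line
        for line in report.splitlines()
        if line.startswith("WARNING:") and "Hole in data" not in line
    ]
    return {"holes": len(holes), "hole_bytes": sum(holes), "other_warnings": other}
```

The report text could be obtained by running `ogginfo` on a file and capturing its standard output; a file with no holes and no other warnings yields zeros and an empty list.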
erlehmann commented 10 years ago

Strangely, the oggz tools find no fault with the file.

# 0a /tmp 
; oggz-info Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv 
Content-Duration: 00:00:08.800
Theora: serialno 1054849377
    255 packets in 37 pages, 6.9 packets/page, 1.020% Ogg overhead
    Theora-Version: 3.2.1
    Video-Framerate: 30.000 fps
    Video-Width: 352
    Video-Height: 288
# 0b /tmp 
; oggz-validate Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv 
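
Since the two tools disagree, an independent cross-check may help. The sketch below walks the Ogg page headers directly (layout as in RFC 3533) and reports bytes that lie between pages as well as jumps in the page sequence number, which are the two symptoms ogginfo complained about. It is a simplified illustration: it does not verify page CRCs, it can be fooled by a stray `OggS` inside a hole, and `make_page` and `scan_ogg` are hypothetical names.

```python
import struct

# Ogg page header (RFC 3533): capture pattern, version, header type,
# granule position, stream serial, page sequence number, CRC, segment count.
HEADER = struct.Struct("<4sBBqIIIB")  # 27 bytes


def make_page(serial, seq, payload):
    """Build a minimal single-segment page for demonstration (CRC left at 0)."""
    assert len(payload) < 255
    header = HEADER.pack(b"OggS", 0, 0, 0, serial, seq, 0, 1)
    return header + bytes([len(payload)]) + payload


def scan_ogg(data):
    """Return (holes, gaps) found while walking the pages in data.

    holes: list of (offset, length) for bytes that belong to no page.
    gaps:  list of (serial, expected_seq, got_seq) for sequence jumps.
    """
    holes, gaps = [], []
    last_seq = {}
    pos = 0
    while pos < len(data):
        start = data.find(b"OggS", pos)
        if start == -1:
            holes.append((pos, len(data) - pos))
            break
        if start > pos:
            holes.append((pos, start - pos))
        if start + HEADER.size > len(data):
            break  # truncated header at end of file
        _, _, _, _, serial, seq, _, nsegs = HEADER.unpack_from(data, start)
        table_start = start + HEADER.size
        seg_table = data[table_start:table_start + nsegs]
        if serial in last_seq and seq != last_seq[serial] + 1:
            gaps.append((serial, last_seq[serial] + 1, seq))
        last_seq[serial] = seq
        # Page length = header + segment table + sum of the lacing values.
        pos = table_start + nsegs + sum(seg_table)
    return holes, gaps
```

Running `scan_ogg(open(path, "rb").read())` on the downloaded file and on a locally converted one would show whether the holes are present in the bytes themselves or are an artefact of one tool.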
erlehmann commented 10 years ago

oggThumb fails on the first frame. However, it does not fail on every frame (frame 10 is fine, frame 100 is not):

# 10 /tmp
; oggThumb -f1 Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv 
Creating thumbs under the following option:
Frame numbers: 1  
file type: .jpg
The following ogg media files will be used: Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv 
Info:
Theora Decoder Configuration:
Theora Version   : 3.2.1
Video Size       : 352 x 288
Keyframe Shift   : 64 frames 
Aspect Ratio     : 1 : 1
Framerate        : 30 / 1
Quality          : 48 / 64
Datarate         : 0
Pixel Format     : 420 (Chroma decimination by 2 in both directions)
Colorspace       : unspecified
width: 352 and height: 288
 0.8     Fatal error: OggRingbuffer::getNextPageLength: ERROR ogg packet not aligned
# exited 255

erlehmann commented 10 years ago

Media conversion works on my laptop. Odd.

; ./oami-converter-test.py 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0077724.s002'
Getting file from , writing to “oami-gstreamer-test-input” … done.
Setting up Media helper for “oami-gstreamer-test-input”… done.
Attempting finding streams of “oami-gstreamer-test-input” … done.
Attempting conversion of “oami-gstreamer-test-input”, writing into “oami-gstreamer-test-output” …   3% |##                                                      done.|#######################################################################  |
# 16 ~/src/open-access-media-importer ~? master(2013.1-62-g67abe95) = origin/master
; file oami-gstreamer-test-input 
oami-gstreamer-test-input: ISO Media, MPEG v4 system, version 2
# 17 ~/src/open-access-media-importer ~? master(2013.1-62-g67abe95) = origin/master
; file oami-gstreamer-test-output 
oami-gstreamer-test-output: Ogg data, Theora video
# 18 ~/src/open-access-media-importer ~? master(2013.1-62-g67abe95) = origin/master
; ogginfo oami-gstreamer-test-output 
Processing file "oami-gstreamer-test-output"...
New logical stream (#1, serial: 35025ff0): type theora
Theora headers parsed for stream 1, information follows...
Version: 3.2.1
Vendor: Xiph.Org libtheora 1.1 20090822 (Thusnelda)
Width: 352
Height: 288
Total image: 352 by 288, crop offset (0, 0)
Framerate 30/1 (30,00 fps)
Pixel aspect ratio 1:1 (1,000000:1)
Frame aspect 1,222222:1
Colourspace unspecified
Pixel format 4:2:0
Target bitrate: 0 kbps
Nominal quality setting (0-63): 48
Theora stream 1:
    Total data length: 154978 bytes
    Playback length: 0m:07.733s
    Average bitrate: 160,322069 kb/s
Logical stream 1 ended
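
The local conversion reports 154978 bytes and 0m:07.733s, whereas the copy downloaded from Commons reported 175373 bytes and 0m:08.833s. A small sketch for pulling those two figures out of ogginfo reports so they can be compared side by side; `stream_summary` is a hypothetical helper and assumes the summary wording shown in these transcripts.

```python
import re


def stream_summary(report):
    """Extract total data length (bytes) and playback length (seconds)."""
    length = re.search(r"Total data length: (\d+) bytes", report)
    # Accept both "." and "," as decimal separator, depending on locale.
    playback = re.search(r"Playback length: (\d+)m:(\d+)[.,](\d+)s", report)
    if not length or not playback:
        raise ValueError("no Theora stream summary found")
    minutes, seconds, fraction = playback.groups()
    return {
        "bytes": int(length.group(1)),
        "seconds": int(minutes) * 60 + float(seconds + "." + fraction),
    }
```

Applied to the two reports in this thread, the Commons copy comes out 20395 bytes larger and about 1.1 seconds longer than the local conversion.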
erlehmann commented 10 years ago

Test conversion using my laptop:

# 20 ~/src/open-access-media-importer ~? master(2013.1-62-g67abe95) = origin/master
; echo '10.1371/journal.pone.0077724' | ./oami_pmc_doi_import
Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3799639
Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3799639”, saving into directory “/home/erlehmann/.cache/open-access-media-importer/metadata/raw/pmc_doi” …
100% |#########################################################################|
Skipping 2 records … 
Checking MIME types …
No materials found where MIME type has to be checked.
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4” …
100% |#########################################################################|
Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4” …
100% |#########################################################################|
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi.ogg” …   4% |###                             done.|######################################################################## |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi.ogg” …   4% |###                             done.|######################################################################## |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4.ogg” …   1% |#                               done.|######################################################################## |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg” …   3% |##                              done.|######################################################################## |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4.ogg” …   4% |###                             done.|######################################################################   |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4.ogg” …   3% |##                              done.|#######################################################################  |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4.ogg” …   2% |#                               done.|######################################################################## |
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4.ogg” …   2% |##                              done.|######################################################################## |
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi.ogg” uploaded to .
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi.ogg” uploaded to .
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4.ogg” uploaded to .
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg” uploaded to .
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4.ogg” uploaded to .
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4.ogg” uploaded to .
Authenticating with .
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4.ogg” uploaded to .
Authenticating with .
Throttled
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4.ogg” uploaded to .
# exited 0 0
erlehmann commented 10 years ago

I am unable to reproduce the bug using my laptop:

# 26 /tmp 
; wget http://species-id.net/w/media/2/2b/Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv
--2013-11-18 03:28:08--  http://species-id.net/w/media/2/2b/Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv
Resolving species-id.net (species-id.net)... 160.45.63.55
Connecting to species-id.net (species-id.net)|160.45.63.55|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158326 (155K) [video/ogg]
Saving to: ‘Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1’
100%[======================================>] 158,326     95.7KB/s   in 1.6s   
2013-11-18 03:28:11 (95.7 KB/s) - ‘Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1’ saved [158326/158326]
# 27 /tmp 
; ogginfo Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1
Processing file "Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1"...
New logical stream (#1, serial: 1e48ea22): type theora
Theora headers parsed for stream 1, information follows...
Version: 3.2.1
Vendor: Xiph.Org libtheora 1.1 20090822 (Thusnelda)
Width: 352
Height: 288
Total image: 352 by 288, crop offset (0, 0)
Framerate 30/1 (30,00 fps)
Pixel aspect ratio 1:1 (1,000000:1)
Frame aspect 1,222222:1
Colourspace unspecified
Pixel format 4:2:0
Target bitrate: 0 kbps
Nominal quality setting (0-63): 48
User comments section follows...
    title=
    album=Expansion of the Gateway MultiSite Recombination Cloning Toolkit
    artist=Shearin H, Dvarishkis A, Kozeluh C, Stowers R
    copyrights=Shearin et al
    license=http://creativecommons.org/licenses/by/3.0/
    description=Accordion behavior is elicited by larva expressing Chr2 T159C-HA under 13XLexAop2 control in class III sensory neurons and chordotonal organs upon blue light stimulation. Genotype: yw; nompC-LexAp65/13XLexAop2-Chr2 T159C-HA. The presence of the blue rectangle indicates larvae are being stimulated with blue light.
    date=2013
Theora stream 1:
    Total data length: 154978 bytes
    Playback length: 0m:07.733s
    Average bitrate: 160,322069 kb/s
Logical stream 1 ended
# 28 /tmp 
; oggz-info Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1
Content-Duration: 00:00:07.700
Theora: serialno 0508095010
    235 packets in 34 pages, 6.9 packets/page, 1.036% Ogg overhead
    Theora-Version: 3.2.1
    Video-Framerate: 30.000 fps
    Video-Width: 352
    Video-Height: 288
# 29 /tmp 
; oggThumb -f1 Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1 
Creating thumbs under the following option:
Frame numbers: 1  
file type: .jpg
The following ogg media files will be used: Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv.1 
Info:
Theora Decoder Configuration:
Theora Version   : 3.2.1
Video Size       : 352 x 288
Keyframe Shift   : 64 frames 
Aspect Ratio     : 1 : 1
Framerate        : 30 / 1
Quality          : 48 / 64
Datarate         : 0
Pixel Format     : 420 (Chroma decimination by 2 in both directions)
Colorspace       : unspecified
width: 352 and height: 288
 7.73333  
Daniel-Mietchen commented 10 years ago

So it could be a server problem?

On Mon, Nov 18, 2013 at 3:25 AM, Nils Dagsson Moskopp < notifications@github.com> wrote:

Test conversion using my laptop:

20 ~/src/open-access-media-importer ~? master(2013.1-62-g67abe95) = origin/master

; echo '10.1371/journal.pone.0077724' | ./oami_pmc_doi_import Input DOIs, delimited by whitespace: Getting PubMed Central IDs for given DOIs … found: 3799639 Downloading “http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=3799639”, saving into directory “/home/erlehmann/.cache/open-access-media-importer/metadata/raw/pmc_doi” … 100% |#########################################################################| Skipping 2 records … Checking MIME types … No materials found where MIME type has to be checked. Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi” … 100% |#########################################################################| Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi” … 100% |#########################################################################| Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4” … 100% |#########################################################################| Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4” … 100% |#########################################################################| Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4” … 100% |#########################################################################| Downloading , saving as 
“/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4” … 100% |#########################################################################| Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4” … 100% |#########################################################################| Downloading , saving as “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4” … 100% |#########################################################################| Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi.ogg” … 4% |### done.|######################################################################## | Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi.ogg” … 4% |### done.|######################################################################## | Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4”, saving into 
“/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4.ogg” … done.
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg” … done.
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4.ogg” … done.
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4.ogg” … done.
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4.ogg” … done.
Converting “/home/erlehmann/.cache/open-access-media-importer/media/raw/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4”, saving into “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4.ogg” … done.
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s003.avi.ogg” uploaded to .
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3561373%2Fbin%2Fpone.0053183.s004.avi.ogg” uploaded to .
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4.ogg” uploaded to .
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg” uploaded to .
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4.ogg” uploaded to .
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4.ogg” uploaded to .
Authenticating with . “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4.ogg” uploaded to .
Authenticating with . Throttled “/home/erlehmann/.cache/open-access-media-importer/media/refined/pmc_doi/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4.ogg” uploaded to .

exited 0 0


erlehmann commented 10 years ago

Verifying that the OAMI server and my machine had the same input:

# 0b ~/.cache/open-access-media-importer/media/raw/pmc_doi 
; for (f in `{ls | grep 0077724}) {md5sum $f}
080baf63789d94f75a6775d606e2c9d4  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4
00d4206a6997efddd13cba39d8b57c97  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4
367acee714f4ea65f05ad0b9abcbf406  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4
fe58259f70d16a100105998013053cc8  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4
23f02f7b3bc2b93a9020cac70e741296  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4
cac2956754488d0f0f91aec6bc4a6ec3  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4
erlehmann@files:/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid$ for f in $(ls | grep 0077724); do md5sum $f; done
080baf63789d94f75a6775d606e2c9d4  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s001.mp4
00d4206a6997efddd13cba39d8b57c97  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4
367acee714f4ea65f05ad0b9abcbf406  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s003.mp4
fe58259f70d16a100105998013053cc8  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s004.mp4
23f02f7b3bc2b93a9020cac70e741296  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s005.mp4
cac2956754488d0f0f91aec6bc4a6ec3  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s006.mp4
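
The same check can be scripted if the two directories are reachable from one machine (for example after copying one of them over). The following is a minimal Python 3 sketch, not part of the OAMI code base, which the traceback in this thread shows running on Python 2.7; the function names `md5_of_file`, `checksums` and `mismatches` are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(directory, needle):
    """Map file name -> MD5 for files in `directory` whose name contains `needle`."""
    return {
        entry.name: md5_of_file(entry)
        for entry in sorted(Path(directory).iterdir())
        if entry.is_file() and needle in entry.name
    }


def mismatches(first_dir, second_dir, needle):
    """Return names whose checksum differs or that exist on one side only."""
    first = checksums(first_dir, needle)
    second = checksums(second_dir, needle)
    return sorted(
        name for name in set(first) | set(second)
        if first.get(name) != second.get(name)
    )
```

An empty list from `mismatches(dir_a, dir_b, "0077724")` would correspond to the matching checksum listings above.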
erlehmann commented 10 years ago

Testing conversion on the server:

erlehmann@files:~/open-access-media-importer$ ./oami-converter-test.py 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0077724.s002'
Getting file from , writing to “oami-gstreamer-test-input” … done.
Setting up Media helper for “oami-gstreamer-test-input”… done.
Attempting finding streams of “oami-gstreamer-test-input” … done.
Attempting conversion of “oami-gstreamer-test-input”, writing into “oami-gstreamer-test-output” …   1% |                                                        done.|######################################################################## |

Checking validity locally:

# 11 /tmp 
; scp 'erlehmann@files.mi.ur.de:/home/erlehmann/open-access-media-importer/oami-gstreamer-test-output' .
oami-gstreamer-test-output                    100%  154KB  77.0KB/s   00:02    
# 12 /tmp 
; file oami-gstreamer-test-output 
oami-gstreamer-test-output: Ogg data, Theora video
# 13 /tmp 
; ogginfo oami-gstreamer-test-output 
Processing file "oami-gstreamer-test-output"...
New logical stream (#1, serial: 16ec137e): type theora
Theora headers parsed for stream 1, information follows...
Version: 3.2.1
Vendor: Xiph.Org libtheora 1.1 20090822 (Thusnelda)
Width: 352
Height: 288
Total image: 352 by 288, crop offset (0, 0)
Framerate 30/1 (30,00 fps)
Pixel aspect ratio 1:1 (1,000000:1)
Frame aspect 1,222222:1
Colourspace unspecified
Pixel format 4:2:0
Target bitrate: 0 kbps
Nominal quality setting (0-63): 48
Theora stream 1:
    Total data length: 154978 bytes
    Playback length: 0m:07.733s
    Average bitrate: 160,322069 kb/s
Logical stream 1 ended
# 14 /tmp 
; oggz info oami-gstreamer-test-output 
Content-Duration: 00:00:07.700
Theora: serialno 0384570238
    235 packets in 34 pages, 6.9 packets/page, 1.039% Ogg overhead
    Theora-Version: 3.2.1
    Video-Framerate: 30.000 fps
    Video-Width: 352
    Video-Height: 288
erlehmann commented 10 years ago

Confirmed that the file on the server is the same as the uploaded file.

# 15 /tmp 
; scp 'erlehmann@files.mi.ur.de:/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg' 'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg'
http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Fart 100%  178KB  89.1KB/s   00:02    
# 16 /tmp 
; file 'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg'
http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg: Ogg data, Theora video
# 17 /tmp 
; ogginfo 'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg'
Processing file "http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg"...
New logical stream (#1, serial: 3edfb961): type theora
Theora headers parsed for stream 1, information follows...
Version: 3.2.1
Vendor: Xiph.Org libtheora 1.1 20090822 (Thusnelda)
Width: 352
Height: 288
Total image: 352 by 288, crop offset (0, 0)
Framerate 30/1 (30,00 fps)
Pixel aspect ratio 1:1 (1,000000:1)
Frame aspect 1,222222:1
Colourspace unspecified
Pixel format 4:2:0
Target bitrate: 0 kbps
Nominal quality setting (0-63): 48
User comments section follows...
    title=
    album=Expansion of the Gateway MultiSite Recombination Cloning Toolkit
    artist=Shearin H, Dvarishkis A, Kozeluh C, Stowers R
    copyrights=Shearin et al
    license=http://creativecommons.org/licenses/by/3.0/
    description=Accordion behavior is elicited by larva expressing Chr2 T159C-HA under 13XLexAop2 control in class III sensory neurons and chordotonal organs upon blue light stimulation. Genotype: yw; nompC-LexAp65/13XLexAop2-Chr2 T159C-HA. The presence of the blue rectangle indicates larvae are being stimulated with blue light.
    date=2013
WARNING: Hole in data (69 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (24 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (68 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (140 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (19 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (3 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (147 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (214 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (150 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (111 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (359 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (75 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (28 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (401 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (26 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (242 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (12 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (57 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (221 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (73 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (5 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (519 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (455 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: Hole in data (362 bytes) found at approximate offset 36000 bytes. Corrupted Ogg.
WARNING: sequence number gap in stream 1. Got page 8 when expecting page 7. Indicates missing data.
WARNING: discontinuity in stream (1)
WARNING: Expected frame 41, got 54
Theora stream 1:
    Total data length: 175373 bytes
    Playback length: 0m:08.833s
    Average bitrate: 158,828377 kb/s
Logical stream 1 ended
# exited 1
# 18 /tmp 
; oggz-info 'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg'
Content-Duration: 00:00:08.800
Theora: serialno 1054849377
    255 packets in 37 pages, 6.9 packets/page, 1.020% Ogg overhead
    Theora-Version: 3.2.1
    Video-Framerate: 30.000 fps
    Video-Width: 352
    Video-Height: 288
# 19 /tmp 
; md5sum 'http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg'
29493a330b07d7d60c69ae6cffeaa107  http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC3799639%2Fbin%2Fpone.0077724.s002.mp4.ogg
# 1a /tmp 
; md5sum 'Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv'
29493a330b07d7d60c69ae6cffeaa107  Expansion-of-the-Gateway-MultiSite-Recombination-Cloning-Toolkit-pone.0077724.s002.ogv

This means the problem is within conversion or storage. Is the server file system known to be consistent?
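
To catch this kind of damage before upload, the page framing of a converted file can be checked without any external tools. The sketch below is a rough Python 3 illustration under stated assumptions: it walks the 27-byte Ogg page headers as laid out in RFC 3533 and reports framing problems and page sequence gaps. It does not verify the page CRC, it stops at the first framing error instead of resynchronising like `ogginfo` does, and it is not the check Commons performs; `scan_ogg_pages` and its message wording are invented for this example.

```python
import struct

# Fixed part of an Ogg page header (RFC 3533): capture pattern, version,
# header type, granule position, serial number, page sequence number,
# CRC, number of segments. 27 bytes in total, little-endian.
HEADER = struct.Struct("<4sBBqIIIB")


def scan_ogg_pages(data):
    """Walk Ogg pages in `data` (bytes) and return a list of problem strings.

    Only the framing is checked; the CRC field is ignored.
    """
    problems = []
    expected = {}  # serial number -> next expected page sequence number
    offset = 0
    while offset < len(data):
        if len(data) - offset < HEADER.size:
            problems.append("truncated header at offset %d" % offset)
            break
        capture, version, _flags, _granule, serial, seq, _crc, nsegs = \
            HEADER.unpack_from(data, offset)
        if capture != b"OggS" or version != 0:
            problems.append("invalid page at offset %d" % offset)
            break
        table_start = offset + HEADER.size
        lacing = data[table_start:table_start + nsegs]
        end = table_start + nsegs + sum(lacing)
        if len(lacing) < nsegs or end > len(data):
            problems.append("truncated page at offset %d" % offset)
            break
        if serial in expected and seq != expected[serial]:
            problems.append(
                "sequence gap in stream %08x: got page %d, expected %d"
                % (serial, seq, expected[serial]))
        expected[serial] = seq + 1
        offset = end
    return problems
```

Calling `scan_ogg_pages(open(path, "rb").read())` on a converted file and refusing to upload when the list is non-empty would be one way to use it.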

Daniel-Mietchen commented 10 years ago

Do we use the newest version of ffmpeg2theora? See https://commons.wikimedia.org/w/index.php?title=User_talk:Open_Access_Media_Importer_Bot&oldid=110310817#mod_4 .

erlehmann commented 10 years ago

No, we don't use ffmpeg2theora at all – we use GStreamer. http://en.wikipedia.org/wiki/GStreamer

erlehmann commented 10 years ago

Also, looking at the evidence, I do not believe that sizing mod 4 is an issue.

Daniel-Mietchen commented 10 years ago

Seems to have been due to https://github.com/erlehmann/open-access-media-importer/issues/113 , which is fixed now. Closing.

Daniel-Mietchen commented 10 years ago

Reopening, since two more cases came in recently, and with the new cron job, the database does not seem affected, so there must be other reasons for the problem.

https://commons.wikimedia.org/wiki/File:Concerted-Spatio-Temporal-Dynamics-of-Imported-DNA-and-ComE-DNA-Uptake-Protein-during-Gonococcal-ppat.1004043.s009.ogv (Invalid Ogg file: Cannot decode Ogg file: Invalid page at offset 4522)

https://commons.wikimedia.org/wiki/File:Automated-High-Throughput-Quantification-of-Mitotic-Spindle-Positioning-from-DIC-Movies-of-pone.0093718.s010.ogv (Invalid Ogg file: Cannot decode Ogg file: Invalid page at offset 2359789)

https://commons.wikimedia.org/wiki/File:Proper-Actin-Ring-Formation-and-Septum-Constriction-Requires-Coordinated-Regulation-of-SIN-and-MOR-pgen.1004306.s009.ogv (Invalid Ogg file: Cannot decode Ogg file: Invalid page at offset 43196)

https://commons.wikimedia.org/wiki/File:Glutamate-Bound-NMDARs-Arising-from-In-Vivo-like-Network-Activity-Extend-Spatio-temporal-pcbi.1003590.s010.ogv (Invalid Ogg file: Cannot decode Ogg file: Invalid page at offset 1885259)

Some more examples are listed at http://www.webcitation.org/6PKCYzZT4 .

Daniel-Mietchen commented 10 years ago

@RaphaelWimmer do you have an idea whether the issue may be on the server side, during conversion or storage?

Daniel-Mietchen commented 10 years ago

@erlehmann @RaphaelWimmer we are currently getting this error for several files a day, which is not sustainable to fix on the wiki end. Can you please take a closer look? If the issue is not resolved by the end of the week, I will stop the bot until it is.

Daniel-Mietchen commented 10 years ago

Examples are available from https://commons.wikimedia.org/wiki/Category:Videos_without_thumbnails , unless someone has fixed and reuploaded those files already.

RaphaelWimmer commented 10 years ago

I'll have a look tomorrow.

RaphaelWimmer commented 10 years ago

Short update:

I can NOT manually reproduce the corruption bug on the OAMI server using a simplified test case (convert file via helpers/media.py and add metadata using a mutagen code snippet from oa-cache). Running the same script locally on my laptop results in more or less the same file as on the server (text case in tags being different due to different mutagen versions; for some files, different libav versions on server/laptop result in different but valid files). Therefore, I would assume that GStreamer is probably not to blame.

However, at least for one test case [1], the (error-free) version generated manually differs from the one uploaded to WMC in an interesting way:

a) Filesize: 156544 (good) vs. 340049 (bad)
b) The first half of the bad file is nearly binary identical to the good file. They differ in only three places. Interestingly, the differing blocks have the following lengths: 16384 Bytes, 12288 Bytes, 4096 Bytes - all multiples of 4096.
c) From 0x1d000 (Byte 118784) on, the two files differ completely.

Apparently, in multiple cases individual 4096-Byte blocks of the file have been replaced with garbage or blocks from other files. This must have happened either on storing to disk at our server, during transfer (unlikely), or on storing to disk at WMC. Our server's disk partition has a blocksize of 4096 (which is a common blocksize for ext3 filesystems and not yet a smoking gun). The next step (tomorrow) would be to hook up another disk to the oami server and see if the errors persist. If they do not go away, it's probably either a race condition or something similar in our code (unlikely) or a problem on WMC's side.

[1] https://commons.wikimedia.org/wiki/File:Solitary-accessory-and-papillary-muscle-hypertrophy-manifested-as-dynamic-mid-wall-obstruction-and-1471-2261-14-34-S2.ogv
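
A comparison like the one described above can be reproduced with a few lines of Python 3. This is only a sketch of the idea, not the script that was actually used; `differing_block_runs` is a made-up name, and the 4096-byte block size is taken from the observation above.

```python
def differing_block_runs(good, bad, block_size=4096):
    """Compare two byte strings block by block.

    Returns a list of (start_offset, length_in_bytes) runs in which the
    blocks differ, covering only the range that both inputs share.
    """
    runs = []
    run_start = None
    common = min(len(good), len(bad))
    for offset in range(0, common, block_size):
        same = good[offset:offset + block_size] == bad[offset:offset + block_size]
        if not same and run_start is None:
            run_start = offset          # a new run of differing blocks begins
        elif same and run_start is not None:
            runs.append((run_start, offset - run_start))
            run_start = None
    if run_start is not None:
        # The inputs still differ at the end of the shared range.
        runs.append((run_start, common - run_start))
    return runs
```

With `pathlib.Path(name).read_bytes()` for both files, run lengths that are all multiples of 4096 would point towards block-level damage rather than an encoder difference.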

Daniel-Mietchen commented 10 years ago

I have just stopped the bot, since the problem had persisted till today.

erlehmann commented 10 years ago

Raphael, do we still have the files which were converted on the server? If so, we could likely eliminate one theory regarding where the error happened.

Daniel-Mietchen commented 10 years ago

Some more cases: https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S10.ogv https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S9.ogv https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S8.ogv https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S5.ogv https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S4.ogv

https://commons.wikimedia.org/wiki/File:Dynamic-mechanisms-of-neuroligin-dependent-presynaptic-terminal-assembly-in-living-cortical-neurons-1749-8104-9-13-S1.ogv

notconfusing commented 10 years ago

Yeah that's odd. I see that if you click on the files they work. Maybe this is a bug for Commons Bugzilla?

Max Klein ‽ http://notconfusing.com/

On Mon, Jun 23, 2014 at 7:18 AM, Daniel Mietchen notifications@github.com wrote:

Some more cases:

https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S10.ogv

https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S9.ogv https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S8.ogv

https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S5.ogv

https://commons.wikimedia.org/wiki/File:Pinpointing-retrovirus-entry-sites-in-cells-expressing-alternatively-spliced-receptor-isoforms-by-1742-4690-11-47-S4.ogv

https://commons.wikimedia.org/wiki/File:Dynamic-mechanisms-of-neuroligin-dependent-presynaptic-terminal-assembly-in-living-cortical-neurons-1749-8104-9-13-S1.ogv


Daniel-Mietchen commented 10 years ago

We did file this one on Bugzilla, but it was concluded that the problem was likely on our end back then. Not sure for now. @RaphaelWimmer ? @erlehmann ?

Daniel-Mietchen commented 10 years ago

One more: https://commons.wikimedia.org/wiki/File:A-Digital-Framework-to-Build-Visualize-and-Analyze-a-Gene-Expression-Atlas-with-Cellular-Resolution-pcbi.1003670.s027.ogv .

Daniel-Mietchen commented 10 years ago

Another: https://commons.wikimedia.org/wiki/File:Computational-Model-of-Erratic-Arrhythmias-in-a-Cardiac-Cell-Network-The-Role-of-Gap-Junctions-pone.0100288.s001.ogv .

Daniel-Mietchen commented 10 years ago

Some more: https://commons.wikimedia.org/wiki/File:An-Expanded-Notch-Delta-Model-Exhibiting-Long-Range-Patterning-and-Incorporating-MicroRNA-Regulation-pcbi.1003655.s007.ogv https://commons.wikimedia.org/wiki/File:A-Digital-Framework-to-Build-Visualize-and-Analyze-a-Gene-Expression-Atlas-with-Cellular-Resolution-pcbi.1003670.s028.ogv https://commons.wikimedia.org/wiki/File:Foldscope-Origami-Based-Paper-Microscope-pone.0098781.s009.ogv

Daniel-Mietchen commented 10 years ago

Some more: https://commons.wikimedia.org/wiki/File:Lin-28-Regulates-Oogenesis-and-Muscle-Formation-in-Drosophila-melanogaster-pone.0101141.s001.ogv https://commons.wikimedia.org/wiki/File:Use-of-High-Frequency-Ultrasound-to-Monitor-Cervical-Lymph-Node-Alterations-in-Mice-pone.0100185.s004.ogv https://commons.wikimedia.org/wiki/File:Gliding-Swifts-Attain-Laminar-Flow-over-Rough-Wings-pone.0099901.s004.ogv

Daniel-Mietchen commented 10 years ago

Some more: https://commons.wikimedia.org/wiki/File:Rostro-Caudal-Inhibition-of-Hindlimb-Movements-in-the-Spinal-Cord-of-Mice-pone.0100865.s004.ogv https://commons.wikimedia.org/wiki/File:Photoactivated-Localization-Microscopy-with-Bimolecular-Fluorescence-Complementation-%28BiFC-PALM%29-pone.0100589.s011.ogv https://commons.wikimedia.org/wiki/File:MELK-is-an-oncogenic-kinase-essential-for-mitotic-progression-in-basal-like-breast-cancer-cells-elife01763v002.ogv https://commons.wikimedia.org/wiki/File:Neuronal-connectome-of-a-sensory-motor-circuit-for-visual-navigation-elife02730v002.ogv

Daniel-Mietchen commented 10 years ago

One more: https://commons.wikimedia.org/wiki/File:Predator-Prey-Interactions-between-Shell-Boring-Beetle-Larvae-and-Rock-Dwelling-Land-Snails-pone.0100366.s008.ogv

notconfusing commented 10 years ago

can @erlehmann fix this?

Max Klein ‽ http://notconfusing.com/

On Wed, Jul 2, 2014 at 5:05 AM, Daniel Mietchen notifications@github.com wrote:

One more:

https://commons.wikimedia.org/wiki/File:Predator-Prey-Interactions-between-Shell-Boring-Beetle-Larvae-and-Rock-Dwelling-Land-Snails-pone.0100366.s008.ogv


RaphaelWimmer commented 10 years ago

I'm currently looking into this issue, concentrating on the last file Daniel mentioned - called s008 from now on and its 'brother' s009 from the same publication.

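Observations:

- s008 is bad, s009 is ok
- s008 actually contains the title screen and a part of the clip of s009 at the beginning, then comes garbage. This probably means that s008 is a corrupted file that consists of parts of s009 and other stuff.
- both files have the same MD5 checksums on the OAMI server and on Wikimedia Commons, i.e. no error occurred during transfer from OAMI server to WMC.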

The log file (oami_pmc_pmcid_import.log) shows that the bot has tried to convert s008 and s009 on 2014-06-29 and 2014-06-30 but failed on conversion (we should definitely clean up the logging, so much garbage):

converting “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC4070943%2Fbin%2Fpone.0100366.s008.wmv”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC4070943%2Fbin%2Fpone.0100366.s008.wmv.ogg” … 0%
Skipping conversion of “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC4070943%2Fbin%2Fpone.0100366.s008.wmv”, earlier attempt failed.

  File "./oa-cache", line 146, in <module>
    f = mutagen.oggtheora.OggTheora(temporary_media_path)
  File "/usr/lib/python2.7/dist-packages/mutagen/__init__.py", line 73, in __init__
    self.load(filename, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/mutagen/ogg.py", line 438, in load
    fileobj = file(filename, "rb")
IOError: [Errno 2] No such file or directory: '/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/current.ogg'

On 2014-07-01, the bot nevertheless uploaded s008, skipping source download and conversion, apparently uploading the corrupted result of the failed conversion.

I do not know the codebase well enough to be able to say why current.ogg could not be found, and whether this is cause or result of some file corruption.

Interestingly, I also saw two different SQL errors in the log - "no such table" and "disk I/O error". While I doubt it, I cannot rule out a strange disk error as the cause of the bugs. We finally received additional hard drives for this server but the connectors have not yet arrived. As soon as possible, we will move /home to the new disks.

Nevertheless, I consider it a bug that the bot apparently does not delete the output files if conversion failed. This might cause (some of the) problems, as the bot seems to just upload the corrupted file on the next run, without re-trying conversion (because the converted file already exists).

I guess that all corrupt files so far have failed conversion for some strange reason but were uploaded nevertheless, instead of re-trying the conversion first.
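
One way to make sure a failed conversion never leaves an uploadable file behind is to convert into a temporary name and rename only after validation. The following Python 3 sketch is an assumption about how that could look, not the current oa-cache behaviour; `convert` and `validate` are hypothetical callables standing in for the GStreamer conversion and an Ogg sanity check, and the `.part` suffix is arbitrary.

```python
import os


def convert_safely(convert, validate, source_path, target_path):
    """Convert into a temporary file and publish it only if it validates.

    `convert(source_path, tmp_path)` and `validate(tmp_path)` are
    hypothetical callables; both are expected to raise on failure.
    """
    tmp_path = target_path + ".part"
    try:
        convert(source_path, tmp_path)
        validate(tmp_path)
    except Exception:
        # Never leave a half-written file where a later run could pick it up.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Rename only after validation; os.replace is atomic when source and
    # target are on the same filesystem.
    os.replace(tmp_path, target_path)
    return target_path
```

With this shape, the existence of the target file implies that a conversion finished and passed validation, so "file exists, skip conversion" becomes a safe shortcut.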

@erlehmann - could you have a look into this behavior and confirm whether my assumptions about the code are right?

erlehmann commented 10 years ago

Raphael Wimmer notifications@github.com writes:

I'm currently looking into this issue, concentrating on the last file Daniel mentioned - called s008 from now on and its 'brother' s009 from the same publication.

Observations:

- s008 is bad, s009 is ok
- s008 actually contains the title screen and a part of the clip of s009 at the beginning, then comes garbage. This probably means that s008 is a corrupted file that consists of parts of s009 and other stuff.
- both files have the same MD5 checksums on the OAMI server and on Wikimedia Commons, i.e. no error occurred during transfer from OAMI server to WMC.

I find it somehow relieving that the error lies where we can fix it.

The log file (oami_pmc_pmcid_import.log) shows that the bot has tried to convert s008 and s009 on 2014-06-29 and 2014-06-30 but failed on conversion (we should definitely clean up the logging, so much garbage):

The OAMI is already using the logging module, there just is no command line switch for it.

“/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC4070943%2Fbin%2Fpone.0100366.s008.wmv”, saving into “/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC4070943%2Fbin%2Fpone.0100366.s008.wmv.ogg” … 0%
Skipping conversion of “/home/danielmietchen/.cache/open-access-media-importer/media/raw/pmc_pmcid/http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC4070943%2Fbin%2Fpone.0100366.s008.wmv”, earlier attempt failed.

  File "./oa-cache", line 146, in <module>
    f = mutagen.oggtheora.OggTheora(temporary_media_path)
  File "/usr/lib/python2.7/dist-packages/mutagen/__init__.py", line 73, in __init__
    self.load(filename, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/mutagen/ogg.py", line 438, in load
    fileobj = file(filename, "rb")
IOError: [Errno 2] No such file or directory: '/home/danielmietchen/.cache/open-access-media-importer/media/refined/pmc_pmcid/current.ogg'

I do not know the codebase well enough to be able to say why current.ogg could not be found, and whether this is cause or result of some file corruption.

The only code paths I can remember that remove current.ogv are:

  1. oa-cache clear-media (should delete current.ogv)
  2. oa-cache convert-media conversion (should rename current.ogv)

Interestingly, I also saw two different SQL errors in the log - "no such table" and "disk I/O error". While I doubt it, I cannot rule out a strange disk error as the cause of the bugs. We finally received additional hard drives for this server but the connectors have not yet arrived. As soon as possible, we will move /home to the new disks.

Good.

Nevertheless, I consider it a bug that the bot apparently does not delete the output files if conversion failed. This might cause (some of the) problems, as the bot seems to just upload the corrupted file on the next run, without re-trying conversion (because the converted file already exists).

Another sane way out of this that I see is to exit immediately with a non-zero exit code on any kind of (encoding) failure. I think I have advocated crashing on the slightest error in the past; do we have any reasons for not doing that which still apply?

I guess that all corrupt files so far have failed conversion for some strange reason but were uploaded nevertheless, instead of re-trying the conversion first.

oa-cache currently passes on “mutagen.oggtheora.OggTheoraHeaderError” instead of crashing and then … carries on as if it succeeded, renaming current.ogv to the proper file name, writing “done.\n” to standard output and setting material.converted to True. The “pass” statement even contains the comment “Most probably an encoding failure.” – written by myself on 2012-07-21 in the middle of the night (02:46:48 +0200).

I might change this to be more fragile so it immediately crashes if anything in the conversion process does not go according to plan.

@erlehmann - could you have a look into this behavior and confirm whether my assumptions about the code are right?

I cannot say right now if this behaviour is the reason for the problems we are experiencing. I propose we rewrite the conversion failure logic from scratch by working out a state machine for the conversion process, including everything that could possibly go wrong (because it will).

Right now, we have the following states:

• before conversion: converted=False, converting=False
• during conversion: converted=False, converting=True
• after conversion: converted=True, converting=False

The “before conversion” state leads either to the “after conversion” state (if a converted file exists) or to the “during conversion” state (if no converted file exists). The “during conversion” state either leads to the “after conversion” state or remains indefinitely if it exists at the beginning of the conversion process, indicating that an earlier conversion attempt failed. By willfully ignoring the exception “mutagen.oggtheora.OggTheoraHeaderError”, the “during conversion” state leads to the “after conversion” state even when encoding failed.

I therefore propose to either crash or continue to the next material on “mutagen.oggtheora.OggTheoraHeaderError” and see if this reduces how often these errors come up measured over a two-week period or whatever.
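
As a starting point for that discussion, the three states above plus an explicit failure state could be written down roughly as follows. This is a Python 3 sketch under assumptions, not existing OAMI code: `State`, `TRANSITIONS` and `advance` are invented names, the `FAILED` state does not exist in the current database flags, and the shortcut from “before conversion” straight to “after conversion” is deliberately left out so that an existing output file alone never counts as success.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"        # converted=False, converting=False
    CONVERTING = "converting"  # converted=False, converting=True
    CONVERTED = "converted"    # converted=True, converting=False
    FAILED = "failed"          # hypothetical extra state, not in the current code


# Allowed transitions; anything else is treated as a bug.
TRANSITIONS = {
    State.PENDING: {State.CONVERTING},
    State.CONVERTING: {State.CONVERTED, State.FAILED},
    State.CONVERTED: set(),
    State.FAILED: {State.PENDING},  # an explicit retry resets the material
}


def advance(current, new):
    """Return `new` if the transition is allowed, otherwise raise ValueError."""
    if new not in TRANSITIONS[current]:
        raise ValueError("illegal transition %s -> %s" % (current.name, new.name))
    return new
```

In such a scheme an “OggTheoraHeaderError” would map to `advance(State.CONVERTING, State.FAILED)` rather than silently ending up in the converted state.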

So … Daniel, Raphael, should the conversion process rather crash or continue on a bug? I would certainly like crashing more so we can find bugs affecting it earlier, but it probably creates more work short-term.

Nils Dagsson Moskopp // erlehmann http://dieweltistgarnichtso.net

Daniel-Mietchen commented 10 years ago

Some more affected files: https://commons.wikimedia.org/wiki/File:Cep192-Controls-the-Balance-of-Centrosome-and-Non-Centrosomal-Microtubules-during-Interphase-pone.0101001.s009.ogv https://commons.wikimedia.org/wiki/File:Modeling-Glial-Contributions-to-Seizures-and-Epileptogenesis-Cation-Chloride-Cotransporters-in-pone.0101117.s004.ogv https://commons.wikimedia.org/wiki/File:Arp23-Inhibition-Induces-Amoeboid-Like-Protrusions-in-MCF10A-Epithelial-Cells-by-Reduced-pone.0100943.s006.ogv https://commons.wikimedia.org/wiki/File:Rab11-Regulates-Trafficking-of-Trans-sialidase-to-the-Plasma-Membrane-through-the-Contractile-ppat.1004224.s006.ogv https://commons.wikimedia.org/wiki/File:Conformational-Dynamics-of-Dry-Lamellar-Crystals-of-Sugar-Based-Lipids-An-Atomistic-Simulation-Study-pone.0101110.s009.ogv https://commons.wikimedia.org/wiki/File:Modeling-Neutralization-Kinetics-of-HIV-by-Broadly-Neutralizing-Monoclonal-Antibodies-in-Genital-pone.0100598.s003.ogv https://commons.wikimedia.org/wiki/File:Arp23-Inhibition-Induces-Amoeboid-Like-Protrusions-in-MCF10A-Epithelial-Cells-by-Reduced-pone.0100943.s007.ogv https://commons.wikimedia.org/wiki/File:Arp23-Inhibition-Induces-Amoeboid-Like-Protrusions-in-MCF10A-Epithelial-Cells-by-Reduced-pone.0100943.s008.ogv https://commons.wikimedia.org/wiki/File:Rab11-Regulates-Trafficking-of-Trans-sialidase-to-the-Plasma-Membrane-through-the-Contractile-ppat.1004224.s007.ogv

I just stopped the cron job.

I think letting oa-cache crash upon errors in the conversion process is probably the way to go, as long as these cases are logged, so that we can have a look at them more systematically.
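
A crash-but-log policy along those lines might look like the following Python 3 sketch. It only uses the standard `logging` and `sys` modules; the logger name `oami.convert` and the helper `run_or_die` are hypothetical and not taken from oa-cache.

```python
import logging
import sys

log = logging.getLogger("oami.convert")  # hypothetical logger name


def run_or_die(step, *args):
    """Run one conversion step; on any exception, log it and exit with status 1."""
    try:
        return step(*args)
    except Exception:
        # log.exception records the message together with the full traceback.
        log.exception("conversion step %s failed for %r",
                      getattr(step, "__name__", step), args)
        sys.exit(1)
```

A cron wrapper can then tell a clean run from a crash by the exit status, and the log keeps the traceback for each failed file.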

Daniel-Mietchen commented 7 years ago

Another one: https://commons.wikimedia.org/wiki/File:The-Effects-of-Disease-Models-of-Nuclear-Actin-Polymerization-on-the-Nucleus-Video1.ogv