ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Support converting multilingual TTML to srt #12303

Open yan12125 opened 7 years ago

yan12125 commented 7 years ago

What is the purpose of your issue?


$ youtube-dl -v --write-sub --convert-subs srt test:daisuki
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '--write-sub', '--convert-subs', 'srt', 'test:daisuki']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.02.27
[debug] Git HEAD: 7c4aa6fd6
[debug] Python version 3.6.0 - Linux-4.10.1-1-ARCH-x86_64-with-arch
[debug] exe versions: ffmpeg 3.2.4, ffprobe 3.2.4, rtmpdump 2.4
[debug] Proxy map: {}
[TestURL] Test URL: http://www.daisuki.net/tw/en/anime/watch.TheIdolMasterCG.11213.html
[Daisuki] 11213: Downloading webpage
[Daisuki] 11213: Downloading JSON metadata
[Daisuki] 11213: Downloading m3u8 information
[info] Writing video subtitles to: #01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mul.ttml
[debug] Invoking downloader on 'https://bngn-vh.akamaihd.net/i/43383936/35470338/smil/TW/00005/454886408824423.smil/index_6000000_av.m3u8?null=0&id=AgCMcBxnoxzgBe+JtVig6tjALsUYU9c4vLlbWNR%2fIjKLjO3tedogpOqsv80VcutRxOme6T2ME6x0%2fQ%3d%3d'
[hlsnative] Downloading m3u8 manifest
WARNING: hlsnative has detected features it does not support, extraction will be delegated to ffmpeg
[download] Destination: #01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mp4
[debug] ffmpeg command line: ffmpeg -y -headers 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-us,en;q=0.5
Cookie: _alid_=dmV/+lBznv+Bca+is2H0ew==; hdntl=exp=1488378735~acl=%2f*~data=hdntl~hmac=e9d75e91b0278ee8489e785fb97e5f8d2a203dc603fbeae88c22eccbc6be5e63
' -i 'https://bngn-vh.akamaihd.net/i/43383936/35470338/smil/TW/00005/454886408824423.smil/index_6000000_av.m3u8?null=0&id=AgCMcBxnoxzgBe+JtVig6tjALsUYU9c4vLlbWNR%2fIjKLjO3tedogpOqsv80VcutRxOme6T2ME6x0%2fQ%3d%3d' -c copy -f mp4 'file:#01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mp4.part'
ffmpeg version 3.2.4 Copyright (c) 2000-2017 the FFmpeg developers
  built with gcc 6.3.1 (GCC) 20170109
  configuration: --prefix=/usr --disable-debug --disable-static --disable-stripping --enable-avisynth --enable-avresample --enable-fontconfig --enable-gmp --enable-gnutls --enable-gpl --enable-ladspa --enable-libass --enable-libbluray --enable-libfreetype --enable-libfribidi --enable-libgsm --enable-libiec61883 --enable-libmodplug --enable-libmp3lame --enable-libopencore_amrnb --enable-libopencore_amrwb --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-libschroedinger --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libv4l2 --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-netcdf --enable-shared --enable-version3 --enable-x11grab
  libavutil      55. 34.101 / 55. 34.101
  libavcodec     57. 64.101 / 57. 64.101
  libavformat    57. 56.101 / 57. 56.101
  libavdevice    57.  1.100 / 57.  1.100
  libavfilter     6. 65.100 /  6. 65.100
  libavresample   3.  1.  0 /  3.  1.  0
  libswscale      4.  2.100 /  4.  2.100
  libswresample   2.  3.100 /  2.  3.100
  libpostproc    54.  1.100 / 54.  1.100
[NULL @ 0x562846175960] non-existing SPS 0 referenced in buffering period
[NULL @ 0x562846175960] SPS unavailable in decode_picture_timing                                                                        
[h264 @ 0x56284624b520] non-existing SPS 0 referenced in buffering period                                                               
[h264 @ 0x56284624b520] SPS unavailable in decode_picture_timing                                                                        
Input #0, hls,applehttp, from 'https://bngn-vh.akamaihd.net/i/43383936/35470338/smil/TW/00005/454886408824423.smil/index_6000000_av.m3u8?null=0&id=AgCMcBxnoxzgBe+JtVig6tjALsUYU9c4vLlbWNR%2fIjKLjO3tedogpOqsv80VcutRxOme6T2ME6x0%2fQ%3d%3d':
  Duration: 00:24:00.00, start: 0.100667, bitrate: 0 kb/s
  Program 0 
    Metadata:
      variant_bitrate : 0
    Stream #0:0: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709), 1920x1080 [SAR 1:1 DAR 16:9], 23.98 fps, 23.98 tbr, 90k tbn, 47.95 tbc
    Metadata:
      variant_bitrate : 0
    Stream #0:1: Audio: aac (LC) ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp
    Metadata:
      variant_bitrate : 0
Output #0, mp4, to 'file:#01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mp4.part':
  Metadata:
    encoder         : Lavf57.56.101
    Stream #0:0: Video: h264 (High) ([33][0][0][0] / 0x0021), yuv420p(tv, bt709), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 23.98 fps, 23.98 tbr, 90k tbn, 90k tbc
    Metadata:
      variant_bitrate : 0
    Stream #0:1: Audio: aac (LC) ([64][0][0][0] / 0x0040), 48000 Hz, stereo
    Metadata:
      variant_bitrate : 0
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
  Stream #0:1 -> #0:1 (copy)
Press [q] to stop, [?] for help
frame=34525 fps=347 q=-1.0 Lsize=  845745kB time=00:23:59.97 bitrate=4811.4kbits/s speed=14.5x    
video:811849kB audio:33986kB subtitle:0kB other streams:0kB global headers:1kB muxing overhead: unknown
Exception ignored in: <_io.FileIO name=6 mode='wb' closefd=True>
ResourceWarning: unclosed file <_io.BufferedWriter name=6>
[ffmpeg] Downloaded 866043111 bytes
[download] 100% of 825.92MiB
[download] 100% of 825.92MiB
[debug] ffmpeg command line: ffprobe -show_streams 'file:#01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mp4'
[ffmpeg] Fixing malformated aac bitstream in "#01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mp4"
[debug] ffmpeg command line: ffmpeg -y -i 'file:#01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mp4' -c copy -f mp4 -bsf:a aac_adtstoasc 'file:#01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.temp.mp4'
[ffmpeg] Converting subtitles
WARNING: You have requested to convert dfxp (TTML) subtitles into another format, which results in style information loss
Deleting original file #01 Who is in the pumpkin carriage - THE IDOLM@STER CINDERELLA GIRLS-11213.mul.ttml (pass -k to keep)

Note that youtube-dl does not support sites dedicated to copyright infringement. In order for site support request to be accepted all provided example URLs should not violate any copyrights.


Description of your issue, suggested solution and other information

test:daisuki has a TTML subtitle at http://bngnwww.b-ch.com/caption/35470338/1206/275503087581916/0817102633.xml that contains multiple languages:

    <div xml:lang="English">
    <p begin="00:00:08.690" end="00:00:12.150" style="1">
    It was just a little while ago...
    </p>
    ...
    </div>
    <div xml:lang="Korean">
    <p begin="00:00:08.519" end="00:00:12.078" style="1">
    얼마 전까지 우리는
    </p>
    ...
    </div>

It seems SRT does not support multiple languages in the same file? If so, dfxp2srt should return a lang => subtitle dictionary, and FFmpegSubtitlesConvertorPP needs to handle multiple files.
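To illustrate the lang => subtitle idea, here is a minimal sketch using only the standard library. It is not youtube-dl's actual dfxp2srt: `split_ttml_by_lang` and `srt_timestamp` are hypothetical names, the namespace is assumed to be the `ttaf1` one this file uses, timestamps are assumed to be in the `HH:MM:SS.mmm` form shown above, and styling and `<br/>` line breaks are ignored.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (the one used by the daisuki file); other TTML
# documents may declare a different one.
TTML_NS = 'http://www.w3.org/2006/04/ttaf1'
# xml:lang lives in the predefined XML namespace.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'


def srt_timestamp(ttml_time):
    """Convert 'HH:MM:SS.mmm' into the SRT form 'HH:MM:SS,mmm'."""
    return ttml_time.replace('.', ',')


def split_ttml_by_lang(ttml_text):
    """Hypothetical helper: return {lang: srt_text}, one entry per xml:lang."""
    root = ET.fromstring(ttml_text)
    cues_by_lang = {}
    for div in root.iter('{%s}div' % TTML_NS):
        # Divs without xml:lang are grouped under 'und' (undetermined).
        cues = cues_by_lang.setdefault(div.get(XML_LANG, 'und'), [])
        for p in div.iter('{%s}p' % TTML_NS):
            text = ''.join(p.itertext()).strip()
            cues.append((p.get('begin'), p.get('end'), text))

    result = {}
    for lang, cues in cues_by_lang.items():
        # SRT cues are numbered from 1 within each output file.
        result[lang] = '\n'.join(
            '%d\n%s --> %s\n%s\n' % (
                index, srt_timestamp(begin), srt_timestamp(end), text)
            for index, (begin, end, text) in enumerate(cues, start=1))
    return result
```

With a mapping like this, the postprocessor could write one .srt file per key instead of a single file, which is the "handle multiple files" part of the suggestion.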

Ref: #4738

federicorosso1993 commented 7 years ago

I was able to use xmlstarlet and ttml2srt.py (by nomoketo) to extract only my own language from a multi-language TTML subtitle file:

xmlstarlet ed -N ns=http://www.w3.org/2006/04/ttaf1 -d "//ns:div[not(contains(@xml:lang,'Italian'))]" "/path/of/the/original/subtitle.mul.ttml" > "/path/to/save/subtitle.ttml" && python3 ttml2srt.py "/path/to/just/converted/subtitle.ttml" > "/path/to/save/subtitle.srt"

Since TTML is an XML file, you can use xmlstarlet with the correct namespace to delete every div whose xml:lang does not contain the language you want, which leaves a single language from Daisuki's multi-language TTML files. ttml2srt.py is only a basic TTML-to-SRT converter (maybe you use a better one).
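For anyone without xmlstarlet, the same filtering step can be sketched in Python with the standard library. This is only an approximation of the command above: `keep_language` is a hypothetical name, and the namespace is assumed to be the same `ttaf1` one passed to `-N`.

```python
import xml.etree.ElementTree as ET

# Same namespace as in the xmlstarlet command (assumption for other files).
TTML_NS = 'http://www.w3.org/2006/04/ttaf1'
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'


def keep_language(ttml_text, language):
    """Hypothetical helper: drop every <div> whose xml:lang lacks `language`."""
    # Keep the TTML namespace as the default one when serializing.
    ET.register_namespace('', TTML_NS)
    root = ET.fromstring(ttml_text)
    div_tag = '{%s}div' % TTML_NS
    # ElementTree has no parent pointers, so visit each element and prune
    # its children. Snapshot both sequences because the tree is mutated.
    for parent in list(root.iter()):
        for child in list(parent):
            # Like not(contains(@xml:lang, ...)): a div with no xml:lang
            # attribute is removed as well.
            if child.tag == div_tag and language not in child.get(XML_LANG, ''):
                parent.remove(child)
    return ET.tostring(root, encoding='unicode')
```

The returned string can be saved as a single-language .ttml file and then passed to whichever TTML-to-SRT converter you prefer.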