ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

[drtv] subtitle empty #30653

Open dpriskorn opened 2 years ago

dpriskorn commented 2 years ago

Verbose log

$ youtube-dl -f HLS-560 https://www.dr.dk/drtv/se/spionkrigen-i-ringsted_-agenten_297110 --write-sub -v
[debug] System config: []
[debug] User config: ['--restrict-filenames']
[debug] Custom config: []
[debug] Command-line args: ['--prefer-free-formats', '-t', '-f', 'HLS-560', 'https://www.dr.dk/drtv/se/spionkrigen-i-ringsted_-agenten_297110', '--write-sub', '-v']
WARNING: --title is deprecated. Use -o "%(title)s-%(id)s.%(ext)s" instead.
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.10.2 (CPython) - Linux-5.10.89-gnu1-1-lts-x86_64-with-glibc2.35
[debug] exe versions: ffmpeg present, ffprobe present, rtmpdump 2.4
[debug] Proxy map: {}
[drtv] spionkrigen-i-ringsted_-agenten_297110: Downloading webpage
[drtv] 00242105010: Downloading video JSON
[drtv] 00242105010: Downloading m3u8 information
[drtv] 00242105010: Downloading m3u8 information
[drtv] 00242105010: Downloading m3u8 information
[info] Writing video subtitles to: Spionkrigen_i_Ringsted_1_4_-_Agenten-00242105010.da.vtt
[debug] Invoking downloader on 'https://drod09h-vh.akamaihd.net/i/all/clear/streaming/5d/61f40be9aa5a612b344e0c5d/Spionkrigen-i-Ringsted_b8e4eadf521344929e67691987d35f10_,500,1100,2000,3500,5500,.mp4.csmil/index_0_av.m3u8?null=0'
[download] Spionkrigen_i_Ringsted_1_4_-_Agenten-00242105010.mp4 has already been downloaded
[download] 100% of 126.82MiB
[debug] ffmpeg command line: ffprobe -show_streams file:Spionkrigen_i_Ringsted_1_4_-_Agenten-00242105010.mp4

Description

The downloaded subtitle is practically empty.

$ cat Spionkrigen_i_Ringsted_1_4_-_Agenten-00242105010.da.vtt 
WEBVTT

The official player has subtitles that can be enabled (the video contains a lot of arabic)

dirkf commented 2 years ago

I can reproduce this. It looks like the same issue would be seen with yt-dlp too, so no easy fix there.

There are pending fixes for DRTV, but I don't think subtitles are affected, so some debugging is needed.

dpriskorn commented 2 years ago

I debugged a little. This request for the master.m3u8 contains the link to the VTT playlist that lists all the subtitle segments:

curl 'https://drod09h-vh.akamaihd.net/i/all/clear/streaming/5d/61f40be9aa5a612b344e0c5d/Spionkrigen-i-Ringsted_b8e4eadf521344929e67691987d35f10_,500,1100,2000,3500,5500,.mp4.csmil/master.m3u8?cc1=name=Fremmedsprogstekster~default=yes~forced=no~lang=da~uri=https://drod09h-vh.akamaihd.net/p/allx/clear/download/5d/61f40be9aa5a612b344e0c5d/subtitles/Foreign-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1/playlist.m3u8&cc2=name=Dansk~default=no~forced=no~lang=da~uri=https://drod09h-vh.akamaihd.net/p/allx/clear/download/5d/61f40be9aa5a612b344e0c5d/subtitles/Foreign_HardOfHearing-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1/playlist.m3u8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Origin: https://www.dr.dk' -H 'Connection: keep-alive' -H 'Referer: https://www.dr.dk/' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: cross-site'

The link is in a comment: https://drod09h-vh.akamaihd.net/p/allx/clear/download/5d/61f40be9aa5a612b344e0c5d/subtitles/Foreign-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1/playlist.m3u8. Of the segments that playlist lists, segment 6, for example, has subtitles.
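
The cc1 and cc2 query parameters in the master.m3u8 request above appear to pack several key=value fields separated by ~. Here is a minimal Python sketch for pulling the subtitle playlist URIs out of such a URL. The function name parse_cc_params is hypothetical, and the ~ separated layout is inferred from this one request, not from any documented format.

```python
from urllib.parse import parse_qsl, urlsplit

BASE = 'https://drod09h-vh.akamaihd.net'
SUBS = BASE + '/p/allx/clear/download/5d/61f40be9aa5a612b344e0c5d/subtitles/'

# The master.m3u8 URL from the curl command above, rebuilt from its parts.
MASTER_URL = (
    BASE
    + '/i/all/clear/streaming/5d/61f40be9aa5a612b344e0c5d/'
    + 'Spionkrigen-i-Ringsted_b8e4eadf521344929e67691987d35f10_'
    + ',500,1100,2000,3500,5500,.mp4.csmil/master.m3u8'
    + '?cc1=name=Fremmedsprogstekster~default=yes~forced=no~lang=da~uri='
    + SUBS + 'Foreign-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1/playlist.m3u8'
    + '&cc2=name=Dansk~default=no~forced=no~lang=da~uri='
    + SUBS + 'Foreign_HardOfHearing-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1/playlist.m3u8'
)


def parse_cc_params(master_url):
    """Return one dict per ccN query parameter (hypothetical helper)."""
    tracks = []
    for key, value in parse_qsl(urlsplit(master_url).query):
        if not key.startswith('cc'):
            continue
        # Assumed layout: name=...~default=...~forced=...~lang=...~uri=...
        # partition() splits on the first '=' only; [::2] keeps (key, value).
        tracks.append(dict(part.partition('=')[::2] for part in value.split('~')))
    return tracks


for track in parse_cc_params(MASTER_URL):
    print(track['name'], track['lang'], track['uri'])
```

Each returned uri points at a playlist.m3u8 that lists the subtitle segments. Those segments would still have to be fetched and joined to get a complete .vtt file.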

dirkf commented 2 years ago

So --list-subs gives

Available subtitles for 00242105010:
Language formats
da       vtt, vtt, vtt, vtt, vtt

while --all-subs just downloads the single dud .vtt.

The extractor looks up the show in https://www.dr.dk/mu-online/api/1.4/programcard. There are three assets in the returned programme metadata, each with two subtitle URLs, except the third which only has one.

[[
  {
    "MimeType": "text/vtt;charset=utf-8",
    "Type": "Foreign",
    "Uri": "https://drod09h-vh.akamaihd.net/p/allx/clear/download/5d/61f40be9aa5a612b344e0c5d/subtitles/Foreign-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1.vtt",
    "Language": "Danish"
  },
  {
    "MimeType": "text/vtt;charset=utf-8",
    "Type": "Foreign_HardOfHearing",
    "Uri": "https://drod09h-vh.akamaihd.net/p/allx/clear/download/5d/61f40be9aa5a612b344e0c5d/subtitles/Foreign_HardOfHearing-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1.vtt",
    "Language": "Danish"
  }
],
[
  {
    "MimeType": "text/vtt;charset=utf-8",
    "Type": "Foreign",
    "Uri": "https://drod04f-vh.akamaihd.net/p/allx/clear/download/89/61f40c63af5a612af86c7e89/subtitles/Foreign-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1.vtt",
    "Language": "Danish"
  },
  {
    "MimeType": "text/vtt;charset=utf-8",
    "Type": "Foreign_HardOfHearing",
    "Uri": "https://drod04f-vh.akamaihd.net/p/allx/clear/download/89/61f40c63af5a612af86c7e89/subtitles/Foreign_HardOfHearing-19324737-6e40ee8a-1569-4676-bdc0-3396bd943dd1.vtt",
    "Language": "Danish"
  }
],
[
  {
    "MimeType": "text/vtt;charset=utf-8",
    "Type": "Foreign_HardOfHearing",
    "Uri": "https://drod01e-vh.akamaihd.net/p/allx/clear/download/9e/61fbf808a95a612450c32a9e/subtitles/Foreign_HardOfHearing-19324737-09b133ba-24c7-4b05-a548-a12d9396d3f0.vtt",
    "Language": "Danish"
  }
]]

This becomes clearer when we look at each asset:

(Pdb) p assets[0]
{u'Kind': u'VideoResource', u'Target': u'Default', ...
(Pdb) p assets[1]
{u'Kind': u'VideoResource', u'Target': u'SpokenSubtitles', ...
(Pdb) p assets[2]
{u'Kind': u'VideoResource', u'Target': u'SignLanguage', ...

If no --sub-format option is specified, the fifth URL in the list above is selected as the best, because the subtitle list is supposed to be sorted from least to most preferred and the last entry wins. That fifth URL is the dud subtitle file for the signed version.
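
The selection rule can be modelled in a few lines of Python. This is a simplified sketch of the behaviour described here, not the actual youtube-dl code. The function name pick_subtitle and the label field are made up for illustration, and the URLs are left out.

```python
def pick_subtitle(formats, sub_format='best'):
    """Simplified model: pick one entry from a language's subtitle list.

    `formats` is assumed to be ordered from least to most preferred,
    so 'best' simply means "take the last entry".
    """
    for ext in sub_format.split('/'):
        if ext == 'best':
            return formats[-1]
        matches = [f for f in formats if f['ext'] == ext]
        if matches:
            # Among entries with the requested extension, the last one wins.
            return matches[-1]
    return formats[-1]


# The five 'da' entries in the order the assets were listed above.
# Labels are illustrative: asset Target plus position within that asset.
da_formats = [
    {'ext': 'vtt', 'label': 'Default #1'},
    {'ext': 'vtt', 'label': 'Default #2'},
    {'ext': 'vtt', 'label': 'SpokenSubtitles #1'},
    {'ext': 'vtt', 'label': 'SpokenSubtitles #2'},
    {'ext': 'vtt', 'label': 'SignLanguage #1'},
]

chosen = pick_subtitle(da_formats)
print(chosen['label'])
```

In this model, passing a format such as vtt would not help either: all five entries share the same extension, so the last one is still chosen.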

There are several problems here. Looking at the available subtitles:

  1. Danish Foreign: presumably the non-Danish speech is rendered;
  2. Danish Foreign_HardOfHearing: presumably all speech is rendered?
  3. Danish Foreign SpokenSubtitles: presumably the non-Danish speech is rendered (actually same as 1);
  4. Danish Foreign_HardOfHearing SpokenSubtitles: presumably all speech is rendered? (probably same as 2)
  5. Danish Foreign SignLanguage: it's not obvious why there shouldn't be valid subtitles available so that non-DSL 'speakers' can also watch and get the foreign speech translated.

It's also not obvious why there shouldn't be Danish Foreign_HardOfHearing SignLanguage subtitles to allow DSL and non-DSL speakers to watch together.

The extractor could assign a different language code for subtitles extracted from SignLanguage and VisuallyInterpreted Targets, such as sgn-dsl.

There isn't an official language code that means "translations into language X from other languages, plus the original language X speech". Maybe we could invent dan-da for this.

--- old/youtube-dl/youtube_dl/extractor/drtv.py
+++ new/youtube-dl/youtube_dl/extractor/drtv.py
@@ -15,6 +15,7 @@
     int_or_none,
     intlist_to_bytes,
     float_or_none,
+    ISO639Utils,
     mimetype2ext,
     str_or_none,
     try_get,
@@ -268,7 +269,12 @@
                     if not sub_uri:
                         continue
                     lang = subs.get('Language') or 'da'
-                    subtitles.setdefault(LANGS.get(lang, lang), []).append({
+                    lang = LANGS.get(lang, lang)
+                    if asset_target in ('SignLanguage', 'VisuallyInterpreted'):
+                        lang = 'sgn' + ('-dsl' if lang == 'da' else '')
+                    elif 'HardOfHearing' in subs.get('Type', ''):
+                        lang = '-'.join((ISO639Utils.short2long(lang), lang))
+                    subtitles.setdefault(lang, []).append({
                         'url': sub_uri,
                         'ext': mimetype2ext(subs.get('MimeType')) or 'vtt'
                     })

With this patch, --list-subs gives the output below, and the Danish Foreign subtitle file can be selected with --sub-lang da:

Available subtitles for 00242105010:
Language formats
sgn-dsl  vtt
dan-da   vtt, vtt
da       vtt, vtt
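
To check that grouping without running the extractor, the language-code logic from the patch can be reproduced as a standalone Python sketch. LANGS and SHORT2LONG here are small hypothetical stand-ins for the extractor's own table and for ISO639Utils.short2long, and the asset list is reduced to the Target and Type values shown earlier.

```python
# Hypothetical stand-ins: the extractor has its own LANGS table, and the
# patch calls ISO639Utils.short2long() rather than using a dict like this.
LANGS = {'Danish': 'da'}
SHORT2LONG = {'da': 'dan'}


def subtitle_lang(asset_target, subs):
    """Mirror the patched branch: choose the key a subtitle is filed under."""
    lang = subs.get('Language') or 'da'
    lang = LANGS.get(lang, lang)
    if asset_target in ('SignLanguage', 'VisuallyInterpreted'):
        return 'sgn' + ('-dsl' if lang == 'da' else '')
    if 'HardOfHearing' in subs.get('Type', ''):
        # Unlike the real helper, this stand-in falls back to the short code.
        return '-'.join((SHORT2LONG.get(lang, lang), lang))
    return lang


# (Target, subtitle Types) for the three assets in the metadata above.
assets = [
    ('Default', ['Foreign', 'Foreign_HardOfHearing']),
    ('SpokenSubtitles', ['Foreign', 'Foreign_HardOfHearing']),
    ('SignLanguage', ['Foreign_HardOfHearing']),
]

subtitles = {}
for target, types in assets:
    for sub_type in types:
        subs = {'Type': sub_type, 'Language': 'Danish'}
        subtitles.setdefault(subtitle_lang(target, subs), []).append(sub_type)

for lang, entries in subtitles.items():
    print(lang, len(entries))
```

Because the Target check comes first, the single subtitle of the signed asset is filed under sgn-dsl even though its Type contains HardOfHearing. That leaves two entries each under da and dan-da, which matches the listing.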

Although yt-dlp has a small change in subtitle processing, this issue would also apply there.