ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

youtube-dl downloads subtitles of [English-CC1] or [English-DTVCC1] instead of [English] #32293

Open xwcq opened 1 year ago

xwcq commented 1 year ago

Question

For some YouTube videos, there are three subtitle options: 1. [English], 2. [English - CC1], 3. [English - DTVCC1]. The first one is the right subtitle; the other two are quite confusing and full of mistakes.

When I used the following command to download subtitles, I found that the downloaded subtitle came from [English - CC1] rather than [English]:

python -m youtube_dl --cookies ../www.youtube.com_cookies.txt --write-sub --sub-lang en -o "../videos/%(title)s.%(ext)s" --verbose --skip-download --convert-subs srt "https://www.youtube.com/watch?v=xbttvPugCoI"

The full debug output is:

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--cookies', '../www.youtube.com_cookies.txt', '--write-sub', '--sub-lang', 'en', '-o', '../videos/%(title)s.%(ext)s', '--verbose', '--skip-download', '--convert-subs', 'srt', 'https://www.youtube.com/watch?v=xbttvPugCoI']
[debug] Encodings: locale cp936, fs utf-8, out utf-8, pref cp936
[debug] youtube-dl version 2021.12.17
[debug] Python 3.11.3 (CPython AMD64 64bit) - Windows-10-10.0.19045-SP0 - OpenSSL 1.1.1t  7 Feb 2023
[debug] exe versions: ffmpeg 6.0-full_build-www.gyan.dev, ffprobe 6.0-full_build-www.gyan.dev
[debug] Proxy map: {'http': 'http://127.0.0.1:58591', 'https': 'http://127.0.0.1:58591', 'ftp': 'http://127.0.0.1:58591'}
[youtube] xbttvPugCoI: Downloading webpage
[debug] [youtube] Decrypted nsig bQeUfc_8w8x-5xdXGX2 => Ui5U1OOaMvAeXQ
[debug] [youtube] Decrypted nsig fUHpH6aFEXI8bGZQJc9 => xLQAiVNQZW4fFg
[debug] Default format spec: bestvideo+bestaudio/best
[info] Writing video subtitles to: ..\videos\Trump indicted in classified documents probe, sources say.en.vtt

The output of the --list-subs command is as follows:

[youtube] xbttvPugCoI: Downloading webpage
[debug] [youtube] Decrypted nsig azamJ6EGK8VZyJ6WaaK => HqeDViV61faDcg
[debug] [youtube] Decrypted nsig QXUslrBxdID5tiRoqzd => nxz-v10Zxj34qg
Available subtitles for xbttvPugCoI:
Language formats
en       vtt, ttml, srv3, srv2, srv1, json3

And my question is, what can I do to download the right version of subtitles?

dirkf commented 1 year ago

Please show examples of the bad subtitles.

AFAIK DTVCC1 and CC1 are US captioning standards.

xwcq commented 1 year ago

Thanks for your reply! https://www.youtube.com/watch?v=xbttvPugCoI is one of the bad cases. In fact, most subtitles of CNN news videos are wrong when DTVCC1/CC1 is chosen.

october262 commented 1 year ago

Add this to your command: --write-auto-sub --sub-lang "en."

You'll want --sub-lang "en." to catch all the variations of English codes on YouTube: https://www.reddit.com/r/youtubedl/comments/wpq4y0/ytdlp_how_to_ensure_download_of_english_subtitles/

xwcq commented 1 year ago

I tried en. / en.* / en*, but none of them worked. My command is:

python -m youtube_dl --cookies ../www.youtube.com_cookies.txt --write-sub --write-auto-sub --sub-lang "en." -o "../videos/%(title)s.%(ext)s" --verbose --skip-download --convert-subs srt "https://www.youtube.com/watch?v=xbttvPugCoI"

and the error message is:

WARNING: en. subtitles not available for xbttvPugCoI

dirkf commented 1 year ago

What is suggested for yt-dlp may (will) not work with yt-dl: compare the man pages for the sttl options.
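
To illustrate the difference, here is a minimal Python sketch. It assumes (check the man pages to confirm) that yt-dl compares each requested language with the listed codes literally, while yt-dlp treats each entry as a regular expression matched against the whole code. The function names are hypothetical and are not taken from either code base; the language codes are the ones that appear in the listings in this thread.

```python
import re


def select_exact(requested, available):
    # Assumed yt-dl-style behaviour: a requested entry must equal a listed code.
    return [lang for lang in requested if lang in available]


def select_regex(patterns, available):
    # Assumed yt-dlp-style behaviour: each entry is a regex that must match
    # the whole language code.
    return [lang for lang in available
            if any(re.fullmatch(p, lang) for p in patterns)]


available = ['en', 'en-uYU-mmqFLq8', 'en-JkeT_87f4cc']

print(select_exact(['en.*'], available))  # literal comparison finds nothing
print(select_regex(['en.*'], available))  # the regex matches every English code
print(select_exact(['en'], available))    # a plain code still works literally
```

Under these assumptions a pattern such as en.* can only ever succeed with the regex-based selection, which would explain the "subtitles not available" warning above.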

xwcq commented 1 year ago

yt-dlp works! Thanks :)

dirkf commented 1 year ago

The extractor sees all three sttl types in the order shown by OP. Since all of them are stored under the same language code (en), the last one seen replaces the others.

If the name parameter, when present, is appended to the language code (I used # because - and _ are already significant in language codes), the three tracks get distinct entries:

$ python -m youtube_dl -v --list-subs --simulate xbttvPugCoI
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'--list-subs', u'--simulate', u'xbttvPugCoI']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 8116c315a
[debug] Python 2.7.18 (CPython i686 32bit) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial - OpenSSL 1.1.1t  7 Feb 2023 - glibc 2.15
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[youtube] xbttvPugCoI: Downloading webpage
[debug] [youtube] Decrypted nsig Y_jkF2DKaJK8jiwgGx => HI3NgB2b-l8wnw
[debug] [youtube] Decrypted nsig 2CdHdQAydqdUKIOdv9 => mE42h7RY4Mx_Ug
Available subtitles for xbttvPugCoI:
Language  formats
en#DTVCC1 vtt, ttml, srv3, srv2, srv1, json3
en#CC1    vtt, ttml, srv3, srv2, srv1, json3
en        vtt, ttml, srv3, srv2, srv1, json3
$

Now the en version is fetched by default.
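
The effect can be reproduced outside the extractor with a small Python sketch. The track records and function names below are hypothetical simplifications, not the extractor's actual data structures; only the en, CC1 and DTVCC1 labels and the # delimiter come from this thread.

```python
# Hypothetical minimal track records, in the order shown by OP.
tracks = [
    {'lang': 'en', 'name': ''},
    {'lang': 'en', 'name': 'CC1'},
    {'lang': 'en', 'name': 'DTVCC1'},
]


def index_by_lang(tracks):
    # Keyed by language only: each later track silently replaces the earlier one.
    subtitles = {}
    for track in tracks:
        subtitles[track['lang']] = track
    return subtitles


def index_by_lang_and_name(tracks, delim='#'):
    # Keyed by language plus name (when a name exists): nothing is overwritten.
    subtitles = {}
    for track in tracks:
        key = delim.join(part for part in (track['lang'], track['name']) if part)
        subtitles[key] = track
    return subtitles


print(index_by_lang(tracks)['en']['name'])
print(sorted(index_by_lang_and_name(tracks)))
```

With the second indexing, a plain en request can only resolve to the unnamed track.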

dirkf commented 1 year ago

@pukkandan, @coletdjnz: would the above change be disruptive? It seems to improve the yt-dl user's experience for very little effort. The sttl IDs supplied by YT don't seem to be meaningful or useful to users.

A second question is why YT is sending these apparently kosher subtitles that turn out to be useless.

pukkandan commented 1 year ago

We currently use lang-id:

Language       Name             Formats
en             English          vtt, ttml, srv3, srv2, srv1, json3
en-uYU-mmqFLq8 English - CC1    vtt, ttml, srv3, srv2, srv1, json3
en-JkeT_87f4cc English - DTVCC1 vtt, ttml, srv3, srv2, srv1, json3

I don't mind switching to lang#name if we are sure there can't be multiple CC1 etc.

dirkf commented 1 year ago

Consider the sttl URL query parameters:

{'expire': ['1687463449'],
 'hl': ['en'],
 'ip': ['0.0.0.0'],
 'ipbits': ['0'],
 'key': ['yt8'],
 'lang': ['en'],
 'name': ['CC1'],
 'opi': ['112496729'],
 'signature': ['47220C5CCDB9A343AA854EA177A786825BD4A5BC.B4D6B8BEF5EA41E78E0FB0FA1E87BF8BAC4DF638'],
 'sparams': ['ip,ipbits,expire,v,opi,xoaf'],
 'v': ['xbttvPugCoI'],
 'xoaf': ['5']}

opi and xoaf are obscure (to me) but a little testing shows that they can stay the same when name changes. The resource must be keyed by v (video_id), lang and name, so names can't be duplicated for the same v and lang. The ID in the caption data {'vssId': '.en.uYU-mmqFLq8'} is not mentioned in the query parameters.

A further point is that the name is also available as a simpleText value in the caption data, but I think it's more reliable to use the URL query parameters.
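
As a rough illustration of deriving the key from the query parameters, here is a standalone Python 3 sketch using only the standard library. The function name subtitle_key and the example.invalid URLs are made up for the example; only the lang and name parameters and the # delimiter come from the discussion above.

```python
from urllib.parse import parse_qs, urlsplit


def subtitle_key(base_url, delim='#'):
    """Build a 'lang' or 'lang#name' key from a subtitle URL's query string."""
    params = parse_qs(urlsplit(base_url).query)
    # parse_qs maps each parameter to a list of values; take the last one.
    lang = (params.get('lang') or [None])[-1]
    if not lang:
        return None
    name = (params.get('name') or [None])[-1]
    return delim.join(part for part in (lang, name) if part)


# Placeholder URLs: host and path are not the real subtitle endpoint.
named = 'https://example.invalid/sttl?v=xbttvPugCoI&lang=en&name=CC1'
plain = 'https://example.invalid/sttl?v=xbttvPugCoI&lang=en'

print(subtitle_key(named))
print(subtitle_key(plain))
```

Because parse_qs drops blank values by default, a URL carrying an empty name= parameter is treated the same as one without a name.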

dirkf commented 1 year ago

Then the existing code can be simplified (?):

@@ -2219,22 +2221,27 @@ class YoutubeIE(YoutubeBaseInfoExtractor):
                 container[lang_code] = lang_subs

             subtitles = {}
-            for caption_track in (pctr.get('captionTracks') or []):
+            for caption_track in traverse_obj(pctr, ('captionTracks', Ellipsis, {dict})):
                 base_url = caption_track.get('baseUrl')
                 if not base_url:
                     continue
                 if caption_track.get('kind') != 'asr':
-                    lang_code = caption_track.get('languageCode')
+                    params = parse_qs(base_url)
+                    lang_code = traverse_obj(params, ('lang', -1))
                     if not lang_code:
                         continue
+                    if lang_code != caption_track.get('languageCode'):
+                        self.report_warning(
+                            'Unexpected subtitle data format: %s != %s'
+                             % (lang_code, caption_track.get('languageCode')))
+                        continue
+                    lang_code = join_nonempty(lang_code, traverse_obj(params, ('name', -1)), delim='#')
                     process_language(
                         subtitles, base_url, lang_code, {})
                     continue
                 automatic_captions = {}
-                for translation_language in (pctr.get('translationLanguages') or []):
-                    translation_language_code = translation_language.get('languageCode')
-                    if not translation_language_code:
-                        continue
+                for translation_language_code in traverse_obj(
+                        pctr, ('translationLanguages', Ellipsis, 'languageCode')):
                     process_language(
                         automatic_captions, base_url, translation_language_code,
                         {'tlang': translation_language_code})

Xelbayria commented 1 year ago

A quick question: does yt-dlp have the ability to display or download auto-generated subtitles?

DETAILS: What I am trying to do is similar to this issue, but it's not about downloading; it's about streaming a video from YouTube (YT) to mpv with yt-dlp. I ran yt-dlp --list-subs to get a list of the subs available for the video. Strangely, it said ocqvRdcM_pk has no subtitles. The video actually has auto-generated subtitles (English).

dirkf commented 1 year ago

Auto-generated subtitles are called "automatic captions" in yt-dl and yt-dlp. Ask about yt-dlp there.

However, yt-dl lists a whole lot of available automatic captions, as does yt-dlp, but sorted by language name.

Xelbayria commented 1 year ago

By "yt-dl" you meant "youtube-dl", correct? If yes, then I'll go there; if not, please clarify your answer. You can give me a thumbs-up if it's a yes.

I see I am on the wrong GitHub repository (youtube-dl). I'm heading to yt-dlp.