Open xwcq opened 1 year ago
Please show examples of the bad subtitles.
AFAIK DTVCC1 and CC1 are US captioning standards.
Thanks for your reply! https://www.youtube.com/watch?v=xbttvPugCoI
is one of the bad cases. In fact, most subtitles of CNN news videos are wrong when DTVCC1/CC1 is chosen.
Add this to your command: --write-auto-sub --sub-lang "en." You'll want --sub-lang "en." to catch all the variations of English codes on YouTube: https://www.reddit.com/r/youtubedl/comments/wpq4y0/ytdlp_how_to_ensure_download_of_english_subtitles/
I tried "en.", "en.*" and "en*", but none of them worked. My command is
python -m youtube_dl --cookies ../www.youtube.com_cookies.txt --write-sub --write-auto-sub --sub-lang "en." -o "../videos/%(title)s.%(ext)s" --verbose --skip-download --convert-subs srt "https://www.youtube.com/watch?v=xbttvPugCoI"
and the error message is
WARNING: en. subtitles not available for xbttvPugCoI
What is suggested for yt-dlp may (will) not work with yt-dl: compare the man pages for the sttl options.
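To illustrate why the same --sub-lang value can behave differently in the two tools, here is a minimal sketch, not code from either project. It assumes that yt-dl looks up each requested language literally (which is consistent with the warning quoted above) and that yt-dlp treats each requested language as a regular expression matched against the whole code. The function names and the list of codes are made up for the example.

```python
import re

# Hypothetical set of subtitle codes reported for a video.
AVAILABLE = ["en", "en-US", "en-GB", "fr"]


def match_literal(requested, available):
    """Exact lookup: a requested code matches only an identical code."""
    return [code for code in available if code in requested]


def match_regex(requested, available):
    """Regex lookup: each requested entry is a pattern for the whole code."""
    return [
        code for code in available
        if any(re.fullmatch(pattern, code) for pattern in requested)
    ]


print(match_literal(["en.*"], AVAILABLE))  # no code is literally "en.*"
print(match_regex(["en.*"], AVAILABLE))    # every code starting with "en"
```

Under these assumptions, a pattern such as "en.*" only helps with a tool that interprets it as a pattern; with literal lookup the exact code has to be given.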
yt-dlp works! Thanks :)
The extractor sees all the three sttl types in the order shown by OP. The last one seen replaces the others.
If any name parameter is added to the language code (I used # because - and _ are already significant in language codes):
$ python -m youtube_dl -v --list-subs --simulate xbttvPugCoI
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'--list-subs', u'--simulate', u'xbttvPugCoI']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 8116c315a
[debug] Python 2.7.18 (CPython i686 32bit) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial - OpenSSL 1.1.1t 7 Feb 2023 - glibc 2.15
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[youtube] xbttvPugCoI: Downloading webpage
[debug] [youtube] Decrypted nsig Y_jkF2DKaJK8jiwgGx => HI3NgB2b-l8wnw
[debug] [youtube] Decrypted nsig 2CdHdQAydqdUKIOdv9 => mE42h7RY4Mx_Ug
Available subtitles for xbttvPugCoI:
Language formats
en#DTVCC1 vtt, ttml, srv3, srv2, srv1, json3
en#CC1 vtt, ttml, srv3, srv2, srv1, json3
en vtt, ttml, srv3, srv2, srv1, json3
$
Now the en version is fetched by default.
@pukkandan, @coletdjnz: would the above change be disruptive? It seems to improve the yt-dl user's experience for very little effort. The sttl IDs supplied by YT don't seem to be meaningful or useful to users.
A second question is why YT is sending these apparently kosher subtitles that turn out to be useless.
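The overwriting described earlier in this comment (the last track seen replaces the others) and the effect of appending the name can be sketched as follows. This is an illustration only, not the extractor code: the track list, its order and the URLs are made up, and the helper names are hypothetical.

```python
# Made-up caption tracks; the order is an assumption for the example.
TRACKS = [
    {"languageCode": "en", "name": "", "baseUrl": "url-plain"},
    {"languageCode": "en", "name": "CC1", "baseUrl": "url-cc1"},
    {"languageCode": "en", "name": "DTVCC1", "baseUrl": "url-dtvcc1"},
]


def key_by_language(tracks):
    """Key tracks by language code only; later tracks overwrite earlier ones."""
    subtitles = {}
    for track in tracks:
        subtitles[track["languageCode"]] = track["baseUrl"]
    return subtitles


def key_by_language_and_name(tracks):
    """Append the track name after '#' when there is one, so keys stay distinct."""
    subtitles = {}
    for track in tracks:
        key = "#".join(part for part in (track["languageCode"], track["name"]) if part)
        subtitles[key] = track["baseUrl"]
    return subtitles


print(key_by_language(TRACKS))           # a single "en" entry survives
print(key_by_language_and_name(TRACKS))  # three separate entries
```

With the name in the key, a plain "en" request selects the unnamed track instead of whichever track happened to be processed last.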
We currently use lang-id:
Language Name Formats
en English vtt, ttml, srv3, srv2, srv1, json3
en-uYU-mmqFLq8 English - CC1 vtt, ttml, srv3, srv2, srv1, json3
en-JkeT_87f4cc English - DTVCC1 vtt, ttml, srv3, srv2, srv1, json3
I don't mind switching to lang#name if we are sure there can't be multiple CC1 etc.
Consider the sttl URL query parameters:
{'expire': ['1687463449'],
'hl': ['en'],
'ip': ['0.0.0.0'],
'ipbits': ['0'],
'key': ['yt8'],
'lang': ['en'],
'name': ['CC1'],
'opi': ['112496729'],
'signature': ['47220C5CCDB9A343AA854EA177A786825BD4A5BC.B4D6B8BEF5EA41E78E0FB0FA1E87BF8BAC4DF638'],
'sparams': ['ip,ipbits,expire,v,opi,xoaf'],
'v': ['xbttvPugCoI'],
'xoaf': ['5']}
opi and xoaf are obscure (to me), but a little testing shows that they can stay the same when name changes. The resource must be keyed by v (video_id), lang and name, so names can't be duplicated for the same v and lang. The ID in the caption data {'vssId': '.en.uYU-mmqFLq8'} is not mentioned in the query parameters.
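As a rough illustration of reading lang and name from query parameters like those listed above, here is a sketch using only the Python 3 standard library. The URL host and path, the shortened parameter list and the subtitle_key helper are assumptions made for the example; the patch in this thread uses yt-dl's own helpers instead.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Illustrative URL only: the host/path and the shortened parameter list are
# assumptions for this example, not a working caption link.
EXAMPLE_URL = "https://www.youtube.com/api/timedtext?" + urlencode(
    {"v": "xbttvPugCoI", "lang": "en", "name": "CC1", "hl": "en"}
)


def subtitle_key(url, delim="#"):
    """Return 'lang#name' (or just 'lang') from a caption URL, else None."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs maps each parameter to a list; take the last value if present.
    lang = params.get("lang", [None])[-1]
    if not lang:
        return None
    name = params.get("name", [None])[-1]
    return delim.join(part for part in (lang, name) if part)


print(subtitle_key(EXAMPLE_URL))
```

A track without a name parameter keeps the bare language code as its key, so existing requests for "en" would still resolve.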
A further point is that the name is also available as a simpleText value in the caption data, but I think it's more reliable to use the URL query parameters.
Then the existing code can be simplified (?):
@@ -2219,22 +2221,27 @@ class YoutubeIE(YoutubeBaseInfoExtractor):
                 container[lang_code] = lang_subs
             subtitles = {}
-            for caption_track in (pctr.get('captionTracks') or []):
+            for caption_track in traverse_obj(pctr, ('captionTracks', Ellipsis, {dict})):
                 base_url = caption_track.get('baseUrl')
                 if not base_url:
                     continue
                 if caption_track.get('kind') != 'asr':
-                    lang_code = caption_track.get('languageCode')
+                    params = parse_qs(base_url)
+                    lang_code = traverse_obj(params, ('lang', -1))
                     if not lang_code:
                         continue
+                    if lang_code != caption_track.get('languageCode'):
+                        self.report_warning(
+                            'Unexpected subtitle data format: %s != %s'
+                            % (lang_code, caption_track.get('languageCode')))
+                        continue
+                    lang_code = join_nonempty(lang_code, traverse_obj(params, ('name', -1)), delim='#')
                     process_language(
                         subtitles, base_url, lang_code, {})
                     continue
                 automatic_captions = {}
-                for translation_language in (pctr.get('translationLanguages') or []):
-                    translation_language_code = translation_language.get('languageCode')
-                    if not translation_language_code:
-                        continue
+                for translation_language_code in traverse_obj(
+                        pctr, ('translationLanguages', Ellipsis, 'languageCode')):
                     process_language(
                         automatic_captions, base_url, translation_language_code,
                         {'tlang': translation_language_code})
A quick question: does yt-dlp have the capacity to display auto-generated subtitles or download them?
DETAILS:
What I am trying is pretty similar to this issue, but it's not about downloading; it's about streaming a video from YouTube (YT) to mpv with yt-dlp.
I ran yt-dlp --list-subs and got a list of the subs available for the video. It's strange that it said "ocqvRdcM_pk has no subtitles". The video actually has auto-generated subtitles (English).
Auto-generated subtitles are called "automatic captions" in yt-dl and yt-dlp. Ask about yt-dlp there.
However, yt-dl lists a whole lot of available automatic captions, as does yt-dlp, but sorted by language name.
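The distinction shows up in the info dict that both tools build: uploaded tracks are stored under 'subtitles' and auto-generated ones under 'automatic_captions'. Below is a small sketch of preferring an uploaded track and falling back to an automatic one. It uses a hand-written dict rather than real extractor output, and the find_track helper is made up for illustration.

```python
# Hand-written stand-in for an extractor result; real dicts hold many more fields.
SAMPLE_INFO = {
    "id": "example",
    "subtitles": {},
    "automatic_captions": {
        "en": [{"ext": "vtt", "url": "https://example.invalid/en.vtt"}],
    },
}


def find_track(info, lang):
    """Return (kind, formats) for lang, preferring uploaded subtitles."""
    for kind in ("subtitles", "automatic_captions"):
        formats = (info.get(kind) or {}).get(lang)
        if formats:
            return kind, formats
    return None, []


kind, formats = find_track(SAMPLE_INFO, "en")
print(kind, [f["ext"] for f in formats])
```

A video can therefore report no subtitles while still offering automatic captions, which matches the message quoted in the question.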
By "yt-dl" you meant "youtube-dl", correct? If yes, then I'll go there; if not, please clarify your answer. You can give me a thumbs-up if it's a yes.
I see I am on the wrong GitHub repository (youtube-dl). I'm heading to yt-dlp.
Checklist
Question
For some YouTube videos, there are three options for subtitles: 1. [English] 2. [English - CC1] 3. [English - DTVCC1]. The first one is the right subtitle, and the other two seem quite confusing and full of mistakes.
When I used the following command to download subtitles, I found that the downloaded subtitle is from [English - CC1], not [English].
python -m youtube_dl --cookies ../www.youtube.com_cookies.txt --write-sub --sub-lang en -o "../videos/%(title)s.%(ext)s" --verbose --skip-download --convert-subs srt "https://www.youtube.com/watch?v=xbttvPugCoI"
The full debug message is
The output of the list-subs command is as follows
And my question is: what can I do to download the right version of the subtitles?