[Crunchyroll?] Only extract requested subtitles

wiiaboo commented 9 years ago

Problem

Using --sub-lang to request one or two subtitles from Crunchyroll doesn't just extract the requested subtitles, but instead extracts all of them, leading to big delays before starting the stream, whether you use --all-subs or just --sub-lang enUS. In the case of sites where the subs just point to a certain URL, the extraction seems faster, so it's probably more of a problem for sites like Crunchyroll where you extract the full subtitles.

Solution 1

At least for sites like Crunchyroll, just extract the requested languages.

Solution 2

Add an option that just extracts the requested languages?

I should probably also mention that this is mostly useful when you want to stream the resulting URL, like through mpv. When you're just using youtube-dl directly to download the video, the time spent extracting the subs is probably not an issue.

remitamine commented 9 years ago

I propose two solutions for this:

1. Pass the requested subtitles to the extractor (I think this isn't possible with the current youtube-dl code, because the only information passed to the extractor is the URL).
2. Change the extractor to return the subtitle URLs and, in process_subtitles, detect whether they come from Crunchyroll; if so, process them the same way the Crunchyroll extractor does now, with the difference that the YoutubeDL object knows subtitleslangs, so it can fetch only the requested languages.
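The second proposal amounts to deferring the expensive work until the requested languages are known. A minimal sketch of that idea follows, assuming the extractor can hand back one cheap callable per language instead of the fully processed subtitles. Every name here (`select_subtitles`, `make_fetcher`, the language codes) is hypothetical and is not part of youtube-dl's code.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type: a zero-argument callable that performs the expensive
# download (and, for Crunchyroll, decryption) only when it is called.
SubtitleFetcher = Callable[[], str]


def select_subtitles(available: Dict[str, SubtitleFetcher],
                     requested: Iterable[str]) -> Dict[str, str]:
    """Resolve only the requested languages; skip the ones not offered."""
    result = {}
    for lang in requested:
        fetch = available.get(lang)
        if fetch is None:
            continue  # requested language is not available for this video
        result[lang] = fetch()  # the expensive step runs here, once per language
    return result


def make_fetcher(lang: str, log: List[str]) -> SubtitleFetcher:
    """Build a stand-in fetcher that records when it is actually invoked."""
    def fetch() -> str:
        log.append(lang)
        return "subtitle data for " + lang
    return fetch


fetch_log: List[str] = []
available = {lang: make_fetcher(lang, fetch_log)
             for lang in ("enUS", "esLA", "ptBR", "frFR")}  # illustrative codes

selected = select_subtitles(available, ["enUS"])
print(selected)   # only the requested language was resolved
print(fetch_log)  # shows which fetchers actually ran
```

With this shape, listing the available languages stays cheap, and the cost of extraction scales with the number of requested languages rather than the number offered.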

dstftw commented 9 years ago

@remitamine both are flawed, as is the current approach. The reasonable solution would be customizable extraction behavior (for Crunchyroll in particular, subtitle decryption) that would be used by the subtitles extractor or even a postprocessor.

fstirlitz commented 9 years ago

I had a similar problem while writing #6144. I ended up solving it with a few kludges to plug the downloader infrastructure into subtitle downloading (commit acbc6d38660092e90c4ab36110b30355d26c4363), but I'm not particularly proud of it.

wiiaboo commented 9 years ago

Seems to be an issue not just with subtitles but with resolutions too. At least on my connection, it takes half-a-dozen seconds for each "media info" page to download, even if I just request one resolution.

humitos commented 9 years ago

I'm having a similar problem: youtube-dl doesn't download just the requested subtitles. Take a look at this example run:

$ youtube-dl --verbose --sub-lang "en,es" http://www.ted.com/talks/lang/es/john_hodgman_s_brief_digression
[debug] System config: []
[debug] User config: []
[debug] Command-line args: [u'--restrict-filenames', u'--retries', u'50', u'--continue', u'--verbose', u'--sub-lang', u'en,es', u'http://www.ted.com/talks/lang/es/john_hodgman_s_brief_digression']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2015.08.09
[debug] Git HEAD: 9f3da13
[debug] Python version 2.7.6 - Linux-3.13.0-57-generic-x86_64-with-Ubuntu-14.04-trusty
[debug] exe versions: avconv 9.18-6, avprobe 9.18-6
[debug] Proxy map: {}
[ted] john_hodgman_s_brief_digression: Downloading webpage
[ted] john_hodgman_s_brief_digression: Extracting information
[ted] john_hodgman_s_brief_digression: Downloading m3u8 information
WARNING: Your copy of avconv is outdated and unable to properly mux separate video and audio files, youtube-dl will download single file media. Update avconv to version 10-0 or newer to fix this.
[debug] Invoking downloader on u'http://download.ted.com/talks/JohnHodgman_2008-480p.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22'
[download] Resuming download at byte 1865239
[download] Destination: John_Hodgman_-_Una_breve_digresi_n_sobre_asuntos_del_tiempo_perdido-374.mp4
[download]   2.9% of 110.46MiB at 98.36KiB/s ETA 18:36^C
ERROR: Interrupted by user
$

But if I list the subtitles, they appear:

$ youtube-dl --verbose --list-subs http://www.ted.com/talks/lang/es/john_hodgman_s_brief_digression
[debug] System config: []
[debug] User config: []
[debug] Command-line args: [u'--restrict-filenames', u'--retries', u'50', u'--continue', u'--verbose', u'--list-subs', u'http://www.ted.com/talks/lang/es/john_hodgman_s_brief_digression']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2015.08.09
[debug] Git HEAD: 9f3da13
[debug] Python version 2.7.6 - Linux-3.13.0-57-generic-x86_64-with-Ubuntu-14.04-trusty
[debug] exe versions: avconv 9.18-6, avprobe 9.18-6
[debug] Proxy map: {}
[ted] john_hodgman_s_brief_digression: Downloading webpage
[ted] john_hodgman_s_brief_digression: Extracting information
[ted] john_hodgman_s_brief_digression: Downloading m3u8 information
Available subtitles for 374:
Language formats
el       srt, ted
en       srt, ted
it       srt, ted
ar       srt, ted
pt-br    srt, ted
cs       srt, ted
es       srt, ted
ru       srt, ted
nl       srt, ted
pt       srt, ted
zh-tw    srt, ted
tr       srt, ted
zh-cn    srt, ted
ro       srt, ted
pl       srt, ted
fr       srt, ted
bg       srt, ted
hr       srt, ted
de       srt, ted
hu       srt, ted
ja       srt, ted
he       srt, ted
sr       srt, ted
ko       srt, ted
sv       srt, ted
$

Thanks!

wiiaboo commented 9 years ago

You need --write-sub in addition to --sub-lang. --sub-lang just selects the ones to download. --all-subs doesn't need --write-sub.
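For anyone embedding youtube-dl from Python rather than using the command line, the same pairing applies. The sketch below assumes the `writesubtitles`, `subtitleslangs`, and `skip_download` option keys described in the `YoutubeDL` class docstring; `subtitle_options` is a hypothetical helper introduced only for illustration, and the actual download call is left in comments because it needs youtube-dl installed and network access.

```python
def subtitle_options(languages, write_subs=True):
    """Build an options dict that both selects and writes subtitles."""
    return {
        'writesubtitles': write_subs,        # counterpart of --write-sub
        'subtitleslangs': list(languages),   # counterpart of --sub-lang
        'skip_download': True,               # fetch subtitles only, not the video
    }


opts = subtitle_options(['en', 'es'])
print(opts)

# With youtube-dl installed, the dict would be used roughly like this:
#   import youtube_dl
#   with youtube_dl.YoutubeDL(opts) as ydl:
#       ydl.download(['http://www.ted.com/talks/lang/es/john_hodgman_s_brief_digression'])
```

Setting only `subtitleslangs` without `writesubtitles` mirrors the situation in the log above: languages are selected, but nothing is written.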

humitos commented 9 years ago

@wiiaboo thanks a lot! It worked! I think it shouldn't be necessary to add that option; it doesn't make sense to me :)

wiiaboo commented 9 years ago

There's another way to associate the language names with the codes: by reading the page's language selection. Example:

languages = {k: v for (v, k) in re.findall(r';([a-z]{2}[A-Z]{2})[^ ]+ data-language="([^"]+)', webpage)}

Is there any way for _get_subtitles or _extract_subtitles to know which languages were requested?
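To show what the dictionary comprehension above produces, here is a small self-contained run of the same regular expression. The HTML in `SAMPLE` is invented for the illustration and is only an assumption about what the language selection markup might look like; it is not copied from Crunchyroll. The helper name `language_names_to_codes` is likewise hypothetical.

```python
import re

# Same pattern as in the comment above: a ';' followed by a code such as
# "enUS", then the data-language attribute holding the display name.
LANG_RE = re.compile(r';([a-z]{2}[A-Z]{2})[^ ]+ data-language="([^"]+)')


def language_names_to_codes(webpage):
    """Map each display name found in the page to its language code."""
    return {name: code for (code, name) in LANG_RE.findall(webpage)}


# Invented sample markup, for illustration only.
SAMPLE = (
    '<a href="/watch/1?ssid=10;enUS" data-language="English (US)">English (US)</a>\n'
    '<a href="/watch/1?ssid=11;deDE" data-language="Deutsch">Deutsch</a>\n'
)

languages = language_names_to_codes(SAMPLE)
print(languages)
```

Note the direction of the mapping: the keys are the display names and the values are the codes, because the comprehension swaps the two captured groups.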
