Open wellsyw opened 10 months ago
The problem is that the extractor code doesn't know whether it needs to calculate valid format URLs or not, since this is only known in the calling code (YoutubeDL.extract_info() in youtube_dl/YoutubeDL.py).
To achieve what you suggest with the current YT extraction scheme, we'd have to do one of these:
- provide a method in YoutubeDL that the extractor code could call to determine whether valid format links are needed, or
- defer the format link processing so that it is either done in YoutubeDL.process_video_result() (say), or discarded, depending on whether valid format links are needed.
Or the extraction scheme could be modified so that a different YT client could be used, as in yt-dlp and as already used for age-gated videos. But that client gets fewer, if unthrottled, formats.
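To illustrate the idea of deferring the format link work to the calling code, here is a minimal sketch. It is not youtube-dl code: `descramble_n`, `extract_formats`, `process_formats` and the `resolve_url` key are all hypothetical names, and the string reversal merely stands in for the real n-sig computation.

```python
# Hypothetical sketch: none of these names exist in youtube-dl.
calls = []  # records every (expensive) descrambling call, for demonstration


def descramble_n(n):
    """Stand-in for the expensive n-sig computation."""
    calls.append(n)
    return n[::-1]


def extract_formats(raw_formats):
    """Extractor side: attach a deferred URL builder instead of a final URL."""
    formats = []
    for raw in raw_formats:
        formats.append({
            'format_id': raw['format_id'],
            # The default argument binds the current raw dict to the closure.
            'resolve_url': lambda raw=raw: '%s?n=%s' % (
                raw['base'], descramble_n(raw['n'])),
        })
    return formats


def process_formats(formats, need_urls):
    """Caller side: resolve the deferred URLs only when they are needed."""
    for fmt in formats:
        resolve = fmt.pop('resolve_url')
        if need_urls:
            fmt['url'] = resolve()
    return formats
```

With `need_urls=False` (for example, when only metadata such as the channel ID is wanted) the expensive function is never called; with `need_urls=True` each format gets its final URL.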
The n-sig processing is meant to cache results: when the same n-sig value is seen in a whole lot of formats for one video it should only be computed once.
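The caching behaviour described here can be sketched as a simple memoising wrapper. This is an illustration only, not the actual youtube-dl implementation: the class name `NSigCache`, the `misses` counter and the choice of keying on a (player ID, n value) pair are assumptions.

```python
class NSigCache(object):
    """Hypothetical memoising wrapper around an expensive n-sig function."""

    def __init__(self, compute):
        self._compute = compute  # the expensive function: n -> descrambled n
        self._cache = {}
        self.misses = 0          # how many times compute() actually ran

    def get(self, player_id, n):
        # Assumption: a result is only reusable for the same player JS version.
        key = (player_id, n)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compute(n)
        return self._cache[key]
```

If twenty formats of one video carry the same n value, `get()` runs the expensive function once and serves the other nineteen lookups from the dictionary.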
I tried this command:
$ time python -m youtube_dl -o '%(id)s %(channel_id)s %(uploader)s' test:YouTube --get-filename
On this machine, low-spec by today's standards, distro Py3.11 seems to be almost 3x as fast as the miniconda Py2.7 (on another machine, distro Py2.7 is much closer to PPA Py3.9). Disabling n-sig processing cuts execution time from 9s to 3s with Py3 and 26s to 5s with Py2.
If someone were to try profiling the current code (we did that when the n-sig processing was first implemented), it might indicate some unsuspected hog.
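For anyone who wants to try that, a minimal profiling sketch using the standard library's cProfile and pstats modules (Python 3) follows. `profile_call` and `slow_stand_in` are hypothetical helpers; the stand-in workload would be replaced by whatever youtube-dl entry point is being investigated.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the ten most expensive entries.
    stats.sort_stats('cumulative').print_stats(10)
    return result, stream.getvalue()


def slow_stand_in(count):
    """Placeholder workload standing in for the code under investigation."""
    return sum(i * i for i in range(count))


if __name__ == '__main__':
    _, report = profile_call(slow_stand_in, 100000)
    print(report)
```

On Python 3.7 or later, the whole program can also be profiled from the shell with `python -m cProfile -o out.prof -m youtube_dl ...`, and the resulting file inspected with pstats.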
One known execution time driver is that, whenever YT changes its player JS, we have to fetch that, a 2MB download, in addition to the bloated page and/or API JSON. This is still going to be a small part of the total run time with a typical modern internet connection.
If I understood your response correctly, you are saying that: 1) the code is decoupled enough that passing a command-line argument to the relevant code would be somewhat difficult to implement, and 2) upgrading the Python version may or may not give a performance boost.
A third, hackish approach would be trivial to implement: an environment variable could easily be passed to the code, and achieve the same effect, I guess.
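A rough sketch of what that hack might look like is below. The variable name `YTDL_SKIP_NSIG` and the function `nsig_disabled` are made up for illustration; youtube-dl does not read such a variable.

```python
import os


def nsig_disabled(environ=None):
    """Return True if the (hypothetical) YTDL_SKIP_NSIG variable is set."""
    environ = os.environ if environ is None else environ
    value = environ.get('YTDL_SKIP_NSIG', '')
    # Treat only a few explicit values as "on"; anything else means "off".
    return value.strip().lower() in ('1', 'true', 'yes')
```

The n-sig code would then check `nsig_disabled()` before doing any work, and the variable would be set for a single run with something like `YTDL_SKIP_NSIG=1 youtube-dl --simulate URL`.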
For what it's worth, I have an Athlon II ("Rana" core) so it is not the newest thing around. Using youtube with the new 'polymer' interface is often unbearably slow, so I mostly use youtube-dl to make up for it.
Anyway, this is digressing, but the normal runtime for youtube-dl with --simulate for me is about 13 seconds (so n-sig calculation takes ~70% of runtime). I also found some videos where the runtime is much longer, 25-30 seconds or more per video, and this is not caused by the n-sig processing but by something that happens between the two instances of n-sig solving, judging from a debug print or two. I'll just file a new bug for that.
Please try #32695, or a new nightly build that incorporates it after I merge it.
Well, I'll be. Three seconds or a bit under.
But, er, all downloads seem to be throttled now?
I spotted a &n=%3Cfunction+inner+at+0x808a01d70%3E& in the URL.
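URL-decoded, that value reads `<function inner at 0x808a01d70>`, which looks like the repr of a Python function. That suggests, though it is only a guess from the outside, that a function object rather than its return value was substituted into the URL. A small check for this symptom could look like the sketch below (Python 3; `n_param_looks_broken` is a hypothetical helper, not part of youtube-dl).

```python
from urllib.parse import parse_qs, urlparse


def n_param_looks_broken(url):
    """Return True if the URL's n parameter decodes to a function repr."""
    # parse_qs percent-decodes values and turns '+' back into spaces.
    values = parse_qs(urlparse(url).query).get('n', [])
    return any(value.startswith('<function') for value in values)
```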
FWIW, I switched to Python 3 and the total runtime (for one video) went down by a second and a half. That's rather significant, I guess. 30 seconds for five videos.
Although it doesn't affect your use case of collecting IDs, un-descrambled n-sig now gives 403 instead of just throttling.
Description
Each n-sig calculation takes roughly 4-5 seconds for me, and it runs twice for each video, so solving it makes youtube-dl spend about 10 seconds of CPU time for each URL. It would be nice to have an option to skip the calculation if it is not strictly necessary, or even better, automatic detection of its necessity. For instance, I'm fairly sure the --get-title, --get-duration, and (possibly) -F options do not require solving the signature.
I have below a simple helper script that displays the channel ids for youtube urls, and it just spent 42 seconds to get the information for five videos. This is, of course, unacceptable performance.
Of course, making the calculation about 100 times faster would be preferred, but until that happens, the ability to skip it on demand would be an acceptable substitute.