ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

ability to skip n-sig calculation #32687

Open wellsyw opened 10 months ago

wellsyw commented 10 months ago

Description

The n-sig calculation takes roughly 4-5 seconds each time for me, and it runs twice per video, so youtube-dl spends about 10 seconds of CPU time on each URL. It would be nice to have an option to skip the calculation when it is not strictly necessary or, even better, automatic detection of whether it is needed. For instance, I'm fairly sure that the --get-title, --get-duration and (possibly) -F options do not require solving the signature.

Below is a simple helper script that displays the channel IDs for YouTube URLs. It just spent 42 seconds getting the information for five videos. This is, of course, unacceptable performance.

#!/bin/sh
FORMAT='%(id)s %(channel_id)s %(uploader)s'

exec python youtube_dl/__main__.py \
    -o "$FORMAT" --get-filename \
    "$@"

Of course, making the calculation about 100 times faster would be preferred, but until that happens, the ability to skip it on demand would be an acceptable substitute.

dirkf commented 10 months ago

The problem is that the extractor code doesn't know whether it needs to calculate valid format URLs or not, since this is only known in the calling code (YoutubeDL.extract_info() in youtube_dl/YoutubeDL.py).

To achieve what you suggest with the current YT extraction scheme, we'd have to do one of these:

  1. define a method of YoutubeDL that the extractor code could call to determine whether valid format links are needed, or
  2. define a way of returning the format links as a continuation that is either evaluated by YoutubeDL.process_video_result() (say), or discarded, depending on whether valid format links are needed.
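As a rough illustration of option 2, the sketch below defers the expensive step behind an object that the caller either resolves or discards. Everything in it is hypothetical: `LazyFormatURL`, `solve_nsig`, `make_info` and this standalone `process_video_result` are stand-ins written for this example, not youtube-dl code, and the string reversal is only a placeholder for the real descrambling.

```python
# Sketch only: all names here are hypothetical, not part of youtube-dl.

calls = []  # records every time the expensive step actually runs


def solve_nsig(n):
    """Stand-in for the expensive n-sig descrambling step."""
    calls.append(n)
    return n[::-1]  # placeholder transformation, not the real algorithm


class LazyFormatURL(object):
    """Holds a URL template and defers the n-sig work until it is needed."""

    def __init__(self, template, n):
        self._template = template
        self._n = n
        self._resolved = None

    def resolve(self):
        # Compute at most once, and only when the caller asks for it.
        if self._resolved is None:
            self._resolved = self._template.format(n=solve_nsig(self._n))
        return self._resolved


def make_info():
    """Build a fake extractor result with one unresolved format URL."""
    lazy = LazyFormatURL('https://example.invalid/v?n={n}', 'xyz')
    return {'id': 'abc123', 'formats': [{'url': lazy}]}


def process_video_result(info, need_urls):
    """Caller-side step: evaluate the continuations or discard them."""
    for fmt in info['formats']:
        fmt['url'] = fmt['url'].resolve() if need_urls else None
    return info


# Metadata-only request: the expensive step never runs.
meta_only = process_video_result(make_info(), need_urls=False)
# Download request: the URL is resolved on demand.
full = process_video_result(make_info(), need_urls=True)
```

The point of the design is that the extractor never needs to know what the caller wants; it hands back unevaluated work, and only the calling code decides whether to pay for it.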

Or the extraction scheme could be modified so that a different YT client could be used, as in yt-dlp and as already used for age-gated videos. However, that client gets fewer formats, although they are unthrottled.

The n-sig processing is meant to cache results: when the same n-sig value is seen in a whole lot of formats for one video it should only be computed once.
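The caching idea can be sketched as a plain dictionary keyed on the player and the scrambled value. This is a minimal illustration under assumed names (`descramble_n`, `cached_nsig`, the `player-1` identifier), not the cache that youtube-dl actually implements, and `upper()` merely stands in for the real transformation.

```python
# Sketch only: function names and cache layout are hypothetical.

_nsig_cache = {}
work_done = []  # records each time the expensive function really runs


def descramble_n(player_id, n):
    """Stand-in for running the player JS function on one n value."""
    work_done.append((player_id, n))
    return n.upper()  # placeholder, not the real algorithm


def cached_nsig(player_id, n):
    """Return the descrambled value, computing it once per (player, n)."""
    key = (player_id, n)
    if key not in _nsig_cache:
        _nsig_cache[key] = descramble_n(player_id, n)
    return _nsig_cache[key]


# Four formats of one video, three of which share the same n value.
formats_n = ['abc', 'abc', 'abc', 'def']
results = [cached_nsig('player-1', n) for n in formats_n]
```

With this layout the expensive function runs once per distinct n value rather than once per format, so a video whose formats carry two distinct values would still pay for two computations.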

I tried this command:

$ time python -m youtube_dl -o '%(id)s %(channel_id)s %(uploader)s' test:YouTube --get-filename

On this machine, low-spec by today's standards, distro Py3.11 seems to be almost 3x as fast as the miniconda Py2.7 (on another machine, distro Py2.7 is much closer to PPA Py3.9). Disabling n-sig processing cuts execution time from 9s to 3s with Py3 and 26s to 5s with Py2.

If someone were to try profiling the current code (we did that when the n-sig processing was first implemented), it might indicate some unsuspected hog.
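For anyone who wants to try, a minimal profiling sketch with the standard-library `cProfile` and `pstats` modules might look like the following. The `extract` and `slow_step` functions are made-up stand-ins for the real extraction call; only the profiling calls themselves are meant literally.

```python
import cProfile
import io
import pstats


def slow_step():
    """Hypothetical hot spot standing in for the n-sig interpreter."""
    return sum(i * i for i in range(200000))


def extract():
    """Hypothetical stand-in for one extraction run."""
    return slow_step() + 1


profiler = cProfile.Profile()
profiler.enable()
extract()
profiler.disable()

# Sort by cumulative time so that callers of expensive code rise to the top.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
stats.sort_stats('cumulative').print_stats(10)
report = buf.getvalue()
print(report)
```

The same module can also be run from the command line against a script, for example `python -m cProfile -s cumulative youtube_dl/__main__.py ...` with the usual arguments, although I have not checked how well that works with this particular entry point.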

One known execution time driver is that, whenever YT changes its player JS, we have to fetch that, a 2MB download, in addition to the bloated page and/or API JSON. This is still going to be a small part of the total run time with a typical modern internet connection.

wellsyw commented 10 months ago

If I understood your response correctly, you are saying that 1) the code is decoupled enough that passing a command-line argument through to the relevant code would be somewhat difficult to implement, and 2) upgrading the Python version may or may not give a performance boost.

A third, hackish approach would be trivial to implement: an environment variable could easily be passed to the code, and achieve the same effect, I guess.
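A sketch of what that hack could look like is below. The variable name `YTDL_SKIP_NSIG` and the helper `nsig_enabled` are invented for this example; youtube-dl does not define either of them.

```python
import os


def nsig_enabled(environ=None):
    """Return False when the (hypothetical) YTDL_SKIP_NSIG variable is set.

    The mapping can be passed in explicitly to make the check testable;
    by default the real process environment is consulted.
    """
    environ = os.environ if environ is None else environ
    value = environ.get('YTDL_SKIP_NSIG', '')
    return value.strip().lower() not in ('1', 'true', 'yes')


if nsig_enabled():
    print('n-sig would be solved')
else:
    print('n-sig would be skipped')
```

Usage would then be along the lines of `YTDL_SKIP_NSIG=1 ./helper.sh URL...` for metadata-only runs, leaving the variable unset for actual downloads.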

For what it's worth, I have an Athlon II ("Rana" core) so it is not the newest thing around. Using youtube with the new 'polymer' interface is often unbearably slow, so I mostly use youtube-dl to make up for it.

Anyway, this is a digression, but my normal runtime for youtube-dl with --simulate is about 13 seconds, so the n-sig calculation takes ~70% of the runtime. I also found some videos where the runtime is much longer, 25-30 seconds or more per video. Judging from a debug print or two, that is not caused by the n-sig processing but by something that happens between the two instances of n-sig solving. I'll just file a new bug for that.

dirkf commented 9 months ago

Please try #32695, or a new nightly build that incorporates it after I merge it.

wellsyw commented 9 months ago

Well, I'll be. Three seconds or a bit under.

But, er, all downloads seem to be throttled now?

I spotted a &n=%3Cfunction+inner+at+0x808a01d70%3E& in the url.
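The thread does not say what caused this, but the value looks like the text form of a Python function object that was put into the query string without being called. The sketch below reproduces that kind of mistake under that assumption; `make_solver` and `inner` are illustrative names, and the reversal is a placeholder for the real descrambling.

```python
from urllib.parse import urlencode


def make_solver(n):
    """Return a deferred computation for n (hypothetical helper)."""
    def inner():
        return n[::-1]  # placeholder for the real descrambling
    return inner


solver = make_solver('abc')

# Mistake: the function object itself is stringified into the URL.
broken = urlencode({'n': solver})

# Intended: call the function first, then encode its result.
fixed = urlencode({'n': solver()})

print(broken)
print(fixed)
```

On Python 3 the broken value includes the qualified name (`make_solver.<locals>.inner`), whereas Python 2 prints only `inner`, which matches the shape of the string quoted above.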

wellsyw commented 2 months ago

FWIW, I switched to Python 3 and the total runtime (for one video) went down by a second and a half. That's rather significant, I guess. 30 seconds for five videos.

dirkf commented 2 months ago

Although it doesn't affect your use case of collecting IDs, a format URL whose n-sig has not been descrambled now gets a 403 response instead of just being throttled.