ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense
131.2k stars 9.93k forks

Filter YouTube livestream recordings which are not fully processed #26290

Open InternationalYam opened 4 years ago

InternationalYam commented 4 years ago

Description

Immediately after a livestream on YouTube finishes, if it is sufficiently long, the full recording is not universally available. Generally only the final 2 hours can be viewed, though the duration can be longer depending on the settings for the stream or shorter if the stream goes offline and back online again. Eventually, some processing occurs on YouTube's side and the full video is re-encoded as a single video rather than an m3u8 playlist. Often this processing is done in minutes, but in extreme cases I have seen it take over a week. I have heard that the full livestream can be viewed immediately on mobile, but (not owning any to test) I have not found any way to get youtube-dl to grab one of those versions.

If an example is needed for testing, https://www.youtube.com/watch?v=LCt3b5updPQ is not yet processed and it will probably take a while, but due to the nature of the issue any one link won't work forever. I can provide more examples as they are needed or they can easily be found browsing through recent YouTube livestreams. Every YouTube livestream for which the recording is published goes through this process, though only sufficiently long ones will be clipped.

For automated archival purposes, it is obviously desirable to skip downloading versions which are not a full recording of the stream. In particular, if --download-archive downloaded.txt is used, the incomplete recordings will still be silently downloaded and added to downloaded.txt, preventing future downloads from getting the full version. Hence a method for handling unprocessed streams (especially those for which the full recording is not yet available) is needed. Here are some approaches to this filtering inside youtube-dl that don't quite work:

  1. --match-filter "!is_live" does not work at all (though I think it used to?). It will exclude currently live streams but not those which are finished but not yet processed.
  2. --format "(bestvideo+bestaudio/best)[protocol!*=m3u8]" does not work because the video and/or audio feeds don't show up as m3u8 protocol for some reason; instead they generally return a null protocol. I am not sure if this is a bug.
  3. --format "(best)[protocol!*=m3u8]" does work, so long as you are fine with the reduced quality that results. In my case this is not a good option. The streams I am getting are primarily audio focused, so to conserve bandwidth and storage I am presently using --format "(bestvideo[height<=360]+bestaudio/best[height<=360])" in order to get the highest quality audio with just a passable video. If I switched from bestvideo[height<=360]+bestaudio to best in order to filter by protocol, the filesizes would be extremely large, and if I used height<=360 on best then the audio quality would generally be lower.
  4. Date-based filters can work, but only if a very long waiting period (e.g. 2 weeks) is used. This is not a good approach.
  5. --match-filter "duration>1" was a hack that previously worked perfectly, as long as you aren't worried about missing actual 1-second-long videos. For whatever reason, unprocessed streams always used to show up as 1s long. Unfortunately, recent changes in YouTube's pipeline seem to have broken this hack. In particular, after a stream ends, youtube-dl now sometimes finds an audio track which returns the proper full duration of the livestream. As above, this hack will work if you use --format best but will sometimes fail if you use --format "(bestvideo+bestaudio)". To be clear, if you actually do download that track, it will not be the full length, just the clipped (2 hour) length, but if you do a --get-duration or --write-info-json the full stream duration will be there. Note: the audio track issue described here is not consistent. It is transient and probably server dependent. Multiple nearly-simultaneous calls to --get-duration will return 1s or the full duration seemingly at random. As a result, it is very hard to provide a live example of this, but I have seen it about 10 times in the past week (out of ~100 livestream downloads). It definitely isn't a one-off occurrence, but I don't know how a developer could reproduce it without just trying a bunch of recently finished livestreams.

I tried pretty much everything I could think of (including a bunch of other things not listed) but nothing seems to work flawlessly right now. If there is some way within youtube-dl to implement this filtering and I have stupidly missed it, I will be ecstatic to learn how to do so.

It is possible to work around all of this, but only with a fair amount of work. Currently, my approach is to run a --get-id check for the playlist/channel in question. Then, looping over the returned IDs, I run a --write-info-json --skip-download command and parse the resulting JSON file to check whether the recording is safe to download (i.e. a single recording rather than a collection of fragments). Finally, if the video is identified as safe, it is downloaded and recorded to downloaded.txt. (You should do it this way rather than in just two passes, downloading all the JSON files first and then the video files, because the operation is time-sensitive; if the JSON files are all downloaded first, enough time can pass during the first few video downloads that the data has changed on YouTube's end.) Note that this is still not ideal. In addition to adding code complexity, it can still fail if, for example, the call to download the JSON and the call to download the actual recording go to different servers which are not synchronized. I have not seen it fail yet, but given enough time it surely will. Simply doing the filtering inside youtube-dl is obviously superior and should not be that difficult. This method is also slower, and since it requires more youtube-dl calls it might increase the risk of getting disconnected by YouTube's servers (though I'm not sure about that).
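The safety check in the workaround above can be sketched in Python. To be clear, this is a hypothetical illustration, not part of youtube-dl: the field names (`formats`, `protocol`) are those found in the JSON that --write-info-json emits, the function name is my own, and the heuristic simply treats any format served as an m3u8/HLS playlist (or with a missing protocol) as an unprocessed recording.

```python
def is_fully_processed(info):
    """Heuristic: consider a recording safe to download when every
    available format reports a concrete, non-HLS protocol. Unprocessed
    livestream recordings typically expose m3u8 playlists, or formats
    whose protocol is missing/None."""
    formats = info.get("formats") or []
    if not formats:
        # No formats at all: treat as unsafe rather than guess.
        return False
    for fmt in formats:
        protocol = fmt.get("protocol")
        if protocol is None or "m3u8" in protocol:
            return False
    return True

# Synthetic info dicts (not real YouTube data), just to show the shape:
processed = {"formats": [{"format_id": "22", "protocol": "https"}]}
unprocessed = {"formats": [{"format_id": "96", "protocol": "m3u8_native"}]}

print(is_fully_processed(processed))    # True
print(is_fully_processed(unprocessed))  # False
```

In the real loop, each ID from a --get-id pass would be fetched with --write-info-json --skip-download, the JSON parsed with json.load, and the actual download issued only when this check passes.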

Hence I'm requesting that one of these be implemented (ordered based on how easy I suspect they would be to accomplish):

(Aside: While I would be happy with either approach, filtering based on fragmentation or on completeness both have advantages. If download speed is a concern, the recording will download much faster after it is processed. The resulting filesize is also generally smaller. However, if the goal is to always have an up-to-date archive, then it is better to download complete livestream recordings even if they are fragmented, and perhaps re-encode them locally. I would certainly not complain if both these methods were available!)

InternationalYamAgain commented 4 years ago

A few updates:

  1. I forgot the login information for the account this was posted in. Or rather, I forgot the login for the corresponding email, and it was easier to create a new account than to try to figure it out. It should not happen again with this new account.

  2. In the time since I posted this, there have been further changes to YouTube's pipeline. Now the full video can be viewed in a regular desktop browser with almost no strings attached. The chat log remains unavailable until processing is done but in most other ways a casual observer would not notice a difference.

  3. The duration filter listed above now fails regularly (not every time, but often enough to be useless). However, I did find that filtering based on filesize (not filesize_approx) seems to work in every case I've tried so far. Processed videos will give a definite filesize, even if they are eventually merged from separate audio and video feeds. Unprocessed videos give a null/NA filesize. If you want to filter for these videos, for now that is my suggestion. That said, there could be false positives I'm not aware of.

  4. youtube-dl still consistently downloads just the final 2 hours (streamlink does as well, but that is not especially relevant). As the other linked issue says, because the full video is now visible in a browser normally, it is hard to see the 2 hour limit as anything other than a bug in youtube-dl. At the time I posted this, it was more of a feature request because the 2 hour limit also applied to browsers. In any case the two posts represent the same underlying problem, so they can be marked duplicates or handled however else is convenient. (I still think having an officially supported method for filtering these videos as an alternative to downloading the full video is a good idea, because downloading the fragmented videos is awfully slow and the quality is often worse, but as long as there's a working method to either filter unprocessed videos or grab the full recording I won't complain.)

  5. Everything on YouTube's end could still change again. A month or two ago there was a similar situation where the full video could be viewed in a browser, but in that case it only worked for me in Chrome and not in Firefox. I didn't get around to figuring out what was going on back then, and after about a week it went back to the clipped 2 hour archive. YouTube did announce publicly somewhere (I don't remember where) that they intended to do away with the 2 hour clipped recordings, and it's been a year or two since that announcement, so they will probably go away soon-ish, but the present state of affairs could be another trial run or it could be the new status quo.
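The filesize heuristic from update 3 above can be sketched the same way. Again, this is only an illustrative check against the info-JSON fields, not a youtube-dl feature: the field names (`formats`, `filesize`) are as produced by --write-info-json, and requiring a definite size on every format is my own assumption.

```python
def has_definite_filesize(info):
    """Heuristic from update 3: processed videos report a concrete
    `filesize` on their formats, while unprocessed livestream
    recordings report a null/missing filesize. Require at least one
    format, each with a definite size."""
    formats = info.get("formats") or []
    return bool(formats) and all(
        fmt.get("filesize") is not None for fmt in formats
    )

# Synthetic info dicts, just to show the shape:
print(has_definite_filesize({"formats": [{"filesize": 123456}]}))  # True
print(has_definite_filesize({"formats": [{"filesize": None}]}))    # False
```

A wrapper script could run this on each dumped JSON and skip the download (and the downloaded.txt entry) when it returns False, subject to the false-positive caveat above.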

glubsy commented 3 years ago

Thanks for reporting on this. I faced the same issue recently and was wondering why only the last 2 hours of a long live stream were downloaded by youtube-dl.