ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense
131.4k stars 9.96k forks

Unable to download the YouTube channel member video #30987

Open fairfaxhshw opened 2 years ago

fairfaxhshw commented 2 years ago

Checklist

Verbose log

C:\Users\OWNER\Music\youtube-dl>youtube-dl -v --cookies cookies.txt -f best --external-downloader aria2c --external-downloader-args "-j 16 -x 16 -s 16 -k 1M" https://www.youtube.com/watch?v=MCy7s-c5xAw
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '--cookies', 'cookies.txt', '-f', 'best', '--external-downloader', 'aria2c', '--external-downloader-args', '-j 16 -x 16 -s 16 -k 1M', 'https://www.youtube.com/watch?v=MCy7s-c5xAw']
[debug] Encodings: locale cp949, fs mbcs, out cp949, pref cp949
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.4.4 (CPython) - Windows-10-10.0.19041
[debug] exe versions: ffmpeg 4.4-full_build-www.gyan.dev, ffprobe 4.4-full_build-www.gyan.dev
[debug] Proxy map: {}
[youtube] MCy7s-c5xAw: Downloading webpage
WARNING: [youtube] MCy7s-c5xAw: Failed to parse JSON Unterminated string starting at: line 1 column 60361 (char 60360)
[youtube] MCy7s-c5xAw: Downloading API JSON
ERROR: This video is available to this channel's members on level: TEAM AAAA (or any higher level). Join this channel to get access to members-only content and other exclusive perks.
Traceback (most recent call last):
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\YoutubeDL.py", line 815, in wrapper
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\YoutubeDL.py", line 836, in __extract_info
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extractor\common.py", line 534, in extract
  File "C:\Users\dst\AppData\Roaming\Build archive\youtube-dl\ytdl-org\tmpupik7c6w\build\youtube_dl\extractor\youtube.py", line 1731, in _real_extract
youtube_dl.utils.ExtractorError: This video is available to this channel's members on level: TEAM AAAA (or any higher level). Join this channel to get access to members-only content and other exclusive perks.

Description

I'm trying to download a YouTube members-only video and am unable to. I'm currently a member of the channel and can watch the video on the website. I exported the most recent cookies and passed the file to the script.

I'm able to download other videos that don't require membership with youtube-dl, and I was able to download this members-only video about two weeks ago without a problem.

dirkf commented 2 years ago

Your output is the same as that for non-members except for this:

WARNING: [youtube] MCy7s-c5xAw: Failed to parse JSON Unterminated string starting at: line 1 column 60361 (char 60360)

So the cookie is having some effect. Apparently the cookie-driven page has a different structure such that the embedded hydration JSON (probably) isn't correctly extracted. The extractor then POSTs a query to https://www.youtube.com/youtubei/v1/player, which gives JSON containing the logged error message.
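For reference, a minimal sketch of what that fallback call looks like. The endpoint and the context/client payload shape follow youtube-dl's `_call_api`/`_DEFAULT_API_DATA`; the `clientVersion` string here is illustrative and may be stale:

```python
import json
from urllib.request import Request

def build_player_request(video_id):
    # Payload shape as used by youtube-dl's _call_api; version string illustrative.
    payload = {
        'context': {'client': {'clientName': 'WEB', 'clientVersion': '2.20201021.03.00'}},
        'videoId': video_id,
    }
    return Request(
        'https://www.youtube.com/youtubei/v1/player',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'})

req = build_player_request('MCy7s-c5xAw')
print(req.get_method())  # POST (urllib infers POST when data is set)
```

Since this request carries no cookies or auth headers, YouTube answers it with the members-only error seen in the log above.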

If you use --write-pages and attach the resulting files, it should be possible to analyse the page and find the problem.

See also #29928. Other historical issues seem to have been due to the cookie file being incorrect or not specified correctly.

coletdjnz commented 2 years ago

this is fixed in yt-dlp by https://github.com/yt-dlp/yt-dlp/commit/ee27297f82ccbd702ccd4721d1d3c9d67bbe187e

test video: https://www.youtube.com/watch?v=tjjjtzRLHvA

coletdjnz commented 2 years ago

So the cookie is having some effect. Apparently the cookie-driven page has a different structure such that the embedded hydration JSON (probably) isn't correctly extracted. The extractor then POSTs a query to https://www.youtube.com/youtubei/v1/player, which gives JSON containing the logged error message.

Yeah, youtube-dl has no auth support with Innertube, hence this error (it was one of the early things fixed in yt-dlp). This player request itself is also lacking many parameters, so it doesn't always work; youtube-dl is therefore reliant on extracting the data from the webpage (which is failing here).

dirkf commented 2 years ago

this is fixed in yt-dlp by yt-dlp/yt-dlp@ee27297

Well, I could run the test video, but what are these unparseable sham JSON strings?

dirkf commented 2 years ago

this is fixed in yt-dlp by yt-dlp/yt-dlp@ee27297

Really, that seems like a bit of a hack, unless there is a use case for fatal=True, lenient=True. Don't we want to know when the extraction is going wrong?

Well, I could run the test video, but what are these unparseable sham JSON strings?

I have run the test video. Aha! The test video's title is the rather antagonistic ハッシュタグ無し };if\n window.ytcsi (apparently "no hashtag};..."), which breaks the pattern used to extract the YT initial data, as \n doesn't match .+? without re.DOTALL. Also, we're looking for a block that's terminated by ; var meta = whereas YT is now setting var head = first. The fallback pattern then returns an initial substring of the JSON that crashes the parser.

The initial hydration data may also contain a potentially confusing chunk of JS as the value of its attestation.playerAttestationRenderer.interpreterSafeScript.botguardData.privateDoNotAccessOrElseSafeScriptWrappedValue member. As it's minified with fewer than 3339 variables, its variables are at most 2 characters.

Finally, yt-dl has largely identical methods YoutubeIE._extract_yt_initial_variable() and YoutubeBaseInfoExtractor._extract_yt_initial_data() that should be unified as YoutubeBaseInfoExtractor._extract_yt_initial_variable() (yt-dlp has YoutubeBaseInfoExtractor.extract_yt_initial_data(), but it's apparently not used outside the YT extractor, so the same could apply).

If we strip the trailing ; from the main pattern and make this _YT_INITIAL_BOUNDARY_RE

r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'

the JSON can be correctly extracted.
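To illustrate on a synthetic fragment (hypothetical page text, not the real dump):

```python
import re
import json

# Hypothetical fragment: the title contains '};', a sham JSON terminator
# inside a string value, and the page now follows the JSON with 'var head'.
page = 'ytInitialData = {"title": "no hashtag};if window.ytcsi"}; var head = {};'

# Old pattern: the lazy capture stops at the sham '};' inside the string.
old = re.search(r'ytInitialData\s*=\s*({.+?})\s*;', page).group(1)
# json.loads(old) would fail: 'Unterminated string starting at ...'

# New boundary: the capture must be followed by a real statement boundary.
boundary = r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'
new = re.search(r'(?s)ytInitialData\s*=\s*({.+?})\s*' + boundary, page).group(1)
print(json.loads(new)['title'])  # no hashtag};if window.ytcsi
```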

--- old/youtube-dl/youtube_dl/extractor/youtube.py
+++ new/youtube-dl/youtube_dl/extractor/youtube.py
@@ -284,7 +284,7 @@

     _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
     _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'
+    _YT_INITIAL_BOUNDARY_RE = r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'

     def _call_api(self, ep, query, video_id, fatal=True):
         data = self._DEFAULT_API_DATA.copy()
@@ -297,12 +297,10 @@
             headers={'content-type': 'application/json'},
             query={'key': 'AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8'})

-    def _extract_yt_initial_data(self, video_id, webpage):
-        return self._parse_json(
-            self._search_regex(
-                (r'%s\s*%s' % (self._YT_INITIAL_DATA_RE, self._YT_INITIAL_BOUNDARY_RE),
-                 self._YT_INITIAL_DATA_RE), webpage, 'yt initial data'),
-            video_id)
+    def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
+        return self._parse_json(self._search_regex(
+            (r'(?s)%s\s*%s' % (regex.rstrip(';'), self._YT_INITIAL_BOUNDARY_RE),
+             regex), webpage, name, default='{}'), video_id, fatal=False)

     def _extract_ytcfg(self, video_id, webpage):
         return self._parse_json(
@@ -1654,11 +1652,6 @@
             })
         return chapters

-    def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
-        return self._parse_json(self._search_regex(
-            (r'%s\s*%s' % (regex, self._YT_INITIAL_BOUNDARY_RE),
-             regex), webpage, name, default='{}'), video_id, fatal=False)
-
     def _real_extract(self, url):
         url, smuggled_data = unsmuggle_url(url, {})
         video_id = self._match_id(url)
@@ -3026,7 +3019,7 @@
                 return self.url_result(video_id, ie=YoutubeIE.ie_key(), video_id=video_id)
             self.to_screen('Downloading playlist %s - add --no-playlist to just download video %s' % (playlist_id, video_id))
         webpage = self._download_webpage(url, item_id)
-        data = self._extract_yt_initial_data(item_id, webpage)
+        data = self._extract_yt_initial_variable(webpage, self._YT_INITIAL_DATA_RE, video_id, 'yt initial data')
         tabs = try_get(
             data, lambda x: x['contents']['twoColumnBrowseResultsRenderer']['tabs'], list)
         if tabs:

And the test video tjjjtzRLHvA:

$ python -m youtube_dl -v -F --ignore-config 'https://www.youtube.com/watch?v=tjjjtzRLHvA'

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'--ignore-config', u'https://www.youtube.com/watch?v=tjjjtzRLHvA']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 04fd3289d
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[youtube] tjjjtzRLHvA: Downloading webpage
[debug] [youtube] Decrypted nsig MZKNNaj5qnOtL1kDxc-q => WhWpMgo90a-uUQ
[debug] [youtube] Decrypted nsig lxmrUIylnXPO25AvJzZk => lRCxh6n1geqddw
[info] Available formats for tjjjtzRLHvA:
format code  extension  resolution note
249          webm       audio only tiny   41k , webm_dash container, opus @ 41k (48000Hz), 27.83KiB
250          webm       audio only tiny   42k , webm_dash container, opus @ 42k (48000Hz), 28.56KiB
251          webm       audio only tiny   84k , webm_dash container, opus @ 84k (48000Hz), 56.94KiB
140          m4a        audio only tiny  130k , m4a_dash container, mp4a.40.2@130k (44100Hz), 88.41KiB
160          mp4        82x144     144p   20k , mp4_dash container, avc1.4d400b@  20k, 30fps, video only, 14.04KiB
133          mp4        136x240    144p   40k , mp4_dash container, avc1.4d400c@  40k, 30fps, video only, 26.99KiB
278          webm       144x256    144p   45k , webm_dash container, vp9@  45k, 30fps, video only, 30.65KiB
242          webm       240x426    240p   58k , webm_dash container, vp9@  58k, 30fps, video only, 39.12KiB
134          mp4        202x360    240p   75k , mp4_dash container, avc1.4d400d@  75k, 30fps, video only, 50.54KiB
135          mp4        270x480    240p  143k , mp4_dash container, avc1.4d4015@ 143k, 30fps, video only, 96.55KiB
243          webm       360x640    360p  115k , webm_dash container, vp9@ 115k, 30fps, video only, 77.29KiB
136          mp4        406x720    360p  305k , mp4_dash container, avc1.64001e@ 305k, 30fps, video only, 205.08KiB
244          webm       480x854    480p  210k , webm_dash container, vp9@ 210k, 30fps, video only, 141.36KiB
137          mp4        608x1080   480p  610k , mp4_dash container, avc1.64001f@ 610k, 30fps, video only, 410.21KiB
247          webm       720x1280   720p  549k , webm_dash container, vp9@ 549k, 30fps, video only, 368.78KiB
18           mp4        360x640    360p  426k , avc1.42001E, 30fps, mp4a.40.2 (48000Hz), 288.73KiB
22           mp4        406x720    360p  435k , avc1.64001F, 30fps, mp4a.40.2 (44100Hz) (best)
$
coletdjnz commented 2 years ago

@pukkandan (since you were the one that wrote it)

jim60105 commented 2 years ago

ytarchive had a similar issue a few days ago, FYI https://github.com/Kethsar/ytarchive/issues/93#issuecomment-1140275153

fairfaxhshw commented 2 years ago

Alright, I just ran with "--write-pages" and am attaching the resulting files. I was unable to attach the dump files directly, so I have attached them as a compressed file.

pukkandan commented 2 years ago

Maybe lenient is not a very good keyword. What it actually does is parse the JSON until an error is reached. In other words, it can parse JSON content embedded in a larger text (like {...}<..>).

Originally, I attempted to fix this issue with just a regex. But since Python regex does not support recursive groups or even possessive quantifiers, it is impossible to write a foolproof regex to capture JSON without creating catastrophic backtracking. E.g. r'ytInitialPlayerResponse\s*=\s*({(?:"(?:\\"|[^"])+"|[^"])+});' works, but hangs indefinitely if the pattern is not found on the page.

Actually, this is not the first time I have encountered this issue. The same problem existed when trying to isolate {...} code blocks for jsinterp; I had written JSInterpreter._separate_at_paren for this reason. So I could add quoting support to it (and move it to utils) to address this use case. (Note that the regex must be changed to greedy, since we can handle over-capturing but not under-capturing.)
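The idea can be shown with a minimal quote-aware brace matcher (a simplified standalone sketch, not yt-dlp's actual `_separate_at_paren`):

```python
def split_at_closing_brace(expr):
    """Return (inside, rest) for text that starts just after an opening '{'.

    Braces inside quoted strings (including escaped quotes) are ignored.
    """
    depth, in_quote, escaping = 1, None, False
    for i, ch in enumerate(expr):
        if in_quote:
            if escaping:
                escaping = False
            elif ch == '\\':
                escaping = True
            elif ch == in_quote:
                in_quote = None
        elif ch in '\'"':
            in_quote = ch
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return expr[:i], expr[i + 1:]
    raise ValueError('unbalanced braces')

# The sham '};' inside the string value is correctly skipped:
inside, rest = split_at_closing_brace('"title": "sham };"}; var head')
print(inside)  # "title": "sham };"
```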

diff --git a/yt_dlp/extractor/common.py b/yt_dlp/extractor/common.py
index b24599d5f..7b74a4b64 100644
--- a/yt_dlp/extractor/common.py
+++ b/yt_dlp/extractor/common.py
@@ -1034,8 +1034,13 @@ def _download_json(
         return res if res is False else res[0]

     def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
-        if transform_source:
-            json_string = transform_source(json_string)
+        try:
+            if transform_source:
+                json_string = transform_source(json_string)
+        except ExtractorError as e:
+            if not fatal:
+                self.report_warning(f'{video_id}: Failed to transform JSON: {e}')
+            raise
         try:
             return json.loads(json_string, strict=False)
         except ValueError as ve:
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index 69b58088d..bf02f3d88 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -397,8 +397,8 @@ def _check_login_required(self):
         if self._LOGIN_REQUIRED and not self._cookies_passed:
             self.raise_login_required('Login details are needed to download this content', method='cookies')

-    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
+    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+})\s*;'
+    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+})\s*;'
     _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'

     def _get_default_ytcfg(self, client='web'):
@@ -2743,9 +2743,10 @@ def _extract_chapters(self, chapter_list, chapter_time, chapter_title, duration)
         return chapters

     def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
-        return self._parse_json(self._search_regex(
-            (fr'{regex}\s*{self._YT_INITIAL_BOUNDARY_RE}',
-             regex), webpage, name, default='{}'), video_id, fatal=False)
+        return self._parse_json(
+            self._search_regex(regex, webpage, name, default='{}'),
+            video_id, fatal=False,
+            transform_source=lambda x: '{%s}' % JSInterpreter._separate_at_paren(x, '}')[0])

     def _extract_comment(self, comment_renderer, parent=None):
         comment_id = comment_renderer.get('commentId')
diff --git a/yt_dlp/jsinterp.py b/yt_dlp/jsinterp.py
index 70857b798..56229cd99 100644
--- a/yt_dlp/jsinterp.py
+++ b/yt_dlp/jsinterp.py
@@ -24,6 +24,7 @@
 _NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'

 _MATCHING_PARENS = dict(zip('({[', ')}]'))
+_QUOTES = '\'"'

 class JS_Break(ExtractorError):
@@ -69,12 +70,17 @@ def _separate(expr, delim=',', max_split=None):
             return
         counters = {k: 0 for k in _MATCHING_PARENS.values()}
         start, splits, pos, delim_len = 0, 0, 0, len(delim) - 1
+        in_quote, escaping = None, False
         for idx, char in enumerate(expr):
             if char in _MATCHING_PARENS:
                 counters[_MATCHING_PARENS[char]] += 1
             elif char in counters:
                 counters[char] -= 1
-            if char != delim[pos] or any(counters.values()):
+            elif not escaping and char in _QUOTES and in_quote in (char, None):
+                in_quote = None if in_quote else char
+            escaping = not escaping and in_quote and char == '\\'
+
+            if char != delim[pos] or any(counters.values()) or in_quote:
                 pos = 0
                 continue
             elif pos != delim_len:

But when I thought about it more, this is what json.loads already does in JSONDecoder.raw_decode. The only difference is that the stdlib raises when the unparsed section is not just whitespace. So we can just catch that error, trim the JSON at the point of error, and try to parse it again. This is how I ended up with the current implementation.
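The mechanism is easy to see directly with the stdlib:

```python
import json

blob = '{"a": 1, "b": [2, 3]}</script><div>trailing page text'

# raw_decode parses one complete JSON value and returns the index where it
# stopped, ignoring whatever trails it:
obj, end = json.JSONDecoder().raw_decode(blob)
print(obj)         # {'a': 1, 'b': [2, 3]}
print(blob[end:])  # </script><div>trailing page text

# Plain json.loads raises instead, because the trailing text is not whitespace:
# json.loads(blob)  ->  json.JSONDecodeError: Extra data
```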

Another solution could be to create a custom parser.

diff --git a/yt_dlp/extractor/common.py b/yt_dlp/extractor/common.py
index b24599d5f..d43280b07 100644
--- a/yt_dlp/extractor/common.py
+++ b/yt_dlp/extractor/common.py
@@ -35,6 +35,7 @@
     ExtractorError,
     GeoRestrictedError,
     GeoUtils,
+    LenientJSONDecoder,
     RegexNotFoundError,
     UnsupportedError,
     age_restricted,
@@ -1033,11 +1034,11 @@ def _download_json(
             expected_status=expected_status)
         return res if res is False else res[0]

-    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
+    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True, lenient=False):
         if transform_source:
             json_string = transform_source(json_string)
         try:
-            return json.loads(json_string, strict=False)
+            return json.loads(json_string, strict=False, cls=LenientJSONDecoder if lenient else None)
         except ValueError as ve:
             errmsg = '%s: Failed to parse JSON ' % video_id
             if fatal:
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index 245778dff..ee36c229f 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -397,8 +397,8 @@ def _check_login_required(self):
         if self._LOGIN_REQUIRED and not self._cookies_passed:
             self.raise_login_required('Login details are needed to download this content', method='cookies')

-    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
+    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+})\s*;'
+    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+})\s*;'
     _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'

     def _get_default_ytcfg(self, client='web'):
@@ -2754,7 +2754,7 @@ def _extract_chapters(self, chapter_list, chapter_time, chapter_title, duration)
     def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
         return self._parse_json(self._search_regex(
             (fr'{regex}\s*{self._YT_INITIAL_BOUNDARY_RE}',
-             regex), webpage, name, default='{}'), video_id, fatal=False)
+             regex), webpage, name, default='{}'), video_id, fatal=False, lenient=True)

     def _extract_comment(self, comment_renderer, parent=None):
         comment_id = comment_renderer.get('commentId')
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index b0300b724..ee858afaf 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -5381,6 +5381,13 @@ def __repr__(self):
         return f'{type(self).__name__}({", ".join(f"{k}={v}" for k, v in self)})'

+class LenientJSONDecoder(json.JSONDecoder):
+    """JSONDecoder that ignores excess text"""
+
+    def decode(self, s):
+        return self.raw_decode(s.lstrip())[0]
+
+
 # Deprecated
 has_certifi = bool(certifi)
 has_websockets = bool(websockets)

PS: Feel free to copy the code for any of these solutions (I honestly wouldn't recommend the regex though)

dirkf commented 2 years ago

Yes, not a hack at all, or rather an excellent hack, once you actually read the code properly. Finding the end of a JSON block with a regex is clearly unviable in general, so it's much better to use the parser from the json module.

I'm told that the Go JSON parser has this lenience built in, which is why the ytarchive change mentioned above also did this:

... regex must be changed to greedy since we can handle over-capturing.

One comment:

        ...
        try:
            # should be outside the try block?
            if transform_source:
                json_string = transform_source(json_string)
        except ExtractorError as e:
        ...

Adding a decoder kwarg (default json.JSONDecoder) to _parse_json() might be good, as some APIs may send XML or some other response in certain cases (on error, e.g.), and it would be easier to handle that with a decoder class than with transform_source. It could also replace transform_source, though fatal handling as above would be less straightforward. A function like the one below could be used to make a decoder class from a transform_source function:

def json_transformer(transform_source):
    class xf(json.JSONDecoder):
        # in CPython 2.7 decode() will call this raw_decode()
        # with secret kwargs: check other implementations
        def raw_decode(self, s, **kwargs):
            s = transform_source(s)
            return super(xf, self).raw_decode(s, **kwargs)

    return xf
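A quick sanity check of that sketch (the definition is repeated so the snippet runs standalone). One caveat: `decode()` validates the trailing text against the untransformed input, so the transform must not move the end of the parsed value relative to the original string; a length-preserving transform such as quote fixing is safe:

```python
import json

def json_transformer(transform_source):
    class xf(json.JSONDecoder):
        # decode() calls this raw_decode() with keyword args (e.g. idx)
        def raw_decode(self, s, **kwargs):
            return super(xf, self).raw_decode(transform_source(s), **kwargs)
    return xf

# Hypothetical length-preserving transform: single -> double quotes.
fix_quotes = lambda s: s.replace("'", '"')
result = json.loads("{'ok': 1}", cls=json_transformer(fix_quotes), strict=False)
print(result)  # {'ok': 1}
```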

Or the transform_source could be tested and treated as a decoder class:

    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
        # a plain function (not a class) is applied to the string first
        if callable(transform_source) and not isinstance(transform_source, type):
            try:
                json_string = transform_source(json_string)
            except ExtractorError as e:
                if not fatal:
                    self.report_warning('{0}: Failed to transform JSON: {1}'.format(video_id, e))
                raise
        try:
            # allow duck typing, not just subclasses of JSONDecoder
            if isinstance(transform_source, type):
                return json.loads(json_string, strict=False, cls=transform_source)
            return json.loads(json_string, strict=False)
        except ValueError as ve:
            ...
dirkf commented 2 years ago

Just did "--write-pages" and attach the resulting files.

Thanks. I checked the YT page that you dumped, and the same problem that I analysed above applies. So it should be fixed by the modified YT extractor.

pukkandan commented 2 years ago

FYI, yt-dlp's implementation has been changed to use a custom decoder. https://github.com/yt-dlp/yt-dlp/commit/b7c47b743871cdf3e0de75b17e4454d987384bf9

pukkandan commented 2 years ago

as some APIs may send XML or some other return in certain cases (error, eg) and it would be easier to handle that with a decoder class than transform_source

Why would you use _parse_json for XML? There is a different parser for it

dirkf commented 2 years ago

The case I found was a JSON API at trt.com that returned JSON normally, except that when the API failed it returned XML instead.