fairfaxhshw opened this issue 2 years ago
Your output is the same as that for non-members except for this:
WARNING: [youtube] MCy7s-c5xAw: Failed to parse JSON Unterminated string starting at: line 1 column 60361 (char 60360)
So the cookie is having some effect. Apparently the cookie-driven page has a different structure such that the embedded hydration JSON (probably) isn't correctly extracted. The extractor then POSTs a query to https://www.youtube.com/youtubei/v1/player, which gives JSON containing the logged error message.
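For context, that fallback query is roughly the following (a simplified sketch: the request body is cut down and the clientVersion shown is illustrative, but the endpoint and API key are the ones the extractor uses):

```python
import json
import urllib.request

# Sketch of the extractor's fallback innertube call (simplified;
# the clientVersion below is illustrative, not canonical)
body = {
    'videoId': 'MCy7s-c5xAw',
    'context': {'client': {
        'clientName': 'WEB',
        'clientVersion': '2.20201021.03.00',
    }},
}
req = urllib.request.Request(
    'https://www.youtube.com/youtubei/v1/player'
    '?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8',
    data=json.dumps(body).encode('utf-8'),
    headers={'content-type': 'application/json'})
# Without session authentication this yields the logged-out player
# response, so members-only formats are missing
response = json.load(urllib.request.urlopen(req))
```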
If you use `--write-pages` and attach the resulting files, it should be possible to analyse the page and find the problem.
See also #29928. Other historical issues seem to have been due to the cookie file being incorrect or not specified correctly.
this is fixed in yt-dlp by https://github.com/yt-dlp/yt-dlp/commit/ee27297f82ccbd702ccd4721d1d3c9d67bbe187e
test video: https://www.youtube.com/watch?v=tjjjtzRLHvA
> So the cookie is having some effect. Apparently the cookie-driven page has a different structure such that the embedded hydration JSON (probably) isn't correctly extracted. The extractor then POSTs a query to https://www.youtube.com/youtubei/v1/player, which gives JSON containing the logged error message.
yeah, youtube-dl has no auth support with innertube, hence this error (it was one of the early things fixed in yt-dlp). This player request itself is also lacking many parameters, so it doesn't always work; youtube-dl is therefore reliant on extracting the data from the webpage (which is failing here).
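for reference, the auth that yt-dlp adds is the SAPISIDHASH Authorization header derived from the logged-in SAPISID cookie; the scheme is roughly this (a sketch, not yt-dlp's actual code):

```python
import hashlib
import time

def sapisidhash(sapisid, origin='https://www.youtube.com'):
    # SAPISID is taken from the logged-in cookies; the resulting header
    # authenticates innertube requests for that session
    timestamp = int(time.time())
    sha1 = hashlib.sha1(' '.join(
        (str(timestamp), sapisid, origin)).encode('utf-8')).hexdigest()
    return 'SAPISIDHASH %d_%s' % (timestamp, sha1)

# sent as:  Authorization: SAPISIDHASH <timestamp>_<sha1>
```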
this is fixed in yt-dlp by yt-dlp/yt-dlp@ee27297
Well, I could run the test video, but what are these unparseable sham JSON strings?
> this is fixed in yt-dlp by yt-dlp/yt-dlp@ee27297
Really, that seems like a bit of a hack, unless there is a use case for `fatal=True, lenient=True`. Don't we want to know when the extraction is going wrong?
> Well, I could run the test video, but what are these unparseable sham JSON strings?
I have run the test video. Aha! The test video's title is the rather antagonistic `ハッシュタグ無し };if\n window.ytcsi` (apparently "no hashtag};..."), which breaks the pattern used to extract the YT initial data, as `\n` doesn't match `.+?` without `re.DOTALL`. Also, we're looking for a block that's terminated by `; var meta =` whereas YT is now setting `var head =` first. The fallback pattern then returns an initial substring of the JSON that crashes the parser.
The initial hydration data may also contain a potentially confusing chunk of JS as the value of its `attestation.playerAttestationRenderer.interpreterSafeScript.botguardData.privateDoNotAccessOrElseSafeScriptWrappedValue` member. As it's minified with fewer than 3339 variables, its variable names are at most 2 characters long.
Finally, yt-dl has largely identical methods `YoutubeIE._extract_yt_initial_variable()` and `YoutubeBaseInfoExtractor._extract_yt_initial_data()` that should be unified as `YoutubeBaseInfoExtractor._extract_yt_initial_variable()` (yt-dlp has `YoutubeBaseInfoExtractor.extract_yt_initial_data()`, but it's apparently not used outside the YT extractor and the same could apply).
If we strip the trailing `;` from the main pattern and change `_YT_INITIAL_BOUNDARY_RE` to `r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'`, the JSON can be correctly extracted.
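Continuing the synthetic fragment above, the revised pattern captures the complete object:

```python
NEW_BOUNDARY_RE = r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'
NEW_RE = r'(?s)ytInitialPlayerResponse\s*=\s*({.+?})\s*' + NEW_BOUNDARY_RE
print(re.search(NEW_RE, page).group(1))
# -> the complete {"videoDetails": ...} object, terminated at ';var head'
```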
--- old/youtube-dl/youtube_dl/extractor/youtube.py
+++ new/youtube-dl/youtube_dl/extractor/youtube.py
@@ -284,7 +284,7 @@
     _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
     _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'
+    _YT_INITIAL_BOUNDARY_RE = r'(?:;\s*var\s+[\w$]{3,}|;?\s*</script|;\s*?\n)'

     def _call_api(self, ep, query, video_id, fatal=True):
         data = self._DEFAULT_API_DATA.copy()
@@ -297,12 +297,10 @@
             headers={'content-type': 'application/json'},
             query={'key': 'AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8'})

-    def _extract_yt_initial_data(self, video_id, webpage):
-        return self._parse_json(
-            self._search_regex(
-                (r'%s\s*%s' % (self._YT_INITIAL_DATA_RE, self._YT_INITIAL_BOUNDARY_RE),
-                 self._YT_INITIAL_DATA_RE), webpage, 'yt initial data'),
-            video_id)
+    def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
+        return self._parse_json(self._search_regex(
+            (r'(?s)%s\s*%s' % (regex.rstrip(';'), self._YT_INITIAL_BOUNDARY_RE),
+             regex), webpage, name, default='{}'), video_id, fatal=False)

     def _extract_ytcfg(self, video_id, webpage):
         return self._parse_json(
@@ -1654,11 +1652,6 @@
             })
         return chapters

-    def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
-        return self._parse_json(self._search_regex(
-            (r'%s\s*%s' % (regex, self._YT_INITIAL_BOUNDARY_RE),
-             regex), webpage, name, default='{}'), video_id, fatal=False)
-
     def _real_extract(self, url):
         url, smuggled_data = unsmuggle_url(url, {})
         video_id = self._match_id(url)
@@ -3026,7 +3019,7 @@
             return self.url_result(video_id, ie=YoutubeIE.ie_key(), video_id=video_id)
         self.to_screen('Downloading playlist %s - add --no-playlist to just download video %s' % (playlist_id, video_id))
         webpage = self._download_webpage(url, item_id)
-        data = self._extract_yt_initial_data(item_id, webpage)
+        data = self._extract_yt_initial_variable(webpage, self._YT_INITIAL_DATA_RE, item_id, 'yt initial data')
         tabs = try_get(
             data, lambda x: x['contents']['twoColumnBrowseResultsRenderer']['tabs'], list)
         if tabs:
And the test video `tjjjtzRLHvA`:
$ python -m youtube_dl -v -F --ignore-config 'https://www.youtube.com/watch?v=tjjjtzRLHvA'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'--ignore-config', u'https://www.youtube.com/watch?v=tjjjtzRLHvA']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 04fd3289d
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[youtube] tjjjtzRLHvA: Downloading webpage
[debug] [youtube] Decrypted nsig MZKNNaj5qnOtL1kDxc-q => WhWpMgo90a-uUQ
[debug] [youtube] Decrypted nsig lxmrUIylnXPO25AvJzZk => lRCxh6n1geqddw
[info] Available formats for tjjjtzRLHvA:
format code extension resolution note
249 webm audio only tiny 41k , webm_dash container, opus @ 41k (48000Hz), 27.83KiB
250 webm audio only tiny 42k , webm_dash container, opus @ 42k (48000Hz), 28.56KiB
251 webm audio only tiny 84k , webm_dash container, opus @ 84k (48000Hz), 56.94KiB
140 m4a audio only tiny 130k , m4a_dash container, mp4a.40.2@130k (44100Hz), 88.41KiB
160 mp4 82x144 144p 20k , mp4_dash container, avc1.4d400b@ 20k, 30fps, video only, 14.04KiB
133 mp4 136x240 144p 40k , mp4_dash container, avc1.4d400c@ 40k, 30fps, video only, 26.99KiB
278 webm 144x256 144p 45k , webm_dash container, vp9@ 45k, 30fps, video only, 30.65KiB
242 webm 240x426 240p 58k , webm_dash container, vp9@ 58k, 30fps, video only, 39.12KiB
134 mp4 202x360 240p 75k , mp4_dash container, avc1.4d400d@ 75k, 30fps, video only, 50.54KiB
135 mp4 270x480 240p 143k , mp4_dash container, avc1.4d4015@ 143k, 30fps, video only, 96.55KiB
243 webm 360x640 360p 115k , webm_dash container, vp9@ 115k, 30fps, video only, 77.29KiB
136 mp4 406x720 360p 305k , mp4_dash container, avc1.64001e@ 305k, 30fps, video only, 205.08KiB
244 webm 480x854 480p 210k , webm_dash container, vp9@ 210k, 30fps, video only, 141.36KiB
137 mp4 608x1080 480p 610k , mp4_dash container, avc1.64001f@ 610k, 30fps, video only, 410.21KiB
247 webm 720x1280 720p 549k , webm_dash container, vp9@ 549k, 30fps, video only, 368.78KiB
18 mp4 360x640 360p 426k , avc1.42001E, 30fps, mp4a.40.2 (48000Hz), 288.73KiB
22 mp4 406x720 360p 435k , avc1.64001F, 30fps, mp4a.40.2 (44100Hz) (best)
$
@pukkandan (since you were the one that wrote it)
ytarchive had a similar issue a few days ago, FYI https://github.com/Kethsar/ytarchive/issues/93#issuecomment-1140275153
Alright. Just did "--write-pages" and attached the resulting files. I was unable to attach the dump files, so I have attached them in a compressed file.
Maybe `lenient` is not a very good keyword. What it actually does is parse the JSON until an error is reached. In other words, it can parse JSON content embedded in a larger text (like `{...}<..>`).
Originally, I attempted to fix this issue with just regex. But since Python regex does not support recursive groups or even possessive quantifiers, it is impossible to write a foolproof regex to capture JSON without creating catastrophic backtracking. E.g. `r'ytInitialPlayerResponse\s*=\s*({(?:"(?:\\"|[^"])+"|[^"])+});'` works, but hangs indefinitely if the pattern is not found on the page.
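for comparison, the third-party `regex` module (not a yt-dlp dependency, just to illustrate what the stdlib `re` is missing) does support recursion and possessive quantifiers, so a safe balanced-brace pattern is possible there:

```python
import regex  # third-party module; not a yt-dlp dependency

# Possessive quantifiers (++, *+) forbid backtracking, and (?R) recurses
# the whole pattern, so nested braces and a quoted '};' are handled safely
JSON_OBJECT_RE = regex.compile(
    r'\{(?:[^{}"]++|"(?:\\.|[^"\\])*+"|(?R))*+\}')

page = 'ytInitialPlayerResponse = {"a": {"title": "};"}};var head = 1'
print(JSON_OBJECT_RE.search(page).group(0))
# -> {"a": {"title": "};"}}
```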
Actually, it is not the first time I am encountering this issue. The same problem existed when trying to isolate `{...}` code blocks for jsinterp. I had written `JSInterpreter._separate_at_paren` for this reason. So I could add quoting support to this (and move it to utils) to address this use-case. (Note that the regex must be changed to greedy since we can handle over-capturing, but not under-capturing.)
diff --git a/yt_dlp/extractor/common.py b/yt_dlp/extractor/common.py
index b24599d5f..7b74a4b64 100644
--- a/yt_dlp/extractor/common.py
+++ b/yt_dlp/extractor/common.py
@@ -1034,8 +1034,13 @@ def _download_json(
         return res if res is False else res[0]

     def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
-        if transform_source:
-            json_string = transform_source(json_string)
+        try:
+            if transform_source:
+                json_string = transform_source(json_string)
+        except ExtractorError as e:
+            if not fatal:
+                self.report_warning(f'{video_id}: Failed to transform JSON: {e}')
+            raise
         try:
             return json.loads(json_string, strict=False)
         except ValueError as ve:
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index 69b58088d..bf02f3d88 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -397,8 +397,8 @@ def _check_login_required(self):
         if self._LOGIN_REQUIRED and not self._cookies_passed:
             self.raise_login_required('Login details are needed to download this content', method='cookies')

-    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
+    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+})\s*;'
+    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+})\s*;'
     _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'

     def _get_default_ytcfg(self, client='web'):
@@ -2743,9 +2743,10 @@ def _extract_chapters(self, chapter_list, chapter_time, chapter_title, duration)
         return chapters

     def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
-        return self._parse_json(self._search_regex(
-            (fr'{regex}\s*{self._YT_INITIAL_BOUNDARY_RE}',
-             regex), webpage, name, default='{}'), video_id, fatal=False)
+        return self._parse_json(
+            self._search_regex(regex, webpage, name, default='{}'),
+            video_id, fatal=False,
+            transform_source=lambda x: '{%s}' % JSInterpreter._separate_at_paren(x, '}')[0])

     def _extract_comment(self, comment_renderer, parent=None):
         comment_id = comment_renderer.get('commentId')
diff --git a/yt_dlp/jsinterp.py b/yt_dlp/jsinterp.py
index 70857b798..56229cd99 100644
--- a/yt_dlp/jsinterp.py
+++ b/yt_dlp/jsinterp.py
@@ -24,6 +24,7 @@
 _NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'
 _MATCHING_PARENS = dict(zip('({[', ')}]'))
+_QUOTES = '\'"'


 class JS_Break(ExtractorError):
@@ -69,12 +70,17 @@ def _separate(expr, delim=',', max_split=None):
             return
         counters = {k: 0 for k in _MATCHING_PARENS.values()}
         start, splits, pos, delim_len = 0, 0, 0, len(delim) - 1
+        in_quote, escaping = None, False
         for idx, char in enumerate(expr):
             if char in _MATCHING_PARENS:
                 counters[_MATCHING_PARENS[char]] += 1
             elif char in counters:
                 counters[char] -= 1
-            if char != delim[pos] or any(counters.values()):
+            elif not escaping and char in _QUOTES and in_quote in (char, None):
+                in_quote = None if in_quote else char
+            escaping = not escaping and in_quote and char == '\\'
+
+            if char != delim[pos] or any(counters.values()) or in_quote:
                 pos = 0
                 continue
             elif pos != delim_len:
But when I thought about it more, this is what `json.loads` already does in `JSONDecoder.raw_decode`. The only difference is that the stdlib raises when the unparsed section is not just whitespace. So we can just catch that error, trim the JSON at the point of error, and try to parse it again. This is how I ended up with the current implementation.
Another solution could be to create a custom parser.
diff --git a/yt_dlp/extractor/common.py b/yt_dlp/extractor/common.py
index b24599d5f..d43280b07 100644
--- a/yt_dlp/extractor/common.py
+++ b/yt_dlp/extractor/common.py
@@ -35,6 +35,7 @@
     ExtractorError,
     GeoRestrictedError,
     GeoUtils,
+    LenientJSONDecoder,
     RegexNotFoundError,
     UnsupportedError,
     age_restricted,
@@ -1033,11 +1034,11 @@ def _download_json(
             expected_status=expected_status)
         return res if res is False else res[0]

-    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
+    def _parse_json(self, json_string, video_id, transform_source=None, fatal=True, lenient=False):
         if transform_source:
             json_string = transform_source(json_string)
         try:
-            return json.loads(json_string, strict=False)
+            return json.loads(json_string, strict=False, cls=LenientJSONDecoder if lenient else None)
         except ValueError as ve:
             errmsg = '%s: Failed to parse JSON ' % video_id
             if fatal:
diff --git a/yt_dlp/extractor/youtube.py b/yt_dlp/extractor/youtube.py
index 245778dff..ee36c229f 100644
--- a/yt_dlp/extractor/youtube.py
+++ b/yt_dlp/extractor/youtube.py
@@ -397,8 +397,8 @@ def _check_login_required(self):
         if self._LOGIN_REQUIRED and not self._cookies_passed:
             self.raise_login_required('Login details are needed to download this content', method='cookies')

-    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+?})\s*;'
-    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+?})\s*;'
+    _YT_INITIAL_DATA_RE = r'(?:window\s*\[\s*["\']ytInitialData["\']\s*\]|ytInitialData)\s*=\s*({.+})\s*;'
+    _YT_INITIAL_PLAYER_RESPONSE_RE = r'ytInitialPlayerResponse\s*=\s*({.+})\s*;'
     _YT_INITIAL_BOUNDARY_RE = r'(?:var\s+meta|</script|\n)'

     def _get_default_ytcfg(self, client='web'):
@@ -2754,7 +2754,7 @@ def _extract_chapters(self, chapter_list, chapter_time, chapter_title, duration)
     def _extract_yt_initial_variable(self, webpage, regex, video_id, name):
         return self._parse_json(self._search_regex(
             (fr'{regex}\s*{self._YT_INITIAL_BOUNDARY_RE}',
-             regex), webpage, name, default='{}'), video_id, fatal=False)
+             regex), webpage, name, default='{}'), video_id, fatal=False, lenient=True)

     def _extract_comment(self, comment_renderer, parent=None):
         comment_id = comment_renderer.get('commentId')
diff --git a/yt_dlp/utils.py b/yt_dlp/utils.py
index b0300b724..ee858afaf 100644
--- a/yt_dlp/utils.py
+++ b/yt_dlp/utils.py
@@ -5381,6 +5381,13 @@ def __repr__(self):
         return f'{type(self).__name__}({", ".join(f"{k}={v}" for k, v in self)})'


+class LenientJSONDecoder(json.JSONDecoder):
+    """JSONDecoder that ignores excess text"""
+
+    def decode(self, s):
+        return self.raw_decode(s.lstrip())[0]
+
+
 # Deprecated
 has_certifi = bool(certifi)
 has_websockets = bool(websockets)
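With that, parsing simply ignores the trailing page text:

```python
import json

# LenientJSONDecoder as defined in the diff above
print(json.loads('{"a": 1};if (window.ytcsi) {}', cls=LenientJSONDecoder))
# -> {'a': 1}
```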
PS: Feel free to copy the code for any of these solutions (I honestly wouldn't recommend the regex though)
Yes, not a hack at all, or rather, an excellent hack, when you actually read the code properly. Finding the end of a JSON block with regex is clearly unviable in general so it's much better to use the parser from the json module.
I'm told that the Go JSON parser has this lenience built in, which is why the ytarchive change mentioned above also did this:
> ... regex must be changed to greedy since we can handle over-capturing.
One comment:

...
        try:
            # should be outside the try block?
            if transform_source:
                json_string = transform_source(json_string)
        except ExtractorError as e:
...
Adding a `decoder` kwarg (default `json.JSONDecoder`) to `_parse_json()` might be good, as some APIs may send XML or some other format in certain cases (e.g. an error) and it would be easier to handle that with a decoder class than with `transform_source`. It could also replace `transform_source`, though `fatal` handling as above would be less straightforward. A function like the one below could be applied to make a decoder class from a `transform_source` function:
def json_transformer(transform_source):
    class xf(json.JSONDecoder):
        # in CPython 2.7 decode() will call this raw_decode()
        # with secret kwargs: check other implementations
        def raw_decode(self, s, **kwargs):
            s = transform_source(s)
            return super(xf, self).raw_decode(s, **kwargs)
    return xf
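Hypothetical usage (note that `decode()` re-checks the end offset against the untransformed string, so the transform should preserve offsets/length):

```python
import json

# naive length-preserving transform: single -> double quotes
fix_quotes = lambda s: s.replace("'", '"')

print(json.loads("{'a': 1}", cls=json_transformer(fix_quotes)))
# -> {'a': 1}
```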
Or the `transform_source` could be tested and treated as a decoder class:

def _parse_json(self, json_string, video_id, transform_source=None, fatal=True):
    if transform_source and not isinstance(transform_source, type):
        try:
            json_string = transform_source(json_string)
        except ExtractorError as e:
            if not fatal:
                self.report_warning('{0}: Failed to transform JSON: {1}'.format(video_id, e))
            raise
    try:
        # allow duck typing: accept any decoder class, not just a subclass of JSONDecoder
        if isinstance(transform_source, type):
            return json.loads(json_string, strict=False, cls=transform_source)
        return json.loads(json_string, strict=False)
    except ValueError as ve:
        ...
> Just did "--write-pages" and attached the resulting files.
Thanks. I checked the YT page that you dumped, and the same problem that I analysed above applies. So it should be fixed by the modified YT extractor.
FYI, yt-dlp's implementation has been changed to use a custom decoder: https://github.com/yt-dlp/yt-dlp/commit/b7c47b743871cdf3e0de75b17e4454d987384bf9
> as some APIs may send XML or some other format in certain cases (e.g. an error) and it would be easier to handle that with a decoder class than with `transform_source`
Why would you use `_parse_json` for XML? There is a different parser for that.
Description
Trying to download a YouTube members-only video, and unable to download it. I'm currently a member of the channel, and able to watch the video on the YouTube website. I downloaded the most recent cookies and used them with the script.
I'm able to download other videos that do not require membership using youtube-dl. I was able to download the members-only video about two weeks ago without a problem.