Open TechComet opened 2 years ago
It looks like TT is sometimes sending a page with an unexpected format, such as an error page. The problem URL was successfully extracted several times when I tested just now.
If you can reproduce the error, use --write-pages to save the downloaded HTML, and we can then analyse it.
yt-dl 'https://www.tiktok.com/@aamora_3mk/video/7028702876205632773' --write-pages
Thanks, that shows a new format, and now I'm seeing it too.
It seems that TT is in the middle of switching its framework from NextJS to Sigi, and the persisted state JSON sent in the page is changing as a result. Instead of a <script> element with id __NEXT_DATA__, we get one with id sigi-persisted-data and JSON with a slightly different structure.
This patch deals with both types of page format:
--- old/youtube-dl/youtube_dl/extractor/tiktok.py
+++ new/youtube-dl/youtube_dl/extractor/tiktok.py
@@ -108,6 +108,15 @@
video_id = self._match_id(url)
webpage = self._download_webpage(url, video_id)
page_props = self._parse_json(self._search_regex(
+ r'''(?s)<script[^>]+\bid=(?P<q>"|'|\b)sigi-persisted-data(?P=q)[^>]+>\s*=\s*(?P<json>{.+?})\s*</script''',
+ webpage, 'sigi data', default='{}', group='json'), video_id)
+ data = try_get(page_props, lambda x: x['ItemModule'][video_id]['video'], dict)
+ if data:
+ data = page_props['ItemModule'][video_id]
+ if data.get('privateItem'):
+ raise ExtractorError('This video is private', expected=True)
+ return self._extract_video(data, video_id)
+ page_props = self._parse_json(self._search_regex(
r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
webpage, 'data'), video_id)['props']['pageProps']
data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
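For anyone who wants to check a page saved with --write-pages outside youtube-dl, here is a minimal standalone sketch of the two-format lookup that the patch performs. It uses only the standard library. The regexes are simplified from the ones in the patch, and find_item is a hypothetical helper name, not part of youtube-dl.

```python
import json
import re

# Simplified from the patch: Sigi pages assign the JSON to a JS variable,
# so everything up to the first '=' is skipped before the opening brace.
SIGI_RE = re.compile(
    r'(?s)<script\s[^>]*?\bid\s*=\s*(["\'])sigi-persisted-data\1[^>]*>'
    r'[^=]*=\s*(?P<json>{.+?})\s*(?:;[^<]+)?</script')
# Next.js pages carry the bare JSON as the script body.
NEXT_RE = re.compile(
    r'(?s)<script\s[^>]*?\bid\s*=\s*(["\'])__NEXT_DATA__\1[^>]*>'
    r'\s*(?P<json>{.+?})\s*</script')


def find_item(webpage, video_id):
    """Return (page_format, item_dict), or (None, None) if neither format matches.

    Hypothetical helper for illustration; the key names follow the patch.
    """
    m = SIGI_RE.search(webpage)
    if m:
        state = json.loads(m.group('json'))
        item = state.get('ItemModule', {}).get(video_id)
        # As in the patch, only accept the item if it carries a 'video' dict.
        if isinstance(item, dict) and isinstance(item.get('video'), dict):
            return 'sigi', item
    m = NEXT_RE.search(webpage)
    if m:
        props = json.loads(m.group('json')).get('props', {}).get('pageProps', {})
        item = props.get('itemInfo', {}).get('itemStruct')
        if isinstance(item, dict):
            return 'next', item
    return None, None
```

Trying the Sigi pattern first and falling back to __NEXT_DATA__ mirrors the order in the patch, so a page in either format is handled without knowing in advance which one TT sent.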
Why not send the update to GitHub ytdl-org/youtube-dl?
I tried these changes, but they don't work.
Why not send the update to GitHub ytdl-org/youtube-dl?
Actually, @wranai had already posted the PR linked above.
My revised patch:
--- old/youtube-dl/youtube_dl/extractor/tiktok.py
+++ new/youtube-dl/youtube_dl/extractor/tiktok.py
@@ -15,10 +15,11 @@
class TikTokBaseIE(InfoExtractor):
def _extract_video(self, data, video_id=None):
- video = data['video']
- description = str_or_none(try_get(data, lambda x: x['desc']))
- width = int_or_none(try_get(data, lambda x: video['width']))
- height = int_or_none(try_get(data, lambda x: video['height']))
+ video = try_get(data, lambda x: x['video'], dict)
+ if not video:
+ return
+ width = int_or_none(video.get('width'))
+ height = int_or_none(video.get('height'))
format_urls = set()
formats = []
@@ -43,30 +44,32 @@
thumbnail = url_or_none(video.get('cover'))
duration = float_or_none(video.get('duration'))
- uploader = try_get(data, lambda x: x['author']['nickname'], compat_str)
- uploader_id = try_get(data, lambda x: x['author']['id'], compat_str)
+ author = data.get('author')
+ if isinstance(author, dict):
+ uploader_id = author.get('id')
+ else:
+ uploader_id = data.get('authorId')
+ author = data
+ uploader = str_or_none(author.get('nickname'))
timestamp = int_or_none(data.get('createTime'))
- def stats(key):
- return int_or_none(try_get(
- data, lambda x: x['stats']['%sCount' % key]))
-
- view_count = stats('play')
- like_count = stats('digg')
- comment_count = stats('comment')
- repost_count = stats('share')
+ stats = try_get(data, lambda x: x['stats'], dict)
+ view_count, like_count, comment_count, repost_count = [
+ stats and int_or_none(stats.get('%sCount' % key))
+ for key in ('play', 'digg', 'comment', 'share', )]
aweme_id = data.get('id') or video_id
return {
'id': aweme_id,
+ 'display_id': video_id,
'title': uploader or aweme_id,
- 'description': description,
+ 'description': str_or_none(data.get('desc')),
'thumbnail': thumbnail,
'duration': duration,
'uploader': uploader,
- 'uploader_id': uploader_id,
+ 'uploader_id': str_or_none(uploader_id),
'timestamp': timestamp,
'view_count': view_count,
'like_count': like_count,
@@ -84,11 +87,11 @@
'info_dict': {
'id': '6606727368545406213',
'ext': 'mp4',
- 'title': 'Zureeal',
+ 'title': 'md5:24acc456b62b938a7e2dd88e978b20d9',
'description': '#bowsette#mario#cosplay#uk#lgbt#gaming#asian#bowsettecosplay',
'thumbnail': r're:^https?://.*',
'duration': 15,
- 'uploader': 'Zureeal',
+ 'uploader': 'md5:24acc456b62b938a7e2dd88e978b20d9',
'uploader_id': '188294915489964032',
'timestamp': 1538248586,
'upload_date': '20180929',
@@ -108,8 +111,17 @@
video_id = self._match_id(url)
webpage = self._download_webpage(url, video_id)
page_props = self._parse_json(self._search_regex(
- r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
- webpage, 'data'), video_id)['props']['pageProps']
+ r'''(?s)<script\s[^>]*?\bid\s*=\s*(?P<q>"|'|\b)sigi-persisted-data(?P=q)[^>]*>[^=]*=\s*(?P<json>{.+?})\s*(?:;[^<]+)?</script''',
+ webpage, 'sigi data', default='{}', group='json'), video_id)
+ data = try_get(page_props, lambda x: x['ItemModule'][video_id]['video'], dict)
+ if data:
+ data = page_props['ItemModule'][video_id]
+ if data.get('privateItem'):
+ raise ExtractorError('This video is private', expected=True)
+ return self._extract_video(data, video_id)
+ page_props = self._parse_json(self._search_regex(
+ r'''(?s)<script\s[^>]*?\bid\s*=\s*(?P<q>"|'|\b)__NEXT_DATA__(?P=q)[^>]*>\s*(?P<json>{.+?})\s*</script''',
+ webpage, 'data', group='json'), video_id)['props']['pageProps']
data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
if not data and page_props.get('statusCode') == 10216:
raise ExtractorError('This video is private', expected=True)
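The uploader handling in the revised patch is the part that differs most between the two page formats, so here is a small standalone sketch of just that step. It assumes, as the patch does, that a __NEXT_DATA__ item carries author as a dict while a Sigi item carries authorId and nickname at the top level. extract_uploader is a hypothetical name used only for this illustration.

```python
def extract_uploader(data):
    """Return (uploader, uploader_id) from an item dict of either page format.

    Hypothetical helper; the field layout is taken from the patch above,
    not from any documented TikTok schema.
    """
    author = data.get('author')
    if isinstance(author, dict):
        # __NEXT_DATA__ style: author is a nested object.
        uploader_id = author.get('id')
    else:
        # Sigi style: the id and nickname sit on the item itself.
        uploader_id = data.get('authorId')
        author = data
    nickname = author.get('nickname')
    return (
        None if nickname is None else str(nickname),
        None if uploader_id is None else str(uploader_id),
    )
```

Converting both values to strings only when they are present plays the same role as the str_or_none calls in the patch.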
Apparently some fetch attempts get a page that contains no media information: neither the __NEXT_DATA__ nor the sigi-persisted-data. You just have to try again.
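If retrying by hand gets tedious, the idea can be sketched as a small loop. This is only an illustration under stated assumptions: fetch and parse are hypothetical callables supplied by the caller (one downloads the page, the other returns None when no media JSON is found), and nothing here is part of youtube-dl.

```python
def fetch_until_parsed(fetch, parse, attempts=3):
    """Call fetch() up to `attempts` times until parse(page) returns a value.

    Returns (result, attempts_used); result is None if every attempt failed.
    Both callables are assumptions made for this sketch.
    """
    for attempt in range(1, attempts + 1):
        page = fetch()
        result = parse(page)
        if result is not None:
            return result, attempt
    return None, attempts
```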
A couple of tests support [report that all attempts were showing ]. Perhaps TikTok was trying out the two different toolkits and has settled on Next.js. No-one is reporting issues with yt-dlp, which uses __NEXT_DATA__, though that extractor downloads the page twice, commenting that videos may get 403 otherwise.
See https://medium.com/@szdc/reverse-engineering-the-musical-ly-api-662331008eb3, for example.
[For younger readers like myself, musical.ly is what became tiktok]
Hey guys,
Sorry to bother you with this, but I just can't seem to make your patch work... I had no issues with your YouTube throttling patch. It works great and I really appreciate your work! With this one, I'm just lacking the knowledge or scripting skill. I'm always getting:
[TikTok] Setting up session
ERROR: Unable to download webpage: [Errno 60] Operation timed out (caused by error(60, 'Operation timed out')); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
@dirkf Is there a way to just share a working tiktok.py or update this PR here? https://github.com/ytdl-org/youtube-dl/pull/30224
My apologies for confusion and incompetence.
That looks like a networking issue separate from any tiktok extractor problems.
We could do with a verbose format extraction log (-v -F) of the unmodified 2021.12.17 version (it shouldn't matter if you've updated extractor/youtube.py), and then the same with the extractor/tiktok.py from the PR. I think the PR should fix the same issue that my patch does, though there may be edge cases.
Well, this is definitely not a network problem: other websites work. The PR does not include a single line of your patch and it does not work. Here's the log:
./youtube-dl -v -F https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Darwin-20.6.0-arm64-arm-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[TikTok] Setting up session
ERROR: Unable to download webpage: [Errno 60] Operation timed out (caused by error(60, 'Operation timed out')); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "./youtube-dl/youtube_dl/extractor/common.py", line 634, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "./youtube-dl/youtube_dl/YoutubeDL.py", line 2288, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "./youtube-dl/youtube_dl/utils.py", line 1207, in https_open
    req, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1134, in getresponse
    response.begin()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 442, in begin
    version, status, reason = self._read_status()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 398, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 754, in recv
    return self.read(buflen)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 641, in read
    v = self._sslobj.read(len)
PR #30224 does include the necessary extra code for SIGI-type pages, but unfortunately, if understandably, the author didn't take up my suggestion to combine the additional code from my patch.
The timeout issue (Error 60 in Windows, SSLError('The read operation timed out',) in Linux) is some weird blocking done by whatever fronts TikTok's pages (Akamai, apparently). In order to download the page for parsing, some cookie has to be sent, and a way to get it is to make a previous request to the site. In yt-dl, and also with my patch and with PR #30224, the extractor fetches https://www.tiktok.com/ before doing anything else. In yt-dlp, which works with your URL using the latest Git source, the code fetches the webpage itself twice, commenting that you get 403 otherwise.
A small change to the PR #30224 code improves on the yt-dlp approach: instead of fetching the whole page (GET request), just send a HEAD request; if a page is actually returned, rather than an error with a Set-Cookie header, it doesn't actually have to be downloaded.
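Outside youtube-dl, the same cookie-priming idea can be sketched with the Python 3 standard library alone. This is a rough illustration, not the extractor code: the function names are hypothetical, and whether a HEAD request is enough to obtain the cookie is an assumption carried over from the description above.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_head_request(url):
    """Build a HEAD request, so no response body has to be downloaded."""
    return urllib.request.Request(url, method='HEAD')


def fetch_with_cookie_priming(url, opener, timeout=20):
    """Send a throwaway HEAD request to pick up cookies, then GET the page.

    A failed HEAD request is ignored on purpose (like fatal=False in the
    patch): only the cookies it may have set are of interest.
    """
    try:
        opener.open(make_head_request(url), timeout=timeout).close()
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        pass
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()
```

The opener returned by build_cookie_opener() would be passed as the second argument, so that any cookie picked up by the HEAD request is sent with the following GET.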
--- old/youtube-dl/youtube_dl/extractor/tiktok.py
+++ new/youtube-dl/youtube_dl/extractor/tiktok.py
@@ -6,6 +6,7 @@
compat_str,
ExtractorError,
float_or_none,
+ HEADRequest,
int_or_none,
str_or_none,
try_get,
@@ -99,18 +100,27 @@
}
}]
- def _real_initialize(self):
- # Setup session (will set necessary cookies)
- self._request_webpage(
- 'https://www.tiktok.com/', None, note='Setting up session')
-
def _real_extract(self, url):
video_id = self._match_id(url)
+
+ # dummy request to set cookies
+ self._request_webpage(
+ HEADRequest(url), video_id,
+ note=False, errnote='Could not send HEAD request to %s' % url,
+ fatal=False)
webpage = self._download_webpage(url, video_id)
- page_props = self._parse_json(self._search_regex(
- r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
- webpage, 'data'), video_id)['props']['pageProps']
- data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
+ try:
+ page_props = self._parse_json(self._search_regex(
+ r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
+ webpage, 'data'), video_id)['props']['pageProps']
+ data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
+ except:
+ page_props = self._parse_json(self._search_regex(
+ r'<script[^>]+\bid=["\']sigi-persisted-data[^>]+>window\[\'SIGI_STATE\']=({.+?});window\[',
+ webpage, 'data'), video_id)
+ data = try_get(page_props, lambda x: x['ItemModule'][video_id], dict)
+ author = try_get(page_props, lambda x: x['UserModule']['users'][data['author']], dict)
+ data['author'] = author
if not data and page_props.get('statusCode') == 10216:
raise ExtractorError('This video is private', expected=True)
return self._extract_video(data, video_id)
Then:
$ python -m youtube_dl -v -F 'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.06.06
[debug] Git HEAD: bd7d796ef
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[TikTok] 6987753546036923653: Downloading webpage
[info] Available formats for 6987753546036923653:
format code extension resolution note
0 mp4 576x1024
df@Spiridion:~/Documents/src/youtube-dl$ python -m youtube_dl -v -F 'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: bd7d796ef
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[TikTok] 6987753546036923653: Downloading webpage
[info] Available formats for 6987753546036923653:
format code extension resolution note
0 mp4 576x1024
$
I guess we need a new PR.
WOW! thank you so much for such a detailed reply! Implemented this patch and it works for me! appreciate your help and patience!
PR #30479 linked above should address all current issues, with this version of the extractor. Please report any issues there.
Looks good so far and just downloaded a small batch of videos that didn't work before.
To download on TikTok you only have to right-click on the video and choose "Download the video".
Apparently we now have to pretend to be a mobile phone (Windows: '' -> ""):
--user-agent 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'
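For scripts that fetch the page directly rather than through the command line, the equivalent is to send the same string as the User-Agent request header. A minimal sketch with the standard library follows; mobile_request is a hypothetical helper name, and the sketch makes no claim that this header alone is sufficient.

```python
import urllib.request

# The mobile user agent string quoted above, split only for line length.
MOBILE_UA = (
    'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) '
    'AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 '
    'Mobile/14E5239e Safari/602.1'
)


def mobile_request(url, user_agent=MOBILE_UA):
    """Build a GET request that identifies itself as a mobile browser."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})
```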
Passing --user-agent 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1' as suggested by @dirkf got me further, but I shortly encountered a different error:
[TikTok] Setting up session
[TikTok] 7193925570105806126: Downloading webpage
ERROR: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
--verbose
youtube-dl https://www.tiktok.com/@special_head/video/7193925570105806126/ --user-agent 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1' --verbose
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['https://www.tiktok.com/@special_head/video/7193925570105806126/', '--user-agent', 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1', '--verbose']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 4b3d64d30
[debug] Python version 3.11.1 (CPython) - macOS-13.2-arm64-arm-64bit
[debug] exe versions: ffmpeg 5.1.2, ffprobe 5.1.2, rtmpdump 2.4
[debug] Proxy map: {}
[TikTok] Setting up session
[TikTok] 7193925570105806126: Downloading webpage
ERROR: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
                ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/tiktok.py", line 110, in _real_extract
    page_props = self._parse_json(self._search_regex(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
If helpful, yt-dlp seems to have some solution worked out, no additional flags needed.
Description
Sometimes it works, but sometimes it doesn't!