ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Tiktok not get video url sometimes #30251

Open TechComet opened 2 years ago

TechComet commented 2 years ago

Verbose log

youtube-dl -g 'https://www.tiktok.com/@aamora_3mk/video/7028702876205632773' -v
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-g', 'https://www.tiktok.com/@aamora_3mk/video/7028702876205632773', '-v']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.06.06
[debug] Python version 3.8.10 (CPython) - Linux-5.11.0-40-generic-x86_64-with-glibc2.29
[debug] exe versions: ffmpeg 4.2.4, ffprobe 4.2.4
[debug] Proxy map: {}
ERROR: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/tiktok.py", line 110, in _real_extract
    page_props = self._parse_json(self._search_regex(
  File "/usr/local/bin/youtube-dl/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

Sometimes it works, but sometimes it doesn't!

dirkf commented 2 years ago

It looks like TT is sometimes sending a page with an unexpected format, such as an error page. The problem URL was successfully extracted several times when I tested just now.

If you can reproduce the error, use --write-pages to save the downloaded HTML, and we can then analyse it.

TechComet commented 2 years ago

yt-dl 'https://www.tiktok.com/@aamora_3mk/video/7028702876205632773' --write-pages

aamora_3mk_video_7028702876205632773.dump.zip

dirkf commented 2 years ago

Thanks, that shows a new format, and now I'm seeing it too.

It seems that TT is in the middle of switching its framework from Next.js to Sigi, and the persisted state JSON sent in the page is changing as a result. Instead of a <script> element with id __NEXT_DATA__, we get one with id sigi-persisted-data and JSON with a slightly different structure.
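
To illustrate the two page formats outside youtube-dl, here is a minimal standalone sketch. It is not the extractor's code: the helper names (_script_body, _first_json_object, _dig, extract_item) are hypothetical, and it assumes the Sigi page assigns the JSON to window['SIGI_STATE'] inside the sigi-persisted-data script, while the Next.js page puts bare JSON inside the __NEXT_DATA__ script.

```python
import json
import re

_DECODER = json.JSONDecoder()


def _script_body(html, script_id):
    """Return the text inside <script id=script_id ...>, or None."""
    m = re.search(
        r'(?s)<script\s[^>]*?\bid\s*=\s*["\']?%s["\']?[^>]*>(.*?)</script'
        % re.escape(script_id), html)
    return m.group(1) if m else None


def _first_json_object(text):
    """Decode the first {...} object in text, ignoring what follows it."""
    start = text.find('{')
    if start < 0:
        return None
    try:
        obj, _ = _DECODER.raw_decode(text, start)
    except ValueError:
        return None
    return obj


def _dig(obj, *keys):
    """Walk nested dicts; return None as soon as a level is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def extract_item(html, video_id):
    """Try the Sigi layout first, then fall back to the Next.js layout."""
    body = _script_body(html, 'sigi-persisted-data')
    if body is not None:
        item = _dig(_first_json_object(body), 'ItemModule', video_id)
        if isinstance(_dig(item, 'video'), dict):
            return item
    body = _script_body(html, '__NEXT_DATA__')
    if body is not None:
        item = _dig(_first_json_object(body),
                    'props', 'pageProps', 'itemInfo', 'itemStruct')
        if isinstance(item, dict):
            return item
    return None
```

Decoding with raw_decode instead of a lazy {.+?} pattern is a design choice of this sketch: it stops at the end of the first complete JSON object, so any trailing assignments in the same script are ignored.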

This patch deals with both types of page format:

--- old/youtube-dl/youtube_dl/extractor/tiktok.py
+++ new/youtube-dl/youtube_dl/extractor/tiktok.py
@@ -108,6 +108,15 @@
         video_id = self._match_id(url)
         webpage = self._download_webpage(url, video_id)
         page_props = self._parse_json(self._search_regex(
+            r'''(?s)<script[^>]+\bid=(?P<q>"|'|\b)sigi-persisted-data(?P=q)[^>]+>\s*=\s*(?P<json>{.+?})\s*</script''',
+            webpage, 'sigi data', default='{}', group='json'), video_id)
+        data = try_get(page_props, lambda x: x['ItemModule'][video_id]['video'], dict)
+        if data:
+            data = page_props['ItemModule'][video_id]
+            if data.get('privateItem'):
+                raise ExtractorError('This video is private', expected=True)
+            return self._extract_video(data, video_id)
+        page_props = self._parse_json(self._search_regex(
             r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
             webpage, 'data'), video_id)['props']['pageProps']
         data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)

TechComet commented 2 years ago

Why not send the update to GitHub ytdl-org/youtube-dl?

TechComet commented 2 years ago

I tried these changes, but they don't work.

dirkf commented 2 years ago

> Why not send the update to GitHub ytdl-org/youtube-dl?

Actually, @wranai had already posted the PR linked above.

My revised patch:

--- old/youtube-dl/youtube_dl/extractor/tiktok.py
+++ new/youtube-dl/youtube_dl/extractor/tiktok.py
@@ -15,10 +15,11 @@

 class TikTokBaseIE(InfoExtractor):
     def _extract_video(self, data, video_id=None):
-        video = data['video']
-        description = str_or_none(try_get(data, lambda x: x['desc']))
-        width = int_or_none(try_get(data, lambda x: video['width']))
-        height = int_or_none(try_get(data, lambda x: video['height']))
+        video = try_get(data, lambda x: x['video'], dict)
+        if not video:
+            return
+        width = int_or_none(video.get('width'))
+        height = int_or_none(video.get('height'))

         format_urls = set()
         formats = []
@@ -43,30 +44,32 @@
         thumbnail = url_or_none(video.get('cover'))
         duration = float_or_none(video.get('duration'))

-        uploader = try_get(data, lambda x: x['author']['nickname'], compat_str)
-        uploader_id = try_get(data, lambda x: x['author']['id'], compat_str)
+        author = data.get('author')
+        if isinstance(author, dict):
+            uploader_id = author.get('id')
+        else:
+            uploader_id = data.get('authorId')
+            author = data
+        uploader = str_or_none(author.get('nickname'))

         timestamp = int_or_none(data.get('createTime'))

-        def stats(key):
-            return int_or_none(try_get(
-                data, lambda x: x['stats']['%sCount' % key]))
-
-        view_count = stats('play')
-        like_count = stats('digg')
-        comment_count = stats('comment')
-        repost_count = stats('share')
+        stats = try_get(data, lambda x: x['stats'], dict)
+        view_count, like_count, comment_count, repost_count = [
+            stats and int_or_none(stats.get('%sCount' % key))
+            for key in ('play', 'digg', 'comment', 'share', )]

         aweme_id = data.get('id') or video_id

         return {
             'id': aweme_id,
+            'display_id': video_id,
             'title': uploader or aweme_id,
-            'description': description,
+            'description': str_or_none(data.get('desc')),
             'thumbnail': thumbnail,
             'duration': duration,
             'uploader': uploader,
-            'uploader_id': uploader_id,
+            'uploader_id': str_or_none(uploader_id),
             'timestamp': timestamp,
             'view_count': view_count,
             'like_count': like_count,
@@ -84,11 +87,11 @@
         'info_dict': {
             'id': '6606727368545406213',
             'ext': 'mp4',
-            'title': 'Zureeal',
+            'title': 'md5:24acc456b62b938a7e2dd88e978b20d9',
             'description': '#bowsette#mario#cosplay#uk#lgbt#gaming#asian#bowsettecosplay',
             'thumbnail': r're:^https?://.*',
             'duration': 15,
-            'uploader': 'Zureeal',
+            'uploader': 'md5:24acc456b62b938a7e2dd88e978b20d9',
             'uploader_id': '188294915489964032',
             'timestamp': 1538248586,
             'upload_date': '20180929',
@@ -108,8 +111,17 @@
         video_id = self._match_id(url)
         webpage = self._download_webpage(url, video_id)
         page_props = self._parse_json(self._search_regex(
-            r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
-            webpage, 'data'), video_id)['props']['pageProps']
+            r'''(?s)<script\s[^>]*?\bid\s*=\s*(?P<q>"|'|\b)sigi-persisted-data(?P=q)[^>]*>[^=]*=\s*(?P<json>{.+?})\s*(?:;[^<]+)?</script''',
+            webpage, 'sigi data', default='{}', group='json'), video_id)
+        data = try_get(page_props, lambda x: x['ItemModule'][video_id]['video'], dict)
+        if data:
+            data = page_props['ItemModule'][video_id]
+            if data.get('privateItem'):
+                raise ExtractorError('This video is private', expected=True)
+            return self._extract_video(data, video_id)
+        page_props = self._parse_json(self._search_regex(
+            r'''(?s)<script\s[^>]*?\bid\s*=\s*(?P<q>"|'|\b)__NEXT_DATA__(?P=q)[^>]*>\s*(?P<json>{.+?})\s*</script''',
+            webpage, 'data', group='json'), video_id)['props']['pageProps']
         data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
         if not data and page_props.get('statusCode') == 10216:
             raise ExtractorError('This video is private', expected=True)

Apparently some fetch attempts get a page that contains no media information: neither the __NEXT_DATA__ nor the sigi-persisted-data. You just have to try again.
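
Since some responses carry neither blob, a caller can simply retry. A minimal sketch of that idea, assuming hypothetical fetch_page and parse_item callables supplied by the caller (neither is a youtube-dl API, and the attempt count and delay are arbitrary):

```python
import time


def fetch_until_parsable(fetch_page, parse_item, attempts=3, delay=1.0):
    """Call fetch_page() until parse_item(page) returns something truthy.

    Returns the parsed item, or None if every attempt came back empty.
    """
    for attempt in range(1, attempts + 1):
        page = fetch_page()
        item = parse_item(page)
        if item:
            return item
        if attempt < attempts:
            # Brief pause before asking the site again.
            time.sleep(delay)
    return None
```

Here parse_item would be whatever looks for the __NEXT_DATA__ or sigi-persisted-data JSON and returns a falsy value when neither is present.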

dirkf commented 2 years ago

A couple of tests support [report that all attempts were showing __NEXT_DATA__]. Perhaps TikTok was trying out the two different toolkits and has settled on Next.js. No-one is reporting issues with yt-dlp which uses __NEXT_DATA__, though that extractor downloads the page twice, commenting that videos may get 403 otherwise.

dirkf commented 2 years ago

See https://medium.com/@szdc/reverse-engineering-the-musical-ly-api-662331008eb3, for example.

[For younger readers like myself, musical.ly is what became tiktok]

someziggyman commented 2 years ago

Hey guys,

Sorry to bother you with this, but I just can't seem to make your patch work... I had no issues with your YouTube throttling patch; it works great and I really appreciate your work! With this one, I'm just lacking the knowledge or scripting skill. I'm always getting:

[TikTok] Setting up session
ERROR: Unable to download webpage: [Errno 60] Operation timed out (caused by error(60, 'Operation timed out')); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

@dirkf Is there a way to just share a working tiktok.py or update this PR here? https://github.com/ytdl-org/youtube-dl/pull/30224

My apologies for the confusion and incompetence.

dirkf commented 2 years ago

That looks like a networking issue separate from any tiktok extractor problems.

We could do with a verbose format extraction log (-v -F) of the unmodified 2021.12.17 version (it shouldn't matter if you've updated extractor/youtube.py), and then the same with the extractor/tiktok.py from the PR. I think the PR should fix the same issue that my patch does, though there may be edge cases.

someziggyman commented 2 years ago

Well, this is definitely not a network problem: other websites work. The PR does not include a single line of your patch, and it does not work. Here's the log:

./youtube-dl -v -F https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 2.7.16 (CPython) - Darwin-20.6.0-arm64-arm-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[TikTok] Setting up session
ERROR: Unable to download webpage: [Errno 60] Operation timed out (caused by error(60, 'Operation timed out')); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
  File "./youtube-dl/youtube_dl/extractor/common.py", line 634, in _request_webpage
    return self._downloader.urlopen(url_or_request)
  File "./youtube-dl/youtube_dl/YoutubeDL.py", line 2288, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "./youtube-dl/youtube_dl/utils.py", line 1207, in https_open
    req, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1134, in getresponse
    response.begin()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 442, in begin
    version, status, reason = self._read_status()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 398, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 754, in recv
    return self.read(buflen)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/ssl.py", line 641, in read
    v = self._sslobj.read(len)

dirkf commented 2 years ago

PR #30224 does include the necessary extra code for SIGI-type pages, but unfortunately, if understandably, the author didn't take up my suggestion to combine the additional code from my patch.

The timeout issue (Errno 60 on macOS, as in your log; SSLError('The read operation timed out',) on Linux) is some weird blocking done by whatever fronts TikTok's pages (Akamai, apparently). In order to download the page for parsing, some cookie has to be sent, and one way to get it is to make a previous request to the site. In yt-dl, and also with my patch and with PR #30224, the extractor fetches https://www.tiktok.com/ before doing anything else. In yt-dlp, which works with your URL using the latest Git source, the code fetches the webpage itself twice, commenting that you get 403 otherwise.

A small change to the PR #30224 code improves on the yt-dlp approach: instead of fetching the whole page (GET request), just send a HEAD request; if a page is actually returned, rather than an error with a Set-Cookie header, it doesn't actually have to be downloaded.
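
Outside youtube-dl, the HEAD-before-GET idea can be sketched with the standard library alone. This is an illustration under assumptions, not the extractor's code: make_cookie_opener and download_with_head_first are hypothetical names, the timeout is arbitrary, and whether a given site sets the needed cookie on a HEAD response is not guaranteed.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_opener():
    """Build an opener that remembers cookies between requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def download_with_head_first(opener, url, timeout=20):
    """Send a HEAD request to collect cookies, then GET the page."""
    head = urllib.request.Request(url, method='HEAD')
    try:
        opener.open(head, timeout=timeout).close()
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        # A failed HEAD is not treated as fatal here (compare fatal=False
        # in the patch): the GET is attempted regardless.
        pass
    with opener.open(urllib.request.Request(url), timeout=timeout) as resp:
        return resp.read()
```

A caller would create the opener once with make_cookie_opener() and reuse it, so that any cookie stored by the HEAD exchange is sent with the GET.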

--- old/youtube-dl/youtube_dl/extractor/tiktok.py
+++ new/youtube-dl/youtube_dl/extractor/tiktok.py
@@ -6,6 +6,7 @@
     compat_str,
     ExtractorError,
     float_or_none,
+    HEADRequest,
     int_or_none,
     str_or_none,
     try_get,
@@ -99,18 +100,27 @@
         }
     }]

-    def _real_initialize(self):
-        # Setup session (will set necessary cookies)
-        self._request_webpage(
-            'https://www.tiktok.com/', None, note='Setting up session')
-
     def _real_extract(self, url):
         video_id = self._match_id(url)
+
+        # dummy request to set cookies
+        self._request_webpage(
+            HEADRequest(url), video_id,
+            note=False, errnote='Could not send HEAD request to %s' % url,
+            fatal=False)
         webpage = self._download_webpage(url, video_id)
-        page_props = self._parse_json(self._search_regex(
-            r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
-            webpage, 'data'), video_id)['props']['pageProps']
-        data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
+        try:
+            page_props = self._parse_json(self._search_regex(
+                r'<script[^>]+\bid=["\']__NEXT_DATA__[^>]+>\s*({.+?})\s*</script',
+                webpage, 'data'), video_id)['props']['pageProps']
+            data = try_get(page_props, lambda x: x['itemInfo']['itemStruct'], dict)
+        except:
+            page_props = self._parse_json(self._search_regex(
+                r'<script[^>]+\bid=["\']sigi-persisted-data[^>]+>window\[\'SIGI_STATE\']=({.+?});window\[',
+                webpage, 'data'), video_id)
+            data = try_get(page_props, lambda x: x['ItemModule'][video_id], dict)
+            author = try_get(page_props, lambda x: x['UserModule']['users'][data['author']], dict)
+            data['author'] = author
         if not data and page_props.get('statusCode') == 10216:
             raise ExtractorError('This video is private', expected=True)
         return self._extract_video(data, video_id)

Then:

$ python -m youtube_dl -v -F 'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.06.06
[debug] Git HEAD: bd7d796ef
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[TikTok] 6987753546036923653: Downloading webpage
[info] Available formats for 6987753546036923653:
format code  extension  resolution note
0            mp4        576x1024   
df@Spiridion:~/Documents/src/youtube-dl$ python -m youtube_dl -v -F 'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://www.tiktok.com/@manoloteachesgolf/video/6987753546036923653']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: bd7d796ef
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[TikTok] 6987753546036923653: Downloading webpage
[info] Available formats for 6987753546036923653:
format code  extension  resolution note
0            mp4        576x1024   
$

I guess we need a new PR.

someziggyman commented 2 years ago

Wow! Thank you so much for such a detailed reply! I implemented this patch and it works for me! I appreciate your help and patience!

dirkf commented 2 years ago

PR #30479 linked above should address all current issues, with this version of the extractor. Please report any issues there.

hessijames79 commented 2 years ago

Looks good so far and just downloaded a small batch of videos that didn't work before.

Menard01 commented 2 years ago

To download from TikTok, you only have to right-click on the video and choose "Download the video".

dirkf commented 2 years ago

Apparently we now have to pretend to be a mobile phone (on Windows, replace the single quotes '...' with double quotes "..."): --user-agent 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1'
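
The quoting difference between shells can be sidestepped by building the argument list in a script. A small sketch, assuming only the --user-agent option and User-Agent string quoted above; build_args, posix_command_line and windows_command_line are hypothetical helper names, and subprocess.list2cmdline is an undocumented standard-library helper used here only to display Windows-style quoting.

```python
import shlex
import subprocess

MOBILE_UA = (
    'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) '
    'AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 '
    'Mobile/14E5239e Safari/602.1')


def build_args(url, user_agent=MOBILE_UA):
    """Argument list for youtube-dl; no shell quoting is involved."""
    return ['youtube-dl', '--user-agent', user_agent, url]


def posix_command_line(args):
    """Render the arguments as they would be typed in a POSIX shell."""
    return ' '.join(shlex.quote(arg) for arg in args)


def windows_command_line(args):
    """Render the arguments with Windows-style double quotes."""
    return subprocess.list2cmdline(args)
```

Passing the list straight to subprocess.run(build_args(url)) runs the command without a shell, so the User-Agent string needs no quoting at all.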

unitof commented 1 year ago

Passing --user-agent 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1' as suggested by @dirkf got me further, but I soon encountered a different error:

[TikTok] Setting up session
[TikTok] 7193925570105806126: Downloading webpage
ERROR: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
With --verbose:
youtube-dl https://www.tiktok.com/@special_head/video/7193925570105806126/ --user-agent 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1' --verbose
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['https://www.tiktok.com/@special_head/video/7193925570105806126/', '--user-agent', 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1', '--verbose']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 4b3d64d30
[debug] Python version 3.11.1 (CPython) - macOS-13.2-arm64-arm-64bit
[debug] exe versions: ffmpeg 5.1.2, ffprobe 5.1.2, rtmpdump 2.4
[debug] Proxy map: {}
[TikTok] Setting up session
[TikTok] 7193925570105806126: Downloading webpage
ERROR: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
                ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/tiktok.py", line 110, in _real_extract
    page_props = self._parse_json(self._search_regex(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

If helpful, yt-dlp seems to have some solution worked out, no additional flags needed.