ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Twitter got an .onion (TOR) address, ytdl does not accept it #30736

Open legolegs opened 2 years ago

legolegs commented 2 years ago

Description

Twitter recently got an .onion (Tor) address. See https://help.twitter.com/en/using-twitter/twitter-supported-browsers

The address is https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/ (yes, the domain name is that long). It even has a proper SSL certificate.

Example video: https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/natchan1984/status/1502504750054461442

Example command line and output (you need Tor Browser or a standalone tor daemon running):

$ youtube-dl --proxy socks5://127.0.0.1:9150/ 'https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/natchan1984/status/1502504750054461442'
[generic] 1502504750054461442: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 1502504750054461442: Downloading webpage
[generic] 1502504750054461442: Extracting information
ERROR: Unsupported URL: https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/natchan1984/status/1502504750054461442

I think the fix should be applied somewhere around this line: https://github.com/ytdl-org/youtube-dl/blob/6508688e88c83bb811653083db9351702cd39a6a/youtube_dl/extractor/twitter.py#L34

I thought there might be a way to force ytdl to use a specific extractor, something like --dont-look-at-url-i-promise-it-is-really "twitter", but I found no such option. Anyway, from now on "twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion" is one of the genuine Twitter domain names.
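To illustrate why the URL falls through to the generic extractor, here is a minimal standalone sketch. It assumes the pattern is the `_BASE_REGEX` that the patch further down in this thread replaces; the helper name `matches_twitter` and the constant names are made up for this example and are not part of youtube-dl.

```python
import re

# Pattern copied from TwitterBaseIE._BASE_REGEX as it appears before the patch.
ORIGINAL_BASE_REGEX = r'https?://(?:(?:www|m(?:obile)?)\.)?twitter\.com/'

ONION_HOST = 'twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion'
STATUS_PATH = 'natchan1984/status/1502504750054461442'

clearnet_url = 'https://twitter.com/' + STATUS_PATH
onion_url = 'https://%s/%s' % (ONION_HOST, STATUS_PATH)


def matches_twitter(url):
    """Hypothetical helper: True if the URL starts with the original base pattern."""
    return re.match(ORIGINAL_BASE_REGEX, url) is not None


print(matches_twitter(clearnet_url))  # True
print(matches_twitter(onion_url))     # False
```

Since the host part of the pattern only allows `twitter.com`, the onion URL is never claimed by the Twitter extractor, which is consistent with the `[generic]` lines in the output above.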

dirkf commented 2 years ago

The Twitter extractor doesn't have a twitter:(\d+) URL pattern that would enable that, but it wouldn't help here since the objective is to route any access to Twitter resources via Tor.

See https://github.com/yt-dlp/yt-dlp/issues/3053.

I installed tor and torsocks and tested with the patch below.

--- old/youtube-dl/youtube_dl/extractor/twitter.py
+++ new/youtube-dl/youtube_dl/extractor/twitter.py
@@ -9,6 +9,7 @@
     compat_parse_qs,
     compat_urllib_parse_unquote,
     compat_urllib_parse_urlparse,
+    compat_urlparse,
 )
 from ..utils import (
     dict_get,
@@ -30,8 +31,11 @@

 class TwitterBaseIE(InfoExtractor):
-    _API_BASE = 'https://api.twitter.com/1.1/'
-    _BASE_REGEX = r'https?://(?:(?:www|m(?:obile)?)\.)?twitter\.com/'
+    _API_BASE_TMPL = 'https://api.%s/1.1/'
+    _TOR_BASE = 'twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion'
+    _PLAIN_API_BASE = _API_BASE_TMPL % ('twitter.com', )
+    _TOR_API_BASE = _API_BASE_TMPL % (_TOR_BASE, )
+    _BASE_REGEX = r'https?://(?:(?:www|m(?:obile)?)\.)?(?:twitter\.com|%s)/' % (re.escape(_TOR_BASE), )
     _GUEST_TOKEN = None

     def _extract_variant_formats(self, variant, video_id):
@@ -98,6 +102,13 @@
                     e.cause.read().decode(),
                     video_id)['errors'][0]['message'], expected=True)
             raise
+
+    # Derived classes should call this super if the API is to be used
+    def _real_extract(self, url):
+        self._API_BASE = (
+            self._TOR_API_BASE
+            if compat_urlparse.urlparse(url).hostname.endswith('.onion')
+            else self._PLAIN_API_BASE)

 class TwitterCardIE(InfoExtractor):
@@ -427,9 +438,14 @@
         # poll4choice_video card
         'url': 'https://twitter.com/SouthamptonFC/status/1347577658079641604',
         'only_matching': True,
+    }, {
+        # Tor site
+        'url': 'https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/SouthamptonFC/status/1347577658079641604',
+        'only_matching': True,
     }]

     def _real_extract(self, url):
+        super(TwitterIE, self)._real_extract(url)
         twid = self._match_id(url)
         status = self._call_api(
             'statuses/show/%s.json' % twid, twid, {
@@ -650,6 +666,7 @@
     }

     def _real_extract(self, url):
+        super(TwitterBroadcastIE, self)._real_extract(url)
         broadcast_id = self._match_id(url)
         broadcast = self._call_api(
             'broadcasts/show.json', broadcast_id,
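As a standalone illustration (not part of the patch), the host matching and API base selection can be tried outside youtube-dl with a sketch like the one below. It assumes Python 3 and uses `urllib.parse.urlparse` in place of `compat_urlparse`; the function name `api_base_for` and the `or ''` guard for a missing hostname are additions made for this example only.

```python
import re
from urllib.parse import urlparse  # Python 3 stand-in for compat_urlparse

TOR_BASE = 'twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion'
API_BASE_TMPL = 'https://api.%s/1.1/'

# Same shape as the patched _BASE_REGEX: accept twitter.com or the onion host.
BASE_REGEX = (
    r'https?://(?:(?:www|m(?:obile)?)\.)?(?:twitter\.com|%s)/'
    % (re.escape(TOR_BASE), ))


def api_base_for(url):
    """Hypothetical helper: pick the API base the way the patched _real_extract does."""
    hostname = urlparse(url).hostname or ''  # guard is an addition, not in the patch
    host = TOR_BASE if hostname.endswith('.onion') else 'twitter.com'
    return API_BASE_TMPL % (host, )


onion_url = 'https://%s/SouthamptonFC/status/1347577658079641604' % TOR_BASE
plain_url = 'https://twitter.com/SouthamptonFC/status/1347577658079641604'

for url in (plain_url, onion_url):
    print(bool(re.match(BASE_REGEX, url)), api_base_for(url))
```

Both URLs match the widened pattern, and only the onion URL selects the onion API base, so API calls for a Tor URL stay on the onion host.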

Result (even after disabling the plain API URL):

$ torsocks python -m youtube_dl -v -F 'https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/freethenipple/status/643211948184596480'
[debug] System config: [u'--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'-F', u'https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/freethenipple/status/643211948184596480']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: a631e79b3
[debug] Python version 2.7.17 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[twitter] 643211948184596480: Downloading guest token
[twitter] 643211948184596480: Downloading JSON metadata
[twitter] 643211948184596480: Downloading m3u8 information
[info] Available formats for 643211948184596480:
format code  extension  resolution note
hls-320      mp4        240x240     320k , avc1.420015, mp4a.40.2
http-320     mp4        240x240     320k 
hls-832      mp4        480x480     832k , avc1.42001f, mp4a.40.2
http-832     mp4        480x480     832k  (best)
$

The media link found this way:

https://video.twitterhbmit57bzbcjnujedrn7uk73geo4ackio4lxdj6t7w6f4zsid.onion/ext_tw_video/643211870443208704/pu/vid/480x480/2a49dLeT5eSHhMhe.mp4