yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader
https://discord.gg/H5MNcFW63r
The Unlicense

Can't download from sohu #7463

Closed Thalia500 closed 1 year ago

Thalia500 commented 1 year ago

Region

China

Provide a description that is worded well enough to be understood

Can't download videos from Sohu. Example URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

yt-dlp -vU -F "https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html"
[debug] Command-line config: ['-vU', '-F', 'https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html']
[debug] Encodings: locale cp936, fs utf-8, pref cp936, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.06.22 [812cdfa06] (pip)
[debug] Lazy loading extractors is disabled
[debug] Python 3.10.9 (CPython AMD64 64bit) - Windows-10-10.0.22621-SP0 (OpenSSL 1.1.1t  7 Feb 2023)
[debug] exe versions: ffmpeg N-110972-gbaa9fccf8d-20230601 (setts), ffprobe N-110972-gbaa9fccf8d-20230601, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.18.0, brotli-None, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {'http': 'http://127.0.0.1:1080', 'https': 'http://127.0.0.1:1080', 'ftp': 'http://127.0.0.1:1080'}
[debug] Loaded 1851 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.06.22, Current version: stable@2023.06.22
yt-dlp is up to date (stable@2023.06.22)
[generic] Extracting URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Extracting information
[debug] Looking for embeds
ERROR: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
Traceback (most recent call last):
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1555, in wrapper
    return func(self, *args, **kwargs)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1631, in __extract_info
    ie_result = ie.extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\common.py", line 708, in extract
    ie_result = self._real_extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\generic.py", line 2568, in _real_extract
    raise UnsupportedError(url)
yt_dlp.utils.UnsupportedError: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html

bashonly commented 1 year ago

Looks like that's a new type of URL. We can base64-decode the html basename...

>>> import base64
>>> base64.urlsafe_b64decode('MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==').decode()
'20230614/n601315192.shtml'

...and get a regular Sohu url path.

A patch like this should handle it:

diff --git a/yt_dlp/extractor/sohu.py b/yt_dlp/extractor/sohu.py
index a8f1e4623..5f0fb6192 100644
--- a/yt_dlp/extractor/sohu.py
+++ b/yt_dlp/extractor/sohu.py
@@ -1,3 +1,4 @@
+import base64
 import re

 from .common import InfoExtractor
@@ -9,6 +10,7 @@
     ExtractorError,
     int_or_none,
     try_get,
+    urljoin,
 )

@@ -196,3 +198,12 @@ def _fetch_data(vid_id, mytv=False):
             }

         return info
+
+
+class SohuVIE(InfoExtractor):
+    _VALID_URL = r'(?P<base>https?://tv\.sohu\.com/)v/(?P<id>[\w=-]+)\.html(?:$|[#?])'
+
+    def _real_extract(self, url):
+        base_url, encoded_id = self._match_valid_url(url).group('base', 'id')
+        path = base64.urlsafe_b64decode(encoded_id).decode()
+        return self.url_result(urljoin(base_url, path), SohuIE)

I'm geo-blocked, but it should work for someone who's not:

$ yt-dlp -F "https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html"
[SohuV] Extracting URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
[Sohu] Extracting URL: https://tv.sohu.com/20230614/n601315192.shtml
[Sohu] 601315192: Downloading webpage
[Sohu] 601315192: Downloading JSON data for 8484094
ERROR: [Sohu] 601315192: Sohu said: The video is only licensed to users in Mainland China.
You might want to use a VPN or a proxy server (with --proxy) to workaround.

I think someone just needs to find out whether my.tv.sohu.com links also come in this form; if so, we could match them too.
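For anyone who wants to check the URL mapping without patching yt-dlp, here is a minimal standalone sketch of the same transformation. It reuses the `_VALID_URL` pattern from the patch; the function name `decode_sohu_v_url` is hypothetical, and the standard library's `urllib.parse.urljoin` stands in for the `urljoin` helper from `yt_dlp.utils`, on the assumption that the two behave the same for this input.

```python
import base64
import re
from urllib.parse import urljoin

# Same pattern as _VALID_URL in the patch above.
V_URL_RE = re.compile(
    r'(?P<base>https?://tv\.sohu\.com/)v/(?P<id>[\w=-]+)\.html(?:$|[#?])')


def decode_sohu_v_url(url):
    """Map a tv.sohu.com/v/<base64>.html URL to the regular Sohu URL.

    Hypothetical helper for illustration; returns None if the URL
    does not match the /v/ pattern.
    """
    m = V_URL_RE.match(url)
    if m is None:
        return None
    base_url, encoded_id = m.group('base', 'id')
    # The HTML basename is the URL-safe base64 encoding of the real path.
    path = base64.urlsafe_b64decode(encoded_id).decode()
    return urljoin(base_url, path)


print(decode_sohu_v_url(
    'https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html'))
# https://tv.sohu.com/20230614/n601315192.shtml
```

The printed URL is the one the existing Sohu extractor is handed in the log above. Since the pattern is pinned to the tv.sohu.com host, a my.tv.sohu.com link would not match it as written.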

Thalia500 commented 1 year ago

Thank you! But it seems like it has other issues.

yt-dlp -vU -F "https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html"
[debug] Command-line config: ['-vU', '-F', 'https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html']
[debug] Encodings: locale cp936, fs utf-8, pref cp936, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.06.22 [812cdfa06] (pip)
[debug] Lazy loading extractors is disabled
[debug] Python 3.10.9 (CPython AMD64 64bit) - Windows-10-10.0.22621-SP0 (OpenSSL 1.1.1t  7 Feb 2023)
[debug] exe versions: ffmpeg N-110972-gbaa9fccf8d-20230601 (setts), ffprobe N-110972-gbaa9fccf8d-20230601, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.18.0, brotli-None, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1851 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.06.22, Current version: stable@2023.06.22
yt-dlp is up to date (stable@2023.06.22)
[generic] Extracting URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Extracting information
[debug] Looking for embeds
ERROR: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
Traceback (most recent call last):
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1555, in wrapper
    return func(self, *args, **kwargs)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1631, in __extract_info
    ie_result = ie.extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\common.py", line 708, in extract
    ie_result = self._real_extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\generic.py", line 2568, in _real_extract
    raise UnsupportedError(url)
yt_dlp.utils.UnsupportedError: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html

c-basalt commented 1 year ago

@Thalia500 I've applied the patch by bashonly to this branch (new IE class needs to be imported in _extractors.py). You can take a look at this: https://github.com/c-basalt/yt-dlp/tree/sohu-fix

I checked some old my.tv.sohu.com links from the test cases and they are redirected as well, though the redirect target is also on tv.sohu.com. ~The Multipart video link in the test case appears to be broken. Might want to fix that before starting a PR.~ The multipart video URL is now fixed; it just needs some testing and feedback.

bashonly commented 1 year ago

Duplicate of #1667 (thanks for finding that @c-basalt)