Can't download from sohu

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

[X] I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

[X] I'm reporting that yt-dlp is broken on a supported site
[X] I've verified that I'm running yt-dlp version 2023.06.22 (update instructions) or later (specify commit)
[X] I've checked that all provided URLs are playable in a browser with the same IP and same login details
[X] I've checked that all URLs and arguments with special characters are properly quoted or escaped
[X] I've searched known issues and the bugtracker for similar issues including closed ones. DO NOT post duplicates
[X] I've read the guidelines for opening an issue
[X] I've read about sharing account credentials and I'm willing to share it if required

Region

China

Provide a description that is worded well enough to be understood

yt-dlp can't download videos from Sohu. Example URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html

Provide verbose output that clearly demonstrates the problem

[X] Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
[ ] If using API, add 'verbose': True to YoutubeDL params instead
[X] Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

yt-dlp -vU -F "https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html"
[debug] Command-line config: ['-vU', '-F', 'https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html']
[debug] Encodings: locale cp936, fs utf-8, pref cp936, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.06.22 [812cdfa06] (pip)
[debug] Lazy loading extractors is disabled
[debug] Python 3.10.9 (CPython AMD64 64bit) - Windows-10-10.0.22621-SP0 (OpenSSL 1.1.1t  7 Feb 2023)
[debug] exe versions: ffmpeg N-110972-gbaa9fccf8d-20230601 (setts), ffprobe N-110972-gbaa9fccf8d-20230601, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.18.0, brotli-None, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {'http': 'http://127.0.0.1:1080', 'https': 'http://127.0.0.1:1080', 'ftp': 'http://127.0.0.1:1080'}
[debug] Loaded 1851 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.06.22, Current version: stable@2023.06.22
yt-dlp is up to date (stable@2023.06.22)
[generic] Extracting URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Extracting information
[debug] Looking for embeds
ERROR: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
Traceback (most recent call last):
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1555, in wrapper
    return func(self, *args, **kwargs)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1631, in __extract_info
    ie_result = ie.extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\common.py", line 708, in extract
    ie_result = self._real_extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\generic.py", line 2568, in _real_extract
    raise UnsupportedError(url)
yt_dlp.utils.UnsupportedError: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html

Looks like that's a new type of URL. We can base64-decode the html basename...

>>> import base64
>>> base64.urlsafe_b64decode('MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==').decode()
'20230614/n601315192.shtml'

...and get a regular Sohu URL path.
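To try the decoding on other links of this shape, a small standalone helper could look like the sketch below. The function name `decode_sohu_basename` is made up for illustration, and the re-padding step is an assumption: the URL above carries its `==` padding, and I don't know whether Sohu ever emits IDs without it.

```python
import base64
import binascii


def decode_sohu_basename(encoded_id):
    """Decode the base64 basename of a /v/<id>.html URL into a Sohu URL path.

    Returns None if decoding fails (bad padding or non-UTF-8 bytes).
    """
    # Re-add any '=' padding that may have been stripped. This is an
    # assumption for robustness; it is a no-op when padding is present.
    padded = encoded_id + '=' * (-len(encoded_id) % 4)
    try:
        return base64.urlsafe_b64decode(padded).decode()
    except (binascii.Error, UnicodeDecodeError):
        return None


print(decode_sohu_basename('MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA=='))
# 20230614/n601315192.shtml
```

Returning None instead of raising keeps the helper easy to use when probing a batch of links, at the cost of hiding why a given ID failed.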

A patch like this should handle it:

diff --git a/yt_dlp/extractor/sohu.py b/yt_dlp/extractor/sohu.py
index a8f1e4623..5f0fb6192 100644
--- a/yt_dlp/extractor/sohu.py
+++ b/yt_dlp/extractor/sohu.py
@@ -1,3 +1,4 @@
+import base64
 import re

 from .common import InfoExtractor
@@ -9,6 +10,7 @@
     ExtractorError,
     int_or_none,
     try_get,
+    urljoin,
 )

@@ -196,3 +198,12 @@ def _fetch_data(vid_id, mytv=False):
             }

         return info
+
+
+class SohuVIE(InfoExtractor):
+    _VALID_URL = r'(?P<base>https?://tv\.sohu\.com/)v/(?P<id>[\w=-]+)\.html(?:$|[#?])'
+
+    def _real_extract(self, url):
+        base_url, encoded_id = self._match_valid_url(url).group('base', 'id')
+        path = base64.urlsafe_b64decode(encoded_id).decode()
+        return self.url_result(urljoin(base_url, path), SohuIE)

I'm geo-blocked, but it should work for someone who's not:

$ yt-dlp -F "https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html"
[SohuV] Extracting URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
[Sohu] Extracting URL: https://tv.sohu.com/20230614/n601315192.shtml
[Sohu] 601315192: Downloading webpage
[Sohu] 601315192: Downloading JSON data for 8484094
ERROR: [Sohu] 601315192: Sohu said: The video is only licensed to users in Mainland China.
You might want to use a VPN or a proxy server (with --proxy) to workaround.

I think someone just needs to find out whether there are my.tv.sohu.com links like this too; if so, we could match them as well.
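If such my.tv.sohu.com links do turn up, widening the pattern might be enough. The sketch below runs outside yt-dlp: it uses the standard library's `urljoin` rather than yt-dlp's own, and the optional `my.` in the pattern is purely an assumption, since no link of that shape has been confirmed. `resolve_sohu_v_url` is a hypothetical name.

```python
import base64
import re
from urllib.parse import urljoin

# Hypothetical widening of the patch's _VALID_URL: the optional "my." part
# is an assumption, no my.tv.sohu.com link of this shape is confirmed.
VALID_URL = re.compile(
    r'(?P<base>https?://(?:my\.)?tv\.sohu\.com/)v/(?P<id>[\w=-]+)\.html(?:$|[#?])')


def resolve_sohu_v_url(url):
    """Return the regular Sohu URL behind a /v/<base64>.html link, or None."""
    mobj = VALID_URL.match(url)
    if not mobj:
        return None
    base_url, encoded_id = mobj.group('base', 'id')
    # Same decoding step as in the patch: the basename is URL-safe base64.
    path = base64.urlsafe_b64decode(encoded_id).decode()
    return urljoin(base_url, path)


print(resolve_sohu_v_url(
    'https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html'))
# https://tv.sohu.com/20230614/n601315192.shtml
```

Already-decoded URLs such as https://tv.sohu.com/20230614/n601315192.shtml don't match the pattern and return None, so they would still go to the existing extractor.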

Thank you!!!! But it seems like it has other issues.

yt-dlp -vU -F "https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html"
[debug] Command-line config: ['-vU', '-F', 'https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html']
[debug] Encodings: locale cp936, fs utf-8, pref cp936, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.06.22 [812cdfa06] (pip)
[debug] Lazy loading extractors is disabled
[debug] Python 3.10.9 (CPython AMD64 64bit) - Windows-10-10.0.22621-SP0 (OpenSSL 1.1.1t 7 Feb 2023)
[debug] exe versions: ffmpeg N-110972-gbaa9fccf8d-20230601 (setts), ffprobe N-110972-gbaa9fccf8d-20230601, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.18.0, brotli-None, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1851 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.06.22, Current version: stable@2023.06.22
yt-dlp is up to date (stable@2023.06.22)
[generic] Extracting URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Downloading webpage
WARNING: [generic] Falling back on generic information extractor
[generic] MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==: Extracting information
[debug] Looking for embeds
ERROR: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html
Traceback (most recent call last):
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1555, in wrapper
    return func(self, *args, **kwargs)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\YoutubeDL.py", line 1631, in __extract_info
    ie_result = ie.extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\common.py", line 708, in extract
    ie_result = self._real_extract(url)
  File "C:\python_conda_package\Anaconda3\lib\site-packages\yt_dlp\extractor\generic.py", line 2568, in _real_extract
    raise UnsupportedError(url)
yt_dlp.utils.UnsupportedError: Unsupported URL: https://tv.sohu.com/v/MjAyMzA2MTQvbjYwMTMxNTE5Mi5zaHRtbA==.html

@Thalia500 I've applied the patch by bashonly to this branch (new IE class needs to be imported in _extractors.py). You can take a look at this: https://github.com/c-basalt/yt-dlp/tree/sohu-fix

I checked some old my.tv.sohu.com links in test cases and they are redirected as well, though the redirected domain is also tv.sohu.com. ~The Multipart video link in the test case appears to be broken. Might want to fix that before starting a PR.~ Multipart video URL is now fixed, just need some testing and feedback.

Duplicate of #1667 (thanks for finding that @c-basalt)
