Open v-owendeng opened 2 years ago
In case it's useful information, here is another example URL not working: https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy
log:
yt-dlp https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy --ignore-config -F -vU
[debug] Command-line config: ['https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy', '--ignore-config', '-F', '-vU']
[debug] Encodings: locale cp1252, fs utf-8, pref cp1252, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.19045-SP0 (OpenSSL 1.1.1k 25 Mar 2021)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.17, brotli-1.0.9, certifi-2022.12.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.03.04, Current version: stable@2023.03.04
Current Build Hash: 5590c57bd0433ed239a2deaaf92e2ad6f37fe50f53664c821575cafe106a9421
yt-dlp is up to date (stable@2023.03.04)
[MSN] Extracting URL: https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy
[MSN] powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom: Downloading webpage
ERROR: [MSN] AA19NdFy: Unable to extract error; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
File "yt_dlp\extractor\common.py", line 694, in extract
File "yt_dlp\extractor\msn.py", line 162, in _real_extract
File "yt_dlp\extractor\common.py", line 1242, in _search_regex
I'm just trying to clarify what's going on for anyone else like me who gets confused. And I'm embarrassed to note that it took me lots of clicking around before I actually noticed this rather prominent WARNING when I used the tool for an MSN site, despite running it a second time after upgrading yt-dlp:
WARNING: The program functionality for this site has been marked as broken, and will probably not work.
Why? I only looked at the tail end of what the tool printed. The warning comes as the first line of output.
I also note that some of the issues listed as a "duplicate" of this one contain information that I don't find here. E.g. this from dirkf:
The non-JS page has no useful content. New extraction tactics are needed.
Many thanks to all those who have helped make yt-dlp an amazing program trying to handle a tricky set of problems. And here's hoping that some new extraction tactics come online.
I note that phantomjs is listed with the note "phantomjs - Used in extractors where javascript needs to be run."
I also note that phantomjs was archived in 2023.
If there is a handy list to point to for those who like to learn and try some new extraction techniques, a link would be welcome.
@nealmcb PhantomJS is only used by 4 extractors, and it's only used as a last resort when there is no other path to extraction except by actually executing the site's javascript. (And for the most widely used of those 4 extractors, there is an alternative available). PhantomJS is likely not relevant to this issue.
If you're writing an extractor and the information you need isn't found in the page source/html, then what you should try to do is to find out what the javascript is doing to get the required info and recreate that in the extractor code; this usually amounts to making an API request (or multiple). In your browser's dev tools you can monitor all network requests being made as the page/video loads. Start there; more often than not the XHR calls will tell you all you need to know and you won't even need to look at the javascript code itself.
That seems to apply here exactly.
In this case it seems like the URL pattern should be tweaked to extract locale as the xx-xx initial path component and then, instead of fetching the webpage, get the JSON from f'https://assets.msn.com/content/view/v2/Detail/{locale}/{page_id}'.
The page from https://github.com/yt-dlp/yt-dlp/issues/3225#issuecomment-1506285806 gets 410 Gone for this; MSN still serves a page full of clickbait and ads without telling you that it's not the page you expected.
A current URL in the same area (https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98) with locale en-us and page_id BB1oaoek yields valid JSON. In the JSON, there are these potentially useful values:
{
...,
'videoMetadata': {
'playTime': 209,
'closedCaptions': [
{
'locale': 'en-us',
'href': 'https://prod-video-cms-amp-microsoft-com.akamaized.net/tenant/amp/entityid/BB1oaoek?blobrefkey=closedcaptionen-us&$blob=1'
}
],
'externalVideoFiles': [
{
'url': 'https://prod-streaming-video-msn-com.akamaized.net/aec52b0b-227e-47ac-a315-5ae3db64dcda/386123b1-6d12-4474-bc0a-455ae9bdc511.mp4',
'contentType': 'video/mp4',
'fileSize': 136785082,
'format': '1001'
},
{
'url': 'https://prod-streaming-video-msn-com.akamaized.net/8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest',
'width': 1280,
'height': 720,
'format': '1004'
},
{
'url': 'https://prod-streaming-video-msn-com.akamaized.net/8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest(format=m3u8-aapl)',
'width': 1280,
'height': 720,
'format': '1006'
},
/* 5 more video files */
...,
]
},
...,
'id': 'BB1oaoek',
'name': '',
'source': 'msn',
'type': 'video',
...,
'createdDateTime': '2024-06-13T14:17:54Z',
'updatedDateTime': '2024-06-13T14:27:11Z',
'publishedDateTime': '2024-06-13T14:06:49Z',
...,
}
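To make the URL rewrite described above concrete, here is a minimal sketch of deriving the asset JSON URL from a page URL. The regular expression and the function name `build_asset_url` are assumptions made for illustration and are not taken from either extractor; only the `assets.msn.com` URL template comes from the comment above.

```python
import re

# Hypothetical pattern (an assumption, not the extractor's real _VALID_URL):
# the locale is taken as the initial xx-xx path component and the page id as
# the token after a two-letter prefix such as "vi-" in the last component.
_MSN_URL_RE = re.compile(
    r'https?://(?:www\.)?msn\.com/'
    r'(?P<locale>[a-z]{2}-[a-z]{2})/'
    r'(?:[^/?#]+/)*'
    r'[a-z]{2}-(?P<page_id>[\da-zA-Z]+)'
)


def build_asset_url(page_url):
    """Return the asset JSON URL for an MSN page URL, or None if it does not match."""
    m = _MSN_URL_RE.match(page_url)
    if not m:
        return None
    locale, page_id = m.group('locale'), m.group('page_id')
    return f'https://assets.msn.com/content/view/v2/Detail/{locale}/{page_id}'


if __name__ == '__main__':
    print(build_asset_url(
        'https://www.msn.com/en-us/weather/topstories/'
        'midwest-northeast-bracing-for-potentially-dangerous-'
        'long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect'))
```

The sketch only builds the URL; fetching and parsing the JSON is left out, since the response shape beyond the excerpt above is not known here.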
Then (using upstream code):
$ python3 -m youtube_dl -vF 'https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-vF', 'https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: fc03c3bb3
[debug] Python 3.11.2 (CPython i686 32bit) - Linux-6.1.0-20-686-pae-i686-with-glibc2.36 - OpenSSL 3.0.11 19 Sep 2023 - glibc 2.36
[debug] exe versions: ffmpeg 5.1.4-0, ffprobe 5.1.4-0
[debug] Proxy map: {}
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading page JSON
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading ISM manifest
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading m3u8 information
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading MPD manifest
[download] Downloading playlist: BB1oaoek
[MSN] playlist BB1oaoek: Collected 1 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[info] Available formats for BB1oaoek:
format code extension resolution note
hls-audio-aac_und_2_96021_2_1 mp4 audio only [und]
mss-aac_und_2_96021_2_1-96 isma audio only 96k , AACL (48000Hz)
dash-5_A_aac_und_2_96021_2_1_1 m4a audio only DASH audio 96k , m4a_dash container, mp4a.40.2 (48000Hz)
mss-656 ismv 640x360 656k , H264, video only
dash-1_V_video_5 mp4 640x360 DASH video 656k , mp4_dash container, avc1.64001E, video only
hls-784 mp4 640x360 784k , avc1.64001e, video only
mss-1008 ismv 640x360 1008k , H264, video only
dash-1_V_video_4 mp4 640x360 DASH video 1008k , mp4_dash container, avc1.64001E, video only
hls-1145 mp4 640x360 1145k , avc1.64001e, video only
mss-1510 ismv 960x540 1510k , H264, video only
dash-1_V_video_3 mp4 960x540 DASH video 1510k , mp4_dash container, avc1.64001F, video only
hls-1658 mp4 960x540 1658k , avc1.64001f, video only
mss-2261 ismv 960x540 2261k , H264, video only
dash-1_V_video_2 mp4 960x540 DASH video 2261k , mp4_dash container, avc1.64001F, video only
hls-2425 mp4 960x540 2425k , avc1.64001f, video only
mss-3397 ismv 1280x720 3397k , H264, video only
dash-1_V_video_1 mp4 1280x720 DASH video 3397k , mp4_dash container, avc1.64001F, video only
hls-3586 mp4 1280x720 3586k , avc1.64001f, video only
mp4-101 mp4 640x360 650k
mp4-102 mp4 960x540 1500k
mp4-103 mp4 960x540 2250k
mp4-104 mp4 1280x720 3400k
mp4-1001 mp4 unknown (best)
[download] Finished downloading playlist: BB1oaoek
$
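The format list above mixes direct MP4 files with ISM and HLS manifests, which matches the kinds of URL seen in externalVideoFiles. As a rough sketch of how those entries might be told apart, the helper below classifies an entry by its URL and contentType. The function name `classify_video_file` and the matching rules are assumptions based only on the three sample entries quoted earlier, not on the actual upstream code.

```python
from urllib.parse import urlparse


def classify_video_file(entry):
    """Guess the delivery type of one externalVideoFiles entry.

    The rules are assumptions drawn from the sample JSON in this thread:
    a "(format=m3u8-aapl)" suffix means HLS, a path ending in
    ".ism/manifest" means an ISM (Smooth Streaming) manifest, and a
    "video/mp4" contentType or ".mp4" path means a direct file.
    """
    path = urlparse(entry.get('url') or '').path
    if 'format=m3u8-aapl' in path:
        return 'hls'
    if path.endswith('.ism/manifest'):
        return 'ism'
    if entry.get('contentType') == 'video/mp4' or path.endswith('.mp4'):
        return 'mp4'
    return 'unknown'


if __name__ == '__main__':
    base = 'https://prod-streaming-video-msn-com.akamaized.net/'
    samples = [
        {'url': base + 'aec52b0b-227e-47ac-a315-5ae3db64dcda/386123b1-6d12-4474-bc0a-455ae9bdc511.mp4',
         'contentType': 'video/mp4', 'format': '1001'},
        {'url': base + '8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest',
         'format': '1004'},
        {'url': base + '8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest(format=m3u8-aapl)',
         'format': '1006'},
    ]
    for sample in samples:
        print(sample['format'], classify_video_file(sample))
```

Entries that match none of the rules are reported as 'unknown' rather than guessed at, since the five elided video files in the JSON excerpt are not shown.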
Wow, thanks, @bashonly and @dirkf! So does this suggest that adding some code from upstream, along with logic to get and process the json, might reliably yield MSN support for yt-dlp?
What is needed from upstream?
The problem MSN URLs mentioned in existing issues no longer lead to the expected content as far as I can tell.
Someone could open an issue upstream (or post here) suggesting various currently valid MSN URLs with playable media that should be supported, describing for each what is expected to be found (one MSN video, as above, a video from YT or DM or some other external host, a playlist of any of these). Suggestions as to what values should be expected for the standard JSON extraction parameters would also be useful.
Then extractor patches can be validated against those. Via a PR upstream and the process of pulling upstream changes into yt-dlp (or a PR here, if someone makes one), your wishes could be realised.
Thanks again. Here is one MSN page which plays somewhat, but fails with yt-dlp: https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD
The video plays for me for the first 5 seconds out of 30, then seems to pause - not sure why (which is why I tried yt-dlp!) Sorry if it is not so helpful....
I don't know about the available JSON extraction approaches here. I'm no devtools guru, but offhand I see some video requests from https://prod-streaming-video-msn-com.akamaized.net/
Here is the -vU debug output:
yt-dlp -vU https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD
[debug] Command-line config: ['-vU', 'https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electricvehicles-affordable-housing/vi-BB1mNnhD']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2024.05.27 from yt-dlp/yt-dlp [12b248ce6] (pip)
[debug] Python 3.10.12 (CPython x86_64 64bit) - Linux-6.5.0-35-generic-x86_64-with-glibc2.35 (OpenSSL 3.0.2 15 Mar 2022, glibc 2.35)
[debug] exe versions: ffmpeg 4.4.2 (setts), ffprobe 4.4.2
[debug] Optional libraries: Cryptodome-3.18.0, brotli-1.0.9, certifi-2020.06.20, mutagen-1.46.0, requests-2.31.0, secretstorage-3.3.1, sqlite3-3.37.2, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1820 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: stable@2024.05.27 from yt-dlp/yt-dlp
yt-dlp is up to date (stable@2024.05.27 from yt-dlp/yt-dlp)
WARNING: The program functionality for this site has been marked as broken, and will probably not work.
[MSN] Extracting URL: https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD
[MSN] boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing: Downloading webpage
ERROR: [MSN] BB1mNnhD: Unable to extract error; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using yt-dlp -U
File "/home/neal/.local/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 734, in extract
ie_result = self._real_extract(url)
File "/home/neal/.local/lib/python3.10/site-packages/yt_dlp/extractor/msn.py", line 163, in _real_extract
error = unescapeHTML(self._search_regex(
File "/home/neal/.local/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 1327, in _search_regex
raise RegexNotFoundError('Unable to extract %s' % _name)
That page is fine with the work-in-progress yt-dl extractor, or was while it was available (now 404). However the asset data is still available; your media link is (-f best) https://prod-streaming-video-msn-com.akamaized.net/ab5cb567-c134-4630-b9f5-521fde0897f1/3e65a2bf-1da5-41eb-ad3e-d3583a926d47.mp4, which I played for the full 30s.
Region
Singapore
Description
Could anyone help me with this? Thanks.
PS C:\Users\v-owendeng> yt-dlp https://www.msn.com/en-us/news/local/jeffco-public-schools-says-no-limit-on-guests-for-end-of-year-events/vi-BB1gIARh
[MSN] jeffco-public-schools-says-no-limit-on-guests-for-end-of-year-events: Downloading webpage
ERROR: [MSN] BB1gIARh: Unable to extract error; please report this issue on https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using yt-dlp -U
Here are 3 test URLs:
https://www.msn.com/en-us/video/cardio/crash-shuts-down-portion-of-brownsville-road/vi-AAL8z9q
https://www.msn.com/en-us/news/local/jeffco-public-schools-says-no-limit-on-guests-for-end-of-year-events/vi-BB1gIARh
https://www.msn.com/en-us/video/peopleandplaces/boston-mayor-janey-to-sign-measure-limiting-police-use-of-tear-gas/vp-BB1gchKd