yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader
https://discord.gg/H5MNcFW63r
The Unlicense

[MSN] Download videos from msn failed with error "Unable to extract error". #3225

Open v-owendeng opened 2 years ago

v-owendeng commented 2 years ago

Region

Singapore

Description

Could anyone help me with this? Thanks.

PS C:\Users\v-owendeng> yt-dlp https://www.msn.com/en-us/news/local/jeffco-public-schools-says-no-limit-on-guests-for-end-of-year-events/vi-BB1gIARh
[MSN] jeffco-public-schools-says-no-limit-on-guests-for-end-of-year-events: Downloading webpage
ERROR: [MSN] BB1gIARh: Unable to extract error; please report this issue on https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using yt-dlp -U

Here are 3 test URLs:

https://www.msn.com/en-us/video/cardio/crash-shuts-down-portion-of-brownsville-road/vi-AAL8z9q
https://www.msn.com/en-us/news/local/jeffco-public-schools-says-no-limit-on-guests-for-end-of-year-events/vi-BB1gIARh
https://www.msn.com/en-us/video/peopleandplaces/boston-mayor-janey-to-sign-measure-limiting-police-use-of-tear-gas/vp-BB1gchKd

Verbose log

PS C:\Users\v-owendeng> yt-dlp -vU "https://www.msn.com/en-us/video/cardio/crash-shuts-down-portion-of-brownsville-road/vi-AAL8z9q"
[debug] Command-line config: ['-vU', 'https://www.msn.com/en-us/video/cardio/crash-shuts-down-portion-of-brownsville-road/vi-AAL8z9q']
[debug] Encodings: locale cp936, fs utf-8, out utf-8, err utf-8, pref cp936
[debug] yt-dlp version 2022.03.08.1 [c0c2c57] (win_exe)
[debug] Python version 3.8.10 (CPython 64bit) - Windows-10-10.0.22000-SP0
[debug] exe versions: ffmpeg 2022-03-17-git-242c07982a-full_build-www.gyan.dev (setts), ffprobe 2022-03-17-git-242c07982a-full_build-www.gyan.dev
[debug] Optional libraries: brotli, Cryptodome, mutagen, sqlite, websockets
[debug] Proxy map: {}
Latest version: 2022.03.08.1, Current version: 2022.03.08.1
yt-dlp is up to date (2022.03.08.1)
[debug] [MSN] Extracting URL: https://www.msn.com/en-us/video/cardio/crash-shuts-down-portion-of-brownsville-road/vi-AAL8z9q
[MSN] crash-shuts-down-portion-of-brownsville-road: Downloading webpage
ERROR: [MSN] AAL8z9q: Unable to extract error; please report this issue on  https://github.com/yt-dlp/yt-dlp , filling out the "Broken site" issue template properly. Confirm you are on the latest version using  yt-dlp -U
  File "yt_dlp\extractor\common.py", line 617, in extract
  File "yt_dlp\extractor\msn.py", line 166, in _real_extract
  File "yt_dlp\extractor\common.py", line 1192, in _search_regex
Miteirao commented 2 years ago

Brazilian links are not working either: https://www.msn.com/pt-br/receitasebebidas/noticias-e-receitas/chips-de-gravatinha-o-aperitivo-f%C3%A1cil-e-diferente-que-vai-fazer-sucesso/vi-AAZm5Vr

chrizilla commented 1 year ago

In case it's useful information, here is another example URL not working: https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy

log:

yt-dlp https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy --ignore-config -F -vU
[debug] Command-line config: ['https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy', '--ignore-config', '-F', '-vU']
[debug] Encodings: locale cp1252, fs utf-8, pref cp1252, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.19045-SP0 (OpenSSL 1.1.1k  25 Mar 2021)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.17, brotli-1.0.9, certifi-2022.12.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.03.04, Current version: stable@2023.03.04
Current Build Hash: 5590c57bd0433ed239a2deaaf92e2ad6f37fe50f53664c821575cafe106a9421
yt-dlp is up to date (stable@2023.03.04)
[MSN] Extracting URL: https://www.msn.com/en-us/weather/topstories/powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom/vi-AA19NdFy
[MSN] powerful-atlantic-storm-whips-up-colossal-waves-along-coast-of-united-kingdom: Downloading webpage
ERROR: [MSN] AA19NdFy: Unable to extract error; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "yt_dlp\extractor\common.py", line 694, in extract
  File "yt_dlp\extractor\msn.py", line 162, in _real_extract
  File "yt_dlp\extractor\common.py", line 1242, in _search_regex
nealmcb commented 4 months ago

I'm just trying to clarify what's going on for anyone else like me who gets confused. And I'm embarrassed to note that it took me lots of clicking around before I actually noticed this rather prominent WARNING when I used the tool for an MSN site, despite running it a second time after upgrading yt-dlp:

WARNING: The program functionality for this site has been marked as broken, and will probably not work.

Why? I only looked at the tail end of what the tool printed. The warning comes as the first line of output.

I also note that some of the issues listed as a "duplicate" of this one contain information that I don't find here. E.g. this from dirkf:

The non-JS page has no useful content. New extraction tactics are needed.

Many thanks to all those who have helped make yt-dlp an amazing program trying to handle a tricky set of problems. And here's hoping that some new extraction tactics come online.

I note that phantomjs is listed with the note "phantomjs - Used in extractors where javascript needs to be run."

I also note that phantomjs was archived in 2023.

If there is a handy list to point to for those who like to learn and try some new extraction techniques, a link would be welcome.

bashonly commented 4 months ago

@nealmcb PhantomJS is only used by 4 extractors, and it's only used as a last resort when there is no other path to extraction except by actually executing the site's javascript. (And for the most widely used of those 4 extractors, there is an alternative available). PhantomJS is likely not relevant to this issue.

If you're writing an extractor and the information you need isn't found in the page source/html, then what you should try to do is to find out what the javascript is doing to get the required info and recreate that in the extractor code; this usually amounts to making an API request (or multiple). In your browser's dev tools you can monitor all network requests being made as the page/video loads. Start there; more often than not the XHR calls will tell you all you need to know and you won't even need to look at the javascript code itself.

dirkf commented 4 months ago

That seems to apply here exactly.

In this case it seems like the URL pattern should be tweaked to extract locale as the xx-xx initial path component and then, instead of fetching the webpage, get the JSON from f'https://assets.msn.com/content/view/v2/Detail/{locale}/{page_id}'.
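
As a rough illustration of that suggestion, the sketch below pulls the locale and page ID out of an MSN URL and builds the asset URL. The regular expression and the name `msn_asset_url` are assumptions made for this example, not the pattern used by any actual extractor; the `vi-`/`vp-` prefixes are taken from the URLs reported in this thread.

```python
import re

# Hypothetical pattern (an assumption for illustration): the locale is the
# first path component (xx-xx) and the page ID follows a "vi-" or "vp-"
# prefix in the last path component.
_MSN_URL_RE = re.compile(
    r'https?://(?:www\.)?msn\.com/(?P<locale>[a-z]{2}-[a-z]{2})/'
    r'(?:[^/?#]+/)+(?:vi|vp)-(?P<id>[0-9A-Za-z]+)'
)


def msn_asset_url(url):
    """Return the JSON asset URL for an MSN video page URL, or None."""
    m = _MSN_URL_RE.match(url)
    if m is None:
        return None
    return 'https://assets.msn.com/content/view/v2/Detail/{locale}/{id}'.format(
        **m.groupdict())


print(msn_asset_url(
    'https://www.msn.com/en-us/weather/topstories/'
    'midwest-northeast-bracing-for-potentially-dangerous-long-duration-'
    'heat-wave-next-week/vi-BB1oaoek?ocid=windirect'))
# prints https://assets.msn.com/content/view/v2/Detail/en-us/BB1oaoek
```

The query string is ignored because the ID group stops at the first non-alphanumeric character.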

The page from https://github.com/yt-dlp/yt-dlp/issues/3225#issuecomment-1506285806 gets 410 Gone for this; MSN still serves a page full of clickbait and ads without telling you that it's not the page you expected.

A current URL in the same area (https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98) with locale en-us and page_id BB1oaoek yields valid JSON. In the JSON, there are these potentially useful values:

{
  ...,
  'videoMetadata': {
    'playTime': 209,
    'closedCaptions': [
      {
        'locale': 'en-us',
        'href': 'https://prod-video-cms-amp-microsoft-com.akamaized.net/tenant/amp/entityid/BB1oaoek?blobrefkey=closedcaptionen-us&$blob=1'
      }
    ],
    'externalVideoFiles': [
      {
        'url': 'https://prod-streaming-video-msn-com.akamaized.net/aec52b0b-227e-47ac-a315-5ae3db64dcda/386123b1-6d12-4474-bc0a-455ae9bdc511.mp4',
        'contentType': 'video/mp4',
        'fileSize': 136785082,
        'format': '1001'
      },
      {
        'url': 'https://prod-streaming-video-msn-com.akamaized.net/8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest',
        'width': 1280,
        'height': 720,
        'format': '1004'
      },
      {
        'url': 'https://prod-streaming-video-msn-com.akamaized.net/8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest(format=m3u8-aapl)',
        'width': 1280,
        'height': 720,
        'format': '1006'
      },
      /* 5 more video files */
      ...,
    ]
  },
  ...,
  'id': 'BB1oaoek',
  'name': '',
  'source': 'msn',
  'type': 'video',
  ...,
  'createdDateTime': '2024-06-13T14:17:54Z',
  'updatedDateTime': '2024-06-13T14:27:11Z',
  'publishedDateTime': '2024-06-13T14:06:49Z',
  ...,
}
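
The excerpt above could be turned into extractor-style fields roughly as follows. This is a sketch under assumptions: `classify_video_file` and `summarize_video_metadata` are hypothetical helpers, the sample dict copies only a few of the values shown above, and the delivery types are guessed from the URL shapes in the excerpt rather than from any documented meaning of the `format` codes.

```python
from datetime import datetime, timezone


def classify_video_file(url):
    """Guess the delivery type from the URL shape seen in the excerpt (assumption)."""
    if 'format=m3u8-aapl' in url:
        return 'hls'
    if url.endswith('.ism/manifest'):
        return 'ism'
    if url.endswith('.mp4'):
        return 'http'
    return 'unknown'


def summarize_video_metadata(data):
    """Collect duration, timestamp, subtitles and file URLs from the page JSON."""
    meta = data.get('videoMetadata') or {}
    published = data.get('publishedDateTime')
    timestamp = None
    if published:
        # Timestamps in the excerpt look like 2024-06-13T14:06:49Z (UTC).
        parsed = datetime.strptime(published, '%Y-%m-%dT%H:%M:%SZ')
        timestamp = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return {
        'id': data.get('id'),
        'duration': meta.get('playTime'),
        'timestamp': timestamp,
        'subtitles': {
            cc['locale']: [{'url': cc['href']}]
            for cc in meta.get('closedCaptions') or []
            if cc.get('locale') and cc.get('href')
        },
        'files': [
            (classify_video_file(f['url']), f.get('format'), f['url'])
            for f in meta.get('externalVideoFiles') or []
            if f.get('url')
        ],
    }


_CDN = 'https://prod-streaming-video-msn-com.akamaized.net/'
sample = {
    'id': 'BB1oaoek',
    'publishedDateTime': '2024-06-13T14:06:49Z',
    'videoMetadata': {
        'playTime': 209,
        'closedCaptions': [{
            'locale': 'en-us',
            'href': 'https://prod-video-cms-amp-microsoft-com.akamaized.net/tenant/amp/entityid/BB1oaoek?blobrefkey=closedcaptionen-us&$blob=1',
        }],
        'externalVideoFiles': [
            {'url': _CDN + 'aec52b0b-227e-47ac-a315-5ae3db64dcda/386123b1-6d12-4474-bc0a-455ae9bdc511.mp4', 'format': '1001'},
            {'url': _CDN + '8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest', 'format': '1004'},
            {'url': _CDN + '8abb4c3d-a2d6-436b-8b67-f7e93c623f88/386123b1-6d12-4474-bc0a-455ae9bd.ism/manifest(format=m3u8-aapl)', 'format': '1006'},
        ],
    },
}

info = summarize_video_metadata(sample)
print(info['duration'], info['timestamp'], [kind for kind, _, _ in info['files']])
# prints 209 1718287609 ['http', 'ism', 'hls']
```

A real extractor would hand the manifest URLs to its own manifest parsers; this sketch only sorts the entries so that step has something to work from.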

Then (using upstream code):

$ python3 -m youtube_dl -vF 'https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-vF', 'https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: fc03c3bb3
[debug] Python 3.11.2 (CPython i686 32bit) - Linux-6.1.0-20-686-pae-i686-with-glibc2.36 - OpenSSL 3.0.11 19 Sep 2023 - glibc 2.36
[debug] exe versions: ffmpeg 5.1.4-0, ffprobe 5.1.4-0
[debug] Proxy map: {}
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading page JSON
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading ISM manifest
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading m3u8 information
[MSN] midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week: Downloading MPD manifest
[download] Downloading playlist: BB1oaoek
[MSN] playlist BB1oaoek: Collected 1 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[info] Available formats for BB1oaoek:
format code                     extension  resolution note
hls-audio-aac_und_2_96021_2_1   mp4        audio only [und] 
mss-aac_und_2_96021_2_1-96      isma       audio only   96k , AACL  (48000Hz)
dash-5_A_aac_und_2_96021_2_1_1  m4a        audio only DASH audio   96k , m4a_dash container, mp4a.40.2 (48000Hz)
mss-656                         ismv       640x360     656k , H264, video only
dash-1_V_video_5                mp4        640x360    DASH video  656k , mp4_dash container, avc1.64001E, video only
hls-784                         mp4        640x360     784k , avc1.64001e, video only
mss-1008                        ismv       640x360    1008k , H264, video only
dash-1_V_video_4                mp4        640x360    DASH video 1008k , mp4_dash container, avc1.64001E, video only
hls-1145                        mp4        640x360    1145k , avc1.64001e, video only
mss-1510                        ismv       960x540    1510k , H264, video only
dash-1_V_video_3                mp4        960x540    DASH video 1510k , mp4_dash container, avc1.64001F, video only
hls-1658                        mp4        960x540    1658k , avc1.64001f, video only
mss-2261                        ismv       960x540    2261k , H264, video only
dash-1_V_video_2                mp4        960x540    DASH video 2261k , mp4_dash container, avc1.64001F, video only
hls-2425                        mp4        960x540    2425k , avc1.64001f, video only
mss-3397                        ismv       1280x720   3397k , H264, video only
dash-1_V_video_1                mp4        1280x720   DASH video 3397k , mp4_dash container, avc1.64001F, video only
hls-3586                        mp4        1280x720   3586k , avc1.64001f, video only
mp4-101                         mp4        640x360     650k
mp4-102                         mp4        960x540    1500k
mp4-103                         mp4        960x540    2250k
mp4-104                         mp4        1280x720   3400k
mp4-1001                        mp4        unknown    (best)
[download] Finished downloading playlist: BB1oaoek
$
nealmcb commented 4 months ago

Wow, thanks, @bashonly and @dirkf! So does this suggest that adding some code from upstream, along with logic to get and process the json, might reliably yield MSN support for yt-dlp?

What is needed from upstream?

dirkf commented 4 months ago

The problem MSN URLs mentioned in existing issues no longer lead to the expected content as far as I can tell.

Someone could open an issue upstream (or post here) suggesting various currently valid MSN URLs with playable media that should be supported, describing for each what is expected to be found (a single MSN video, as above; a video from YT, DM or some other external host; or a playlist of any of these). Suggestions as to what values should be expected for the standard JSON extraction parameters would also be useful.

What the video above gets:

```console
$ python3 -m youtube_dl -j 'https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98' | jq 'del(.["formats","requested_formats"])'
{
  "id": "BB1oaoek",
  "title": "Midwest, Northeast bracing for potentially dangerous, long-duration heat wave next week",
  "subtitles": {
    "en-us": [
      {
        "url": "https://prod-video-cms-amp-microsoft-com.akamaized.net/tenant/amp/entityid/BB1oaoek?blobrefkey=closedcaptionen-us&$blob=1",
        "ext": "ttml"
      }
    ]
  },
  "description": "The first long-duration heat wave of the summer is expected to blast millions of people from the Midwest to the Northeast starting this weekend and lasting into next week.",
  "thumbnail": "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/BB1oaoed.img",
  "uploader": "Fox Weather",
  "duration": 209,
  "timestamp": 1718287609,
  "n_entries": 1,
  "playlist": "BB1oaoek",
  "playlist_id": "BB1oaoek",
  "playlist_title": null,
  "playlist_uploader": null,
  "playlist_uploader_id": null,
  "playlist_index": 1,
  "extractor": "MSN",
  "webpage_url": "https://www.msn.com/en-us/weather/topstories/midwest-northeast-bracing-for-potentially-dangerous-long-duration-heat-wave-next-week/vi-BB1oaoek?ocid=windirect&cvid=981228a2079b4d07908de15c5820a4e2&ei=98",
  "webpage_url_basename": "vi-BB1oaoek",
  "extractor_key": "MSN",
  "thumbnails": [
    {
      "url": "https://img-s-msn-com.akamaized.net/tenant/amp/entityid/BB1oaoed.img",
      "id": "0"
    }
  ],
  "display_id": "BB1oaoek",
  "upload_date": "20240613",
  "requested_subtitles": null,
  "http_headers": null,
  "format": "hls-3586 - 1280x720+dash-5_A_aac_und_2_96021_2_1_1 - audio only (DASH audio)",
  "format_id": "hls-3586+dash-5_A_aac_und_2_96021_2_1_1",
  "width": 1280,
  "height": 720,
  "resolution": null,
  "fps": null,
  "vcodec": "avc1.64001f",
  "vbr": null,
  "stretched_ratio": null,
  "acodec": "mp4a.40.2",
  "abr": null,
  "ext": "mp4",
  "fulltitle": "Midwest, Northeast bracing for potentially dangerous, long-duration heat wave next week",
  "_filename": "Midwest, Northeast bracing for potentially dangerous, long-duration heat wave next week-BB1oaoek.mp4"
}
```

Extractor patches can then be validated against those values. Via a PR upstream and the process of pulling upstream changes into yt-dlp (or a PR here, if someone makes one), your wishes could be realised.
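
One way to picture that validation step is a small comparison between the fields an extractor returns and the values expected for a given URL. The helper below is a hypothetical stand-alone sketch, not part of yt-dlp's or youtube-dl's test tooling; the expected values are copied from the `-j` output quoted earlier in this comment.

```python
def find_mismatches(expected, actual):
    """Return {field: (expected, actual)} for every expected field that differs."""
    missing = object()  # sentinel so an absent key never equals an expected None
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key, missing) != value
    }


expected = {
    'id': 'BB1oaoek',
    'uploader': 'Fox Weather',
    'duration': 209,
    'timestamp': 1718287609,
    'upload_date': '20240613',
}

# Simulate a patched extractor that fails to fill in the duration.
candidate = dict(expected, duration=None)
print(find_mismatches(expected, candidate))
# prints {'duration': (209, None)}
```

An empty result means every expected field matched; extra fields in the extractor output are deliberately ignored.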

nealmcb commented 4 months ago

Thanks again. Here is one MSN page which plays somewhat, but fails with yt-dlp: https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD

The video plays for me for the first 5 seconds out of 30, then seems to pause; I'm not sure why (which is why I tried yt-dlp!). Sorry if this is not so helpful.

I don't know about the available JSON extraction approaches here. I'm no devtools guru, but offhand I see some video requests from https://prod-streaming-video-msn-com.akamaized.net/

Here is the -vU debug output:

yt-dlp -vU https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD
[debug] Command-line config: ['-vU', 'https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2024.05.27 from yt-dlp/yt-dlp [12b248ce6] (pip)
[debug] Python 3.10.12 (CPython x86_64 64bit) - Linux-6.5.0-35-generic-x86_64-with-glibc2.35 (OpenSSL 3.0.2 15 Mar 2022, glibc 2.35)
[debug] exe versions: ffmpeg 4.4.2 (setts), ffprobe 4.4.2
[debug] Optional libraries: Cryptodome-3.18.0, brotli-1.0.9, certifi-2020.06.20, mutagen-1.46.0, requests-2.31.0, secretstorage-3.3.1, sqlite3-3.37.2, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1820 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Latest version: stable@2024.05.27 from yt-dlp/yt-dlp
yt-dlp is up to date (stable@2024.05.27 from yt-dlp/yt-dlp)
WARNING: The program functionality for this site has been marked as broken, and will probably not work.
[MSN] Extracting URL: https://www.msn.com/en-us/autos/news/boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing/vi-BB1mNnhD
[MSN] boulder-starts-pilot-program-to-integrate-electric-vehicles-affordable-housing: Downloading webpage
ERROR: [MSN] BB1mNnhD: Unable to extract error; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "/home/neal/.local/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 734, in extract
    ie_result = self._real_extract(url)
  File "/home/neal/.local/lib/python3.10/site-packages/yt_dlp/extractor/msn.py", line 163, in _real_extract
    error = unescapeHTML(self._search_regex(
  File "/home/neal/.local/lib/python3.10/site-packages/yt_dlp/extractor/common.py", line 1327, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
dirkf commented 4 months ago

That page is fine with the work-in-progress yt-dl extractor, or was while it was available (now 404). However the asset data is still available; your media link is (-f best) https://prod-streaming-video-msn-com.akamaized.net/ab5cb567-c134-4630-b9f5-521fde0897f1/3e65a2bf-1da5-41eb-ad3e-d3583a926d47.mp4, which I played for the full 30s.