GreenReaper closed 2 years ago
Unmerged PRs exist for BBC but even with these, the News pages linked above are parsed as playlists with no videos.
For the business news pages, the extractor is looking at the dict-ified JSON page model initial_data sent as the value assigned to the JS variable window.__INITIAL_DATA__. It expects to, and does, find an object x with initial_data['x']['name'] == 'article'. Then it expects to find x['data']['blocks'] and tries to parse the programme id and other metadata from there. In these new pages, the wanted information is in x['data']['content']['model']['blocks'] instead.
The solution is to change the getter lambda x: x['data']['blocks'] in line 1208 of extractor/bbc.py to this tuple:
(lambda x: x['data']['blocks'], lambda x: x['data']['content']['model']['blocks'],)
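As a rough illustration of why a tuple of getters helps, here is a minimal sketch of a fallback lookup. The helper try_getters is a hypothetical stand-in written for this example (youtube-dl has its own try_get utility, which is not reproduced here), and the sample dicts and the 'media' block type are made-up placeholders rather than real BBC page data.

```python
def try_getters(src, getters, expected_type=None):
    """Return the result of the first getter that succeeds, else None.

    Hypothetical helper for illustration only.
    """
    if not isinstance(getters, (list, tuple)):
        getters = [getters]
    for getter in getters:
        try:
            value = getter(src)
        except (AttributeError, KeyError, TypeError, IndexError):
            # This layout does not match; try the next getter.
            continue
        if expected_type is None or isinstance(value, expected_type):
            return value
    return None


# Assumed sample data: the older layout and the newer, more deeply nested one.
old_layout = {'data': {'blocks': [{'type': 'media'}]}}
new_layout = {'data': {'content': {'model': {'blocks': [{'type': 'media'}]}}}}

getters = (
    lambda x: x['data']['blocks'],
    lambda x: x['data']['content']['model']['blocks'],
)

old_blocks = try_getters(old_layout, getters, list)
new_blocks = try_getters(new_layout, getters, list)
```

Because the getters are tried in order, pages using the older layout keep working and the newer layout is only consulted when the first lookup fails.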
The Sport example doesn't have a video now, but with PR #28577 (unmerged) this page https://www.bbc.co.uk/sport/football/58488393 is extracted as a playlist with one video.
Modifying line 1169 in bbc.py as below fixes the problem:
initial_data = self._parse_json(json.JSONDecoder().decode(self._search_regex(
    r'window.__INITIAL_DATA__\s=\s(\"{.+?}\");', webpage, 'preload state', default='{}')), playlist_id, fatal=False)
Really? That pattern doesn't exist in OP's test page.
Which sort of page has that format?
Update: https://www.bbc.com/news/av/world-europe-59468682 (e.g., from #30291) has its window.__INITIAL_DATA__ as a string rather than a JSON object, and then a fix like the one above applies.
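To make the two formats concrete, here is a minimal standalone sketch that accepts window.__INITIAL_DATA__ either as a JSON object or as a JSON string that itself contains JSON. The function name parse_initial_data, the regex and the sample pages are assumptions for illustration; this is not the code from bbc.py or from any of the PRs mentioned.

```python
import json
import re

# Matches either a quoted JS string or a bare {...} object assigned to the variable.
# The object branch is a simplification and can be fooled by "};" inside strings.
_INITIAL_DATA_RE = re.compile(
    r'window\.__INITIAL_DATA__\s*=\s*("(?:[^"\\]|\\.)*"|\{.+?\})\s*;',
    re.DOTALL)


def parse_initial_data(webpage):
    """Return the page model as a dict, or {} if it cannot be found or parsed."""
    match = _INITIAL_DATA_RE.search(webpage)
    if not match:
        return {}
    try:
        data = json.loads(match.group(1))
        if isinstance(data, str):
            # The value was a string literal holding JSON: decode a second time.
            data = json.loads(data)
    except ValueError:
        return {}
    return data if isinstance(data, dict) else {}


# Assumed sample pages built from the same payload in both formats.
payload = {'data': {'x': 1}}
page_obj = '<script>window.__INITIAL_DATA__ = %s;</script>' % json.dumps(payload)
page_str = '<script>window.__INITIAL_DATA__ = %s;</script>' % json.dumps(json.dumps(payload))

from_obj = parse_initial_data(page_obj)
from_str = parse_initial_data(page_str)
```

Decoding once and checking the result type avoids needing a separate regex per page format.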
I used to download those BBC News Headlines every day, does not work anymore
[12/01/21 23:04:48] [debug] System config: []
[12/01/21 23:04:48] [debug] User config: []
[12/01/21 23:04:48] [debug] Custom config: []
[12/01/21 23:04:48] [debug] Command-line args: ['--newline', '-o', 'D:\\Downloads\\%(id)s.%(ext)s', '-f', 'mp4', '--hls-prefer-native', '--verbose', 'https://www.bbc.com/news/av/10462520']
[12/01/21 23:04:48] [debug] Encodings: locale cp1252, fs mbcs, out cp1252, pref cp1252
[12/01/21 23:04:48] [debug] youtube-dl version 2021.06.06
[12/01/21 23:04:48] [debug] Python version 3.4.4 (CPython) - Windows-10-10.0.19041
[12/01/21 23:04:48] [debug] exe versions: ffmpeg 3.3.2, ffprobe 3.3.2
[12/01/21 23:04:48] [debug] Proxy map: {}
[12/01/21 23:04:48] ERROR: Unable to extract playlist data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
I used to download those BBC News Headlines every day, does not work anymore
PR #30292 takes care of that:
youtube-dl -F "https://www.bbc.com/news/av/10462520" =>
[bbc] 10462520: Downloading webpage
[bbc] p0b7mdbv: Downloading media selection JSON
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading MPD manifest
[bbc] p0b7mdbv: Downloading MPD manifest
[bbc] p0b7mdbv: Downloading MPD manifest
[bbc] p0b7mdbv: Downloading MPD manifest
[download] Downloading playlist: One-minute World News
[bbc] playlist One-minute World News: Collected 1 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[info] Available formats for p0b7mdbv:
format code extension resolution note
mf_akamai-audio_eng=96000-0 m4a audio only [en] DASH audio 96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_akamai-audio_eng=96000-1 m4a audio only [en] DASH audio 96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_cloudfront-audio_eng=96000-0 m4a audio only [en] DASH audio 96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_cloudfront-audio_eng=96000-1 m4a audio only [en] DASH audio 96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_akamai-video=86000-0 mp4 192x108 DASH video 86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=86000-1 mp4 192x108 DASH video 86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=86000-0 mp4 192x108 DASH video 86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=86000-1 mp4 192x108 DASH video 86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=156000-0 mp4 256x144 DASH video 156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=156000-1 mp4 256x144 DASH video 156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=156000-0 mp4 256x144 DASH video 156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=156000-1 mp4 256x144 DASH video 156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=281000-0 mp4 384x216 DASH video 281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=281000-1 mp4 384x216 DASH video 281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=281000-0 mp4 384x216 DASH video 281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=281000-1 mp4 384x216 DASH video 281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=437000-0 mp4 512x288 DASH video 437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_akamai-video=437000-1 mp4 512x288 DASH video 437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_cloudfront-video=437000-0 mp4 512x288 DASH video 437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_cloudfront-video=437000-1 mp4 512x288 DASH video 437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_akamai-video=827000-0 mp4 704x396 DASH video 827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_akamai-video=827000-1 mp4 704x396 DASH video 827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_cloudfront-video=827000-0 mp4 704x396 DASH video 827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_cloudfront-video=827000-1 mp4 704x396 DASH video 827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_akamai-video=1604000-0 mp4 960x540 DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_akamai-video=1604000-1 mp4 960x540 DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_cloudfront-video=1604000-0 mp4 960x540 DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_cloudfront-video=1604000-1 mp4 960x540 DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_akamai-0 mp4 256x144 224k , h264
mf_akamai-1 mp4 256x144 224k , h264
mf_cloudfront-0 mp4 256x144 224k , h264
mf_cloudfront-1 mp4 256x144 224k , h264
mf_akamai-349-0 mp4 384x216 349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_akamai-349-1 mp4 384x216 349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_cloudfront-349-0 mp4 384x216 349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_cloudfront-349-1 mp4 384x216 349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_akamai-2 mp4 448x252 543k , h264
mf_akamai-3 mp4 448x252 543k , h264
mf_cloudfront-2 mp4 448x252 543k , h264
mf_cloudfront-3 mp4 448x252 543k , h264
mf_akamai-565-0 mp4 512x288 565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-565-1 mp4 512x288 565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-565-0 mp4 512x288 565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-565-1 mp4 512x288 565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-4 mp4 640x360 800k , h264
mf_akamai-5 mp4 640x360 800k , h264
mf_cloudfront-4 mp4 640x360 800k , h264
mf_cloudfront-5 mp4 640x360 800k , h264
mf_akamai-979-0 mp4 704x396 979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-979-1 mp4 704x396 979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-979-0 mp4 704x396 979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-979-1 mp4 704x396 979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-1802-0 mp4 960x540 1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-1802-1 mp4 960x540 1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-1802-0 mp4 960x540 1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-1802-1 mp4 960x540 1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k (best)
[download] Finished downloading playlist: One-minute World News
Description
In the last couple of days, BBC News stories such as this and this are parsed as having zero videos, despite having playable videos. The issue does not appear to impact video-centric "/av/" pages such as this and this (a dedicated page for the second link above).
I use youtube-dl for this because my netbook struggles to play these videos in the browser itself, but MPC-HC can do it.
This issue presents differently to that of BBC Sports stories with videos, which appear not to be parsed correctly at all: