ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

BBC News stories no longer showing videos to download (+BBC Sports parsing failure) #29926

Closed GreenReaper closed 2 years ago

GreenReaper commented 3 years ago

Verbose log

[debug] System config: []
[debug] User config: ['--ffmpeg-location', 'C:\\Program Files\\ffmpeg-20200311-36aaee2-win64-shared\\bin', '-f', '137+bestaudio/298+bestaudio/136+bestaudio/135+bestaudio/134+bestaudio/DASH-VIDEO-1+bestaudio/html5-video-high+html5-audio-high/best/bestvideo+bestaudio', '--write-sub', '--convert-subs', 'srt', '--embed-subs', '--fragment-retries', 'infinite', '--retries', 'infinite']
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://www.bbc.co.uk/news/business-58423705']
[debug] Encodings: locale cp1252, fs utf-8, out utf-8, pref cp1252
[debug] youtube-dl version 2021.06.06
[debug] Python version 3.9.7 (CPython) - Windows-10-10.0.19041-SP0
[debug] exe versions: ffmpeg git-2020-03-11-36aaee2, ffprobe git-2020-03-11-36aaee2, rtmpdump 2.4-20151223-gfa8646d-GnuTLS_3.5.12-i686-static
[debug] Proxy map: {}
[bbc] business-58423705: Downloading webpage
[download] Downloading playlist: CEO Secrets: The bra boss busting stereotypes
[bbc] playlist CEO Secrets: The bra boss busting stereotypes: Collected 0 video ids (downloading 0 of them)
[download] Finished downloading playlist: CEO Secrets: The bra boss busting stereotypes

Description

In the last couple of days, BBC News stories such as this and this are parsed as having zero videos, despite having playable videos. The issue does not appear to affect video-centric "/av/" pages such as this and this (a dedicated page for the second link above).

I use youtube-dl for this because my netbook struggles to play these videos in the browser itself, but MPC-HC can do it.


This issue presents differently to that of BBC Sports stories with videos, which appear not to be parsed correctly at all:

[bbc] 58404777: Downloading webpage
ERROR: Unable to extract playlist data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "c:\program files\python39\lib\site-packages\youtube_dl\YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
  File "c:\program files\python39\lib\site-packages\youtube_dl\YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
  File "c:\program files\python39\lib\site-packages\youtube_dl\extractor\common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "c:\program files\python39\lib\site-packages\youtube_dl\extractor\bbc.py", line 1253, in _real_extract
    self._search_regex(
  File "c:\program files\python39\lib\site-packages\youtube_dl\extractor\common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)

dirkf commented 3 years ago

Unmerged PRs exist for BBC but even with these, the News pages linked above are parsed as playlists with no videos.

For the business news pages, the extractor is looking at the dict-ified JSON page model initial_data sent as the value assigned to the JS variable window.__INITIAL_DATA__. It expects to, and does, find an object x with initial_data['x']['name'] == 'article'. Then it expects to find x['data']['blocks'] and tries to parse the programme id and other metadata from there. In these new pages, the wanted information is in x['data']['content']['model']['blocks'] instead.

The solution is to change the getter lambda x: x['data']['blocks'] at line 1208 of extractor/bbc.py to this tuple of getters, so that the old location is tried first and the new one second: (lambda x: x['data']['blocks'], lambda x: x['data']['content']['model']['blocks'],)
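To illustrate why a tuple of getters covers both page layouts, here is a minimal sketch. It assumes the getter is passed to a helper that behaves like youtube-dl's try_get from youtube_dl/utils.py; the try_get below is a simplified stand-in written for this example, and old_layout and new_layout are made-up samples rather than real BBC page data.

```python
def try_get(src, getter, expected_type=None):
    """Simplified stand-in for youtube_dl.utils.try_get (an assumption for
    illustration): return the first getter result of the expected type."""
    getters = getter if isinstance(getter, (list, tuple)) else (getter,)
    for get in getters:
        try:
            value = get(src)
        except (AttributeError, KeyError, TypeError, IndexError):
            # This getter does not match the page layout; try the next one.
            continue
        if expected_type is None or isinstance(value, expected_type):
            return value
    return None


# The tuple proposed in the comment above: old location first, new second.
BLOCK_GETTERS = (
    lambda x: x['data']['blocks'],
    lambda x: x['data']['content']['model']['blocks'],
)

# Hypothetical sample objects standing in for the 'article' entry.
old_layout = {'name': 'article', 'data': {'blocks': [{'type': 'media'}]}}
new_layout = {
    'name': 'article',
    'data': {'content': {'model': {'blocks': [{'type': 'media'}]}}},
}

old_blocks = try_get(old_layout, BLOCK_GETTERS, list)
new_blocks = try_get(new_layout, BLOCK_GETTERS, list)
# With only the original single getter, the new layout yields nothing,
# which matches the "Collected 0 video ids" symptom.
single_getter_result = try_get(new_layout, BLOCK_GETTERS[0], list)
```

With the single original getter the lookup raises KeyError on the new layout and the helper returns None, so no blocks are parsed; with the tuple, the second getter supplies them.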

The Sport example doesn't have a video now, but with PR #28577 (unmerged) this page https://www.bbc.co.uk/sport/football/58488393 is extracted as a playlist with one video.

chenjianxiong commented 3 years ago

Modifying line 1169 in bbc.py as below fixes the problem:
initial_data = self._parse_json(json.JSONDecoder().decode(self._search_regex(
    r'window.__INITIAL_DATA__\s=\s(\"{.+?}\");', webpage, 'preload state', default='{}')), playlist_id, fatal=False)

dirkf commented 3 years ago

Really? That pattern doesn't exist in OP's test page.

Which sort of page has that format?

Update: https://www.bbc.com/news/av/world-europe-59468682 (e.g., from #30291) has its window.__INITIAL_DATA__ as a string rather than a JSON object, and then a fix like the one above applies.
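The two forms of window.__INITIAL_DATA__ mentioned here (a plain JSON object, or a JSON string that itself contains JSON) can be handled by decoding once or twice as needed. The sketch below uses only the standard library; parse_initial_data, the two regular expressions and the sample pages are hypothetical and written for this example, not taken from bbc.py or from any PR. The object pattern is deliberately naive and would stop early at a "};" inside a string value.

```python
import json
import re

# String form: window.__INITIAL_DATA__ = "{\"data\": ...}";
_STRING_RE = re.compile(
    r'window\.__INITIAL_DATA__\s*=\s*("(?:[^"\\]|\\.)*")\s*;')
# Object form: window.__INITIAL_DATA__ = {"data": ...};
_OBJECT_RE = re.compile(
    r'window\.__INITIAL_DATA__\s*=\s*(\{.+?\})\s*;', re.DOTALL)


def parse_initial_data(webpage):
    """Return the page model as a dict, or {} if it cannot be found/parsed."""
    match = _STRING_RE.search(webpage)
    if match:
        try:
            inner = json.loads(match.group(1))  # first decode: JSON -> str
            return json.loads(inner)            # second decode: str -> dict
        except ValueError:
            return {}
    match = _OBJECT_RE.search(webpage)
    if match:
        try:
            return json.loads(match.group(1))   # single decode is enough
        except ValueError:
            return {}
    return {}


# Made-up sample pages, one per form.
model = {'data': {'a': 1}}
page_with_object = (
    '<script>window.__INITIAL_DATA__ = %s;</script>' % json.dumps(model))
page_with_string = (
    '<script>window.__INITIAL_DATA__ = %s;</script>'
    % json.dumps(json.dumps(model)))

from_object = parse_initial_data(page_with_object)
from_string = parse_initial_data(page_with_string)
from_nothing = parse_initial_data('<html></html>')
```

Falling back to an empty dict when nothing matches keeps the later lookups from failing on a missing page model, at the cost of hiding the reason for the failure.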

mnlmsf commented 2 years ago

I used to download those BBC News Headlines every day, does not work anymore

[12/01/21 23:04:48] [debug] System config: []
[12/01/21 23:04:48] [debug] User config: []
[12/01/21 23:04:48] [debug] Custom config: []
[12/01/21 23:04:48] [debug] Command-line args: ['--newline', '-o', 'D:\\Downloads\\%(id)s.%(ext)s', '-f', 'mp4', '--hls-prefer-native', '--verbose', 'https://www.bbc.com/news/av/10462520']
[12/01/21 23:04:48] [debug] Encodings: locale cp1252, fs mbcs, out cp1252, pref cp1252
[12/01/21 23:04:48] [debug] youtube-dl version 2021.06.06
[12/01/21 23:04:48] [debug] Python version 3.4.4 (CPython) - Windows-10-10.0.19041
[12/01/21 23:04:48] [debug] exe versions: ffmpeg 3.3.2, ffprobe 3.3.2
[12/01/21 23:04:48] [debug] Proxy map: {}
[12/01/21 23:04:48] ERROR: Unable to extract playlist data; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type  youtube-dl -U  to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Vangelis66 commented 2 years ago

mnlmsf wrote: "I used to download those BBC News Headlines every day, does not work anymore"

PR #30292 takes care of that:

youtube-dl -F "https://www.bbc.com/news/av/10462520" => 

[bbc] 10462520: Downloading webpage
[bbc] p0b7mdbv: Downloading media selection JSON
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading m3u8 information
[bbc] p0b7mdbv: Downloading MPD manifest
[bbc] p0b7mdbv: Downloading MPD manifest
[bbc] p0b7mdbv: Downloading MPD manifest
[bbc] p0b7mdbv: Downloading MPD manifest
[download] Downloading playlist: One-minute World News
[bbc] playlist One-minute World News: Collected 1 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[info] Available formats for p0b7mdbv:
format code                      extension  resolution note
mf_akamai-audio_eng=96000-0      m4a        audio only [en] DASH audio   96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_akamai-audio_eng=96000-1      m4a        audio only [en] DASH audio   96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_cloudfront-audio_eng=96000-0  m4a        audio only [en] DASH audio   96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_cloudfront-audio_eng=96000-1  m4a        audio only [en] DASH audio   96k , m4a_dash container, mp4a.40.5 (48000Hz)
mf_akamai-video=86000-0          mp4        192x108    DASH video   86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=86000-1          mp4        192x108    DASH video   86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=86000-0      mp4        192x108    DASH video   86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=86000-1      mp4        192x108    DASH video   86k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=156000-0         mp4        256x144    DASH video  156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=156000-1         mp4        256x144    DASH video  156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=156000-0     mp4        256x144    DASH video  156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=156000-1     mp4        256x144    DASH video  156k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=281000-0         mp4        384x216    DASH video  281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=281000-1         mp4        384x216    DASH video  281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=281000-0     mp4        384x216    DASH video  281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_cloudfront-video=281000-1     mp4        384x216    DASH video  281k , mp4_dash container, avc3.42C015, 25fps, video only
mf_akamai-video=437000-0         mp4        512x288    DASH video  437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_akamai-video=437000-1         mp4        512x288    DASH video  437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_cloudfront-video=437000-0     mp4        512x288    DASH video  437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_cloudfront-video=437000-1     mp4        512x288    DASH video  437k , mp4_dash container, avc3.4D4015, 25fps, video only
mf_akamai-video=827000-0         mp4        704x396    DASH video  827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_akamai-video=827000-1         mp4        704x396    DASH video  827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_cloudfront-video=827000-0     mp4        704x396    DASH video  827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_cloudfront-video=827000-1     mp4        704x396    DASH video  827k , mp4_dash container, avc3.4D401F, 25fps, video only
mf_akamai-video=1604000-0        mp4        960x540    DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_akamai-video=1604000-1        mp4        960x540    DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_cloudfront-video=1604000-0    mp4        960x540    DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_cloudfront-video=1604000-1    mp4        960x540    DASH video 1604k , mp4_dash container, avc3.64001F, 25fps, video only
mf_akamai-0                      mp4        256x144     224k , h264
mf_akamai-1                      mp4        256x144     224k , h264
mf_cloudfront-0                  mp4        256x144     224k , h264
mf_cloudfront-1                  mp4        256x144     224k , h264
mf_akamai-349-0                  mp4        384x216     349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_akamai-349-1                  mp4        384x216     349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_cloudfront-349-0              mp4        384x216     349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_cloudfront-349-1              mp4        384x216     349k , avc1.42C015@ 281k, 25.0fps, mp4a.40.5@ 48k
mf_akamai-2                      mp4        448x252     543k , h264
mf_akamai-3                      mp4        448x252     543k , h264
mf_cloudfront-2                  mp4        448x252     543k , h264
mf_cloudfront-3                  mp4        448x252     543k , h264
mf_akamai-565-0                  mp4        512x288     565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-565-1                  mp4        512x288     565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-565-0              mp4        512x288     565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-565-1              mp4        512x288     565k , avc1.4D4015@ 437k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-4                      mp4        640x360     800k , h264
mf_akamai-5                      mp4        640x360     800k , h264
mf_cloudfront-4                  mp4        640x360     800k , h264
mf_cloudfront-5                  mp4        640x360     800k , h264
mf_akamai-979-0                  mp4        704x396     979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-979-1                  mp4        704x396     979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-979-0              mp4        704x396     979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-979-1              mp4        704x396     979k , avc1.4D401F@ 827k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-1802-0                 mp4        960x540    1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k
mf_akamai-1802-1                 mp4        960x540    1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-1802-0             mp4        960x540    1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k
mf_cloudfront-1802-1             mp4        960x540    1802k , avc1.64001F@1604k, 25.0fps, mp4a.40.5@ 96k (best)
[download] Finished downloading playlist: One-minute World News