yt-dlp / yt-dlp

A feature-rich command-line audio/video downloader
The Unlicense

BFIPlayer needs updating to support new Brightcove player #6822

Open ghost opened 1 year ago

ghost commented 1 year ago

Region

United Kingdom

Provide a description that is worded well enough to be understood

BFI Player now uses Brightcove (sigh) and so the plugin needs updating.

I've been trying to work through it myself to see if it's similar to one of the other sites that use the BrightcoveNewIE extractor but haven't got far yet.

Inspecting the network traffic with Firefox, I can get the link to the m3u8, play it just fine in VLC and even rip it with ffmpeg, so extraction should be possible. However, I'm not sure whether this uses a time-limited token, which would complicate matters.

Example site: https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

$ ./yt-dlp.sh -vU https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online
[debug] Command-line config: ['-vU', 'https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (source)
[debug] Lazy loading extractors is disabled
[debug] Git HEAD: 7666b9360
[debug] Python 3.8.2 (CPython x86_64 64bit) - Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.29 (OpenSSL 1.1.1f  31 Mar 2020, glibc 2.31)
[debug] exe versions: none
[debug] Optional libraries: certifi-2019.11.28, secretstorage-2.3.1, sqlite3-2.6.0
[debug] Proxy map: {}
[debug] Loaded 1797 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.03.04, Current version: stable@2023.03.04
yt-dlp is up to date (stable@2023.03.04)
[bfi:player] Extracting URL: https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online
[bfi:player] the-beatles-1963: Downloading webpage
[download] Downloading playlist: <Untitled>
[bfi:player] Playlist <Untitled>: Downloading 0 items of 0
[download] Finished downloading playlist: <Untitled>
dirkf commented 1 year ago

The Brightcove URL below, built from the <video-js> element in the BFI page, returns "Access to this resource is forbidden by access policy." with or without --referer.

https://players.brightcove.net/6057949427001/hndK61Wvr_default/index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG

More research needed.

ghost commented 1 year ago

Ah interesting: https://players.brightcove.net/6057949427001/hndK61Wvr_default/index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG works for me. I was missing the "ref:" prefix on the video ID. I'll see if I can edit one of the extractors to work with this.

ghost commented 1 year ago

Actually scrap that - I get the same message as you when trying this with yt_dlp. What I reported above was true when trying that URL in Firefox.

dirkf commented 1 year ago

Using a new FF private window each time, you could try (a) opening the Brightcove URL directly and (b) navigating to that URL from the film page. If either works, retry with the browser tools open, then try and/or post the curl command that you get from the "Copy as cURL" context menu entry for the Brightcove URL in the Network tab. If neither works, your plain FF session must hold some authorisation context (cookies, presumably).

Just setting UA 'Mozilla/5.0' doesn't seem to work.

ghost commented 1 year ago

In private FF tab (which works):

curl "https://players.brightcove.net/6057949427001/hndK61Wvr_default/index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" -H "Accept-Language: en-GB,en;q=0.5" -H "Accept-Encoding: gzip, deflate, br" -H "DNT: 1" -H "Connection: keep-alive" -H "Upgrade-Insecure-Requests: 1" -H "Sec-Fetch-Dest: document" -H "Sec-Fetch-Mode: navigate" -H "Sec-Fetch-Site: none" -H "Sec-Fetch-User: ?1"

I did also try with --cookies-from-browser firefox

Perplexing one this...

dirkf commented 1 year ago

The curl command works for me. Adding the UA and Sec-Fetch-... headers to the yt-dlp command still fails. The URL that's failing isn't the player URL as above but this one (--print-traffic reveals it): https://edge.api.brightcove.com/playback/v1/accounts/6057949427001/videos/ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG
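
For reference, the relationship between the player URL and the failing API URL can be sketched with the standard library alone. The helper name playback_api_url is made up for illustration and the pattern is a simplified take on the player URL format; this is not how the extractor itself is written.

```python
import re

# Simplified pattern for the player URL format quoted in this thread.
PLAYER_URL_RE = re.compile(
    r'https?://players\.brightcove\.net/(?P<account_id>\d+)/'
    r'(?P<player_id>[^/_]+)_(?P<embed>[^/]+)/index\.html\?'
    r'(?:.*&)?videoId=(?P<video_id>\d+|ref:[^&]+)')


def playback_api_url(player_url):
    """Return the playback API URL for a player URL (hypothetical helper)."""
    m = PLAYER_URL_RE.match(player_url)
    if not m:
        raise ValueError(f'not a Brightcove player URL: {player_url}')
    return ('https://edge.api.brightcove.com/playback/v1/accounts/'
            f'{m.group("account_id")}/videos/{m.group("video_id")}')


player = ('https://players.brightcove.net/6057949427001/hndK61Wvr_default/'
          'index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG')
print(playback_api_url(player))
```

Only the account ID and the video ID (including its "ref:" prefix) carry over into the API URL; the player ID and embed name do not appear in it.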

bashonly commented 1 year ago

I think the API request is failing because it needs an Origin header of https://player.bfi.org.uk, which the BrightcoveNew extractor will handle if you smuggle it {'referrer': url}

I'm geo-blocked, but I think you could do something like this

diff --git a/yt_dlp/extractor/bfi.py b/yt_dlp/extractor/bfi.py
index 76f0516a4..3f4011c5c 100644
--- a/yt_dlp/extractor/bfi.py
+++ b/yt_dlp/extractor/bfi.py
@@ -1,7 +1,8 @@
 import re

+from .brightcove import BrightcoveNewIE
 from .common import InfoExtractor
-from ..utils import extract_attributes
+from ..utils import extract_attributes, smuggle_url

 class BFIPlayerIE(InfoExtractor):
@@ -23,6 +24,17 @@ def _real_extract(self, url):
         video_id = self._match_id(url)
         webpage = self._download_webpage(url, video_id)
         entries = []
+        for video_js in re.findall(r'<video-js[^>]+>', webpage):
+            player_attr = extract_attributes(video_js)
+            bc_id = player_attr.get('data-ref-id')
+            if not bc_id:
+                continue
+            bc_player = player_attr.get('data-pid') or player_attr.get('data-player') or 'hndK61Wvr'
+            bc_account = player_attr.get('data-acid') or '6057949427001'
+            bc_embed = player_attr.get('data-embed') or 'default'
+            entries.append(self.url_result(smuggle_url(
+                f'https://players.brightcove.net/{bc_account}/{bc_player}_{bc_embed}/index.html?videoId=ref:{bc_id}',
+                {'referrer': url}), BrightcoveNewIE))
         for player_el in re.findall(r'(?s)<[^>]+class="player"[^>]*>', webpage):
             player_attr = extract_attributes(player_el)
             ooyala_id = player_attr.get('data-video-id')
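
To make the Origin point concrete, here is a small standard-library sketch of turning the referring page URL into Origin and Referer headers for the playback API request. origin_headers and build_api_request are hypothetical helpers, not yt-dlp functions, and a real request would also need whatever else the extractor sends (its policy key, for instance). Nothing is sent over the network here.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def origin_headers(referrer):
    """Derive Origin/Referer headers from the embedding page URL (hypothetical helper)."""
    parts = urlsplit(referrer)
    return {
        'Origin': f'{parts.scheme}://{parts.netloc}',
        'Referer': referrer,
    }


def build_api_request(api_url, referrer):
    """Build, but do not send, a request carrying the derived headers."""
    return Request(api_url, headers=origin_headers(referrer))


page = 'https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online'
api = ('https://edge.api.brightcove.com/playback/v1/accounts/6057949427001/'
       'videos/ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG')
req = build_api_request(api, page)
print(req.get_header('Origin'))
```

The Origin value is just the scheme and host of the film page, which is why passing the page URL along as the referrer is enough to reconstruct it.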
ghost commented 1 year ago

That's got it! Thanks. However, it downloads every video on the page (such as extras and links); I've worked around that with a carefully placed break in the for loop. Arguably that's a feature rather than a bug. The only other thing I noticed is that the "ref:" prefix ends up in the filename, which produces an unprintable character because ":" isn't allowed in Windows filenames.

dirkf commented 1 year ago

Brightcove ought to have a public class method to construct its URL, or understand brightcove:new:{account_id}:{player_id}:{embed}:{content_type}:{content_id_or_ref}, eg 'brightcove:new:6057949427001:hndK61Wvr:default:video:ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG'.

This does the latter trick for yt-dlp's BrightcoveNewIE:

--- old/yt_dlp/extractor/brightcove.py
+++ new/yt_dlp/extractor/brightcove.py
@@ -620,7 +620,7 @@

 class BrightcoveNewIE(BrightcoveNewBaseIE):
     IE_NAME = 'brightcove:new'
-    _VALID_URL = r'https?://players\.brightcove\.net/(?P<account_id>\d+)/(?P<player_id>[^/]+)_(?P<embed>[^/]+)/index\.html\?.*(?P<content_type>video|playlist)Id=(?P<video_id>\d+|ref:[^&]+)'
+    _VALID_URL = r'(?:brightcove:new|(?P<u>https?)):(?(u)//players\.brightcove\.net/)(?P<account_id>\d+)(?(u)/|:)(?P<player_id>[^/]+)(?(u)_|:)(?P<embed>[^/]+)(?(u)/index\.html\?.*|:)(?P<content_type>video|playlist)(?(u)Id=|:)(?P<video_id>\d+|ref:[^&]+)'
     _TESTS = [{
         'url': 'http://players.brightcove.net/929656772001/e41d32dc-ec74-459e-a845-6c69f7b724ea_default/index.html?videoId=4463358922001',
         'md5': 'c8100925723840d4b0d243f7025703be',
@@ -862,7 +862,7 @@
             'ip_blocks': smuggled_data.get('geo_ip_blocks'),
         })

-        account_id, player_id, embed, content_type, video_id = self._match_valid_url(url).groups()
+        account_id, player_id, embed, content_type, video_id = re.match(self._VALID_URL, url).groups()[1:]

         policy_key_id = '%s_%s' % (account_id, player_id)
         policy_key = self.cache.load('brightcove', policy_key_id)

In that case there's no need to import the extractor, and the ie argument, if passed to url_result(), can just be the name/ie_key.
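
As a quick check of the conditional pattern, the snippet below copies the regex from the patch into a standalone script and runs both URL forms through it. The parse_brightcove wrapper is a hypothetical name used only for this demonstration.

```python
import re

# Pattern copied from the patch above; group "u" is set only for http(s) URLs,
# and each (?(u)...|...) conditional picks the URL or the pseudo-URL separator.
VALID_URL = (
    r'(?:brightcove:new|(?P<u>https?)):(?(u)//players\.brightcove\.net/)'
    r'(?P<account_id>\d+)(?(u)/|:)(?P<player_id>[^/]+)(?(u)_|:)'
    r'(?P<embed>[^/]+)(?(u)/index\.html\?.*|:)'
    r'(?P<content_type>video|playlist)(?(u)Id=|:)(?P<video_id>\d+|ref:[^&]+)')


def parse_brightcove(url):
    """Return (account_id, player_id, embed, content_type, video_id)."""
    m = re.match(VALID_URL, url)
    if not m:
        raise ValueError(f'unsupported URL: {url}')
    return m.groups()[1:]  # drop the helper group "u"


pseudo = ('brightcove:new:6057949427001:hndK61Wvr:default:video:'
          'ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG')
player = ('https://players.brightcove.net/6057949427001/hndK61Wvr_default/'
          'index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG')
print(parse_brightcove(pseudo))
print(parse_brightcove(player))
```

Both forms should yield the same five fields, which is why the patch slices off the first group before unpacking.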

Is there a test example with multiple <video-js> elements?

bashonly commented 1 year ago

Is there a test example with multiple <video-js> elements?

The OP's URL has 2

dirkf commented 1 year ago

OK, the test video in the extractor has only one. In that case, discriminate by the data-video-type: film vs extra (maybe other lead content, apart from film, exists?). We don't appear to have invented a metadata item equivalent to this.

Assuming that it should be possible to extract all the videos if wanted, these might be possible solutions:

dirkf commented 1 year ago

The ref: issue occurs because Brightcove has a numeric ID for videos. The ref: scheme allows them to support a per-account ID scheme. The Brightcove extractor could be modified:

dirkf commented 1 year ago

In the above PR, I chose:

[make Brightcove] understand brightcove:new:{account_id}:{player_id}:{embed}:{content_type}:{content_id_or_ref}

and

  • return the film when the Watch URL (containing query param play-film) is being processed, and a playlist for all content otherwise

and (using force_videoid)

  • support a smuggled video id, as in the generic extractor.
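
The film-versus-playlist choice can be sketched roughly as follows. This is only an illustration of the idea, not the code in the PR: select_entries and the dict keys are hypothetical, and the exact form of the play-film parameter is assumed to be a bare or valued query key.

```python
from urllib.parse import parse_qs, urlsplit


def select_entries(page_url, videos):
    """Pick the videos to return for a page (hypothetical helper).

    videos: list of dicts with 'ref_id' and 'video_type' keys, standing in
    for the attributes read from each <video-js> element.
    """
    # keep_blank_values lets a bare "?play-film" (no "=value") count as present.
    query = parse_qs(urlsplit(page_url).query, keep_blank_values=True)
    if 'play-film' in query:
        films = [v for v in videos if v.get('video_type') == 'film']
        if films:
            return films[:1]  # the Watch URL yields just the film
    return videos  # otherwise behave like a playlist of everything


videos = [
    {'ref_id': 'film-ref', 'video_type': 'film'},
    {'ref_id': 'extra-ref', 'video_type': 'extra'},
]
base = 'https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online'
print(select_entries(base + '?play-film', videos))
print(select_entries(base, videos))
```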
ghost commented 1 year ago

Very nice work here. Thanks to both @dirkf and @bashonly.

dirkf commented 1 year ago

https://player.bfi.org.uk/free/film/watch-education-of-the-deaf-1946-online is an example of an "unrated" film that requires a click-through on the website, but I can't see how the Brightcove data is modified when that happens. Without the click-through, extraction fails with "ERROR: The policy key provided does not permit this account or video, or the requested resource is inactive.".