Open ghost opened 1 year ago
This Brightcove URL based on the <video-js> element in the BFI page, with/out --referer, says "Access to this resource is forbidden by access policy.".
More research needed.
Ah interesting: https://players.brightcove.net/6057949427001/hndK61Wvr_default/index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG works for me. I was missing the "ref:" prefix on the video ID. I'll see if I can edit one of the extractors to work with this.
Actually scrap that - I get the same message as you when trying this with yt_dlp. What I reported above was true when trying that URL in Firefox.
Using a new FF private window each time, you could try opening the Brightcove URL directly, and then navigating to that URL from the film page. If either of these works, you could retry with browser tools on, then try and/or post the curl command that you get from the Copy to ... Curl command context menu for the Brightcove URL in the Network tab. Otherwise you must have some authorisation context (cookies, presumably) in your plain FF session.
Just setting UA 'Mozilla/5.0' doesn't seem to work.
In private FF tab (which works): curl "https://players.brightcove.net/6057949427001/hndK61Wvr_default/index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" -H "Accept-Language: en-GB,en;q=0.5" -H "Accept-Encoding: gzip, deflate, br" -H "DNT: 1" -H "Connection: keep-alive" -H "Upgrade-Insecure-Requests: 1" -H "Sec-Fetch-Dest: document" -H "Sec-Fetch-Mode: navigate" -H "Sec-Fetch-Site: none" -H "Sec-Fetch-User: ?1"
I did also try with --cookies-from-browser firefox
Perplexing one this...
The curl command works for me. Adding the UA and Sec-Fetch-... headers to the yt-dlp command still fails. The URL that's failing isn't the player URL as above but this one (--print-traffic reveals it):
https://edge.api.brightcove.com/playback/v1/accounts/6057949427001/videos/ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG
I think the API request is failing because it needs an Origin header of https://player.bfi.org.uk, which the BrightcoveNew extractor will handle if you smuggle it {'referrer': url}
I'm geo-blocked, but I think you could do something like this
diff --git a/yt_dlp/extractor/bfi.py b/yt_dlp/extractor/bfi.py
index 76f0516a4..3f4011c5c 100644
--- a/yt_dlp/extractor/bfi.py
+++ b/yt_dlp/extractor/bfi.py
@@ -1,7 +1,8 @@
 import re
 
+from .brightcove import BrightcoveNewIE
 from .common import InfoExtractor
-from ..utils import extract_attributes
+from ..utils import extract_attributes, smuggle_url
 
 
 class BFIPlayerIE(InfoExtractor):
@@ -23,6 +24,17 @@ def _real_extract(self, url):
         video_id = self._match_id(url)
         webpage = self._download_webpage(url, video_id)
         entries = []
+        for video_js in re.findall(r'<video-js[^>]+>', webpage):
+            player_attr = extract_attributes(video_js)
+            bc_id = player_attr.get('data-ref-id')
+            if not bc_id:
+                continue
+            bc_player = player_attr.get('data-pid') or player_attr.get('data-player') or 'hndK61Wvr'
+            bc_account = player_attr.get('data-acid') or '6057949427001'
+            bc_embed = player_attr.get('data-embed') or 'default'
+            entries.append(self.url_result(smuggle_url(
+                f'https://players.brightcove.net/{bc_account}/{bc_player}_{bc_embed}/index.html?videoId=ref:{bc_id}',
+                {'referrer': url}), BrightcoveNewIE))
         for player_el in re.findall(r'(?s)<[^>]+class="player"[^>]*>', webpage):
             player_attr = extract_attributes(player_el)
             ooyala_id = player_attr.get('data-video-id')
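For anyone wanting to check the URL construction outside yt-dlp, here is a minimal standard-library sketch of the same idea: read the data-* attributes of each <video-js> tag, build the players.brightcove.net URL with the ref: prefix, and derive the Origin value from the page URL. It assumes the page markup carries the attributes the diff above reads. The sample HTML and the helper names tag_attributes, brightcove_urls and origin_of are hypothetical, and yt-dlp's own extract_attributes and smuggle_url are deliberately not used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _TagAttrs(HTMLParser):
    """Collect the attributes of the first start tag that has any."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if not self.attrs:
            self.attrs = dict(attrs)


def tag_attributes(tag_html):
    # Hypothetical stand-in for yt-dlp's extract_attributes()
    parser = _TagAttrs()
    parser.feed(tag_html)
    parser.close()
    return parser.attrs


def brightcove_urls(webpage):
    """Build one Brightcove player URL per <video-js> tag with a data-ref-id."""
    urls = []
    for video_js in re.findall(r'<video-js[^>]+>', webpage):
        attrs = tag_attributes(video_js)
        ref_id = attrs.get('data-ref-id')
        if not ref_id:
            continue
        # Fallback values are the ones used in the diff above
        player = attrs.get('data-pid') or attrs.get('data-player') or 'hndK61Wvr'
        account = attrs.get('data-acid') or '6057949427001'
        embed = attrs.get('data-embed') or 'default'
        urls.append(
            f'https://players.brightcove.net/{account}/{player}_{embed}'
            f'/index.html?videoId=ref:{ref_id}')
    return urls


def origin_of(page_url):
    """Scheme and host of the embedding page, i.e. the Origin value."""
    parts = urlsplit(page_url)
    return f'{parts.scheme}://{parts.netloc}'


# Hypothetical markup, modelled on the attributes read in the diff
SAMPLE = (
    '<div><video-js data-ref-id="VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG" '
    'data-acid="6057949427001" data-pid="hndK61Wvr"></video-js>'
    '<video-js class="no-ref"></video-js></div>')

PAGE = 'https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online'
print(brightcove_urls(SAMPLE))
print(origin_of(PAGE))
```

The second <video-js> tag in the sample has no data-ref-id and is skipped, mirroring the continue in the patch.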
That's got it! Thanks - however it's downloading all videos on the page (such as extras and links) - I've worked around that with a carefully placed break in the for loop. Arguably that's a feature though - not a bug. The only other thing I noticed was the "ref:" was getting added to the filename which results in an unprintable character since ":" isn't allowed in Windows filenames.
Brightcove ought to have a public class method to construct its URL, or understand brightcove:new:{account_id}:{player_id}:{embed}:{content_type}:{content_id_or_ref}, eg 'brightcove:new:6057949427001:hndK61Wvr:default:video:ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG'.
This does the latter trick for yt-dlp's BrightcoveNewIE:
--- old/yt_dlp/extractor/brightcove.py
+++ new/yt_dlp/extractor/brightcove.py
@@ -620,7 +620,7 @@
 
 class BrightcoveNewIE(BrightcoveNewBaseIE):
     IE_NAME = 'brightcove:new'
-    _VALID_URL = r'https?://players\.brightcove\.net/(?P<account_id>\d+)/(?P<player_id>[^/]+)_(?P<embed>[^/]+)/index\.html\?.*(?P<content_type>video|playlist)Id=(?P<video_id>\d+|ref:[^&]+)'
+    _VALID_URL = r'(?:brightcove:new|(?P<u>https?)):(?(u)//players\.brightcove\.net/)(?P<account_id>\d+)(?(u)/|:)(?P<player_id>[^/]+)(?(u)_|:)(?P<embed>[^/]+)(?(u)/index\.html\?.*|:)(?P<content_type>video|playlist)(?(u)Id=|:)(?P<video_id>\d+|ref:[^&]+)'
     _TESTS = [{
         'url': 'http://players.brightcove.net/929656772001/e41d32dc-ec74-459e-a845-6c69f7b724ea_default/index.html?videoId=4463358922001',
         'md5': 'c8100925723840d4b0d243f7025703be',
@@ -862,7 +862,7 @@
             'ip_blocks': smuggled_data.get('geo_ip_blocks'),
         })
 
-        account_id, player_id, embed, content_type, video_id = self._match_valid_url(url).groups()
+        account_id, player_id, embed, content_type, video_id = re.match(self._VALID_URL, url).groups()[1:]
 
         policy_key_id = '%s_%s' % (account_id, player_id)
         policy_key = self.cache.load('brightcove', policy_key_id)
In that case there's no need to import the extractor, and the ie, if passed to url_result(), can just be the name/ie_key.
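To see how the conditional groups in the modified _VALID_URL behave, the pattern can be tried on its own. The sketch below copies the regex from the patch (split across lines only for readability) and checks that the player URL and the brightcove:new: short form yield the same five fields. The function name parse_brightcove is hypothetical. The group u is set only for http(s) URLs, so each (?(u)yes|no) picks the URL separator or the colon accordingly, and dropping the first group is what the [1:] in the patch does.

```python
import re

# Pattern copied from the patch above, concatenated from shorter pieces
VALID_URL = (
    r'(?:brightcove:new|(?P<u>https?)):'
    r'(?(u)//players\.brightcove\.net/)'
    r'(?P<account_id>\d+)(?(u)/|:)'
    r'(?P<player_id>[^/]+)(?(u)_|:)'
    r'(?P<embed>[^/]+)(?(u)/index\.html\?.*|:)'
    r'(?P<content_type>video|playlist)(?(u)Id=|:)'
    r'(?P<video_id>\d+|ref:[^&]+)'
)


def parse_brightcove(url):
    """Return (account_id, player_id, embed, content_type, video_id) or None."""
    m = re.match(VALID_URL, url)
    if not m:
        return None
    # groups()[0] is the helper group 'u' ('http'/'https' or None); skip it
    return m.groups()[1:]


LONG = ('https://players.brightcove.net/6057949427001/hndK61Wvr_default'
        '/index.html?videoId=ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG')
SHORT = ('brightcove:new:6057949427001:hndK61Wvr:default:video:'
         'ref:VqbHRudDoSzz5Tq0sYyT63qTMNaUlYWG')

print(parse_brightcove(LONG))
print(parse_brightcove(SHORT))
```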
Is there a test example with multiple <video-js> elements?
The OP's URL has 2
OK, the test video in the extractor has only one. In that case, discriminate by the data-video-type: film vs extra (maybe other lead content, apart from film, exists?). We don't appear to have invented a metadata item equivalent to this.
Assuming that it should be possible to extract all the videos if wanted, these might be possible solutions:
- return the film when the Watch URL (containing query param play-film) is being processed, and a playlist for all content otherwise
- add Extra: to the title of an item that isn't a film (so a matching filter can exclude it)
- treat --no-playlist as meaning that non-films should be omitted (but that only makes sense if pages only ever have one lead item, which I doubt).
The ref: issue occurs because Brightcove has a numeric ID for videos. The ref: scheme allows them to support a per-account ID scheme. The Brightcove extractor could be modified:
- to map ref:{video_id} into {account_id}_{video_id} (similar to VK), or
- to support a smuggled video id, as in the generic extractor.
In the above PR, I chose:
[make Brightcove] understand brightcove:new:{account_id}:{player_id}:{embed}:{content_type}:{content_id_or_ref}
and
- return the film when the Watch URL (containing query param play-film) is being processed, and a playlist for all content otherwise
and (using force_videoid)
- support a smuggled video id, as in the generic extractor.
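As a rough illustration of the "film only for the Watch URL, playlist otherwise" choice, the selection could look like the sketch below. It is not the code from the PR. The function name select_entries, the video_type key (standing in for the data-video-type attribute) and the exact shape of the play-film query string are all assumptions made for the example.

```python
from urllib.parse import parse_qs, urlsplit


def select_entries(page_url, entries):
    """Return only the film for a Watch URL, otherwise every entry.

    `entries` is a list of dicts with a hypothetical 'video_type' key
    taken from the page's data-video-type attribute ('film' or 'extra').
    """
    # keep_blank_values=True so a bare "?play-film" is still detected
    query = parse_qs(urlsplit(page_url).query, keep_blank_values=True)
    if 'play-film' in query:
        films = [e for e in entries if e.get('video_type') == 'film']
        if films:
            return films[:1]
    # No play-film parameter (or no film found): treat the page as a playlist
    return entries


ENTRIES = [
    {'id': 'ref:film-id', 'video_type': 'film'},
    {'id': 'ref:extra-id', 'video_type': 'extra'},
]
BASE = 'https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online'

print(select_entries(BASE + '?play-film', ENTRIES))
print(select_entries(BASE, ENTRIES))
```

Falling back to the full list when no film entry is found keeps the extractor from returning nothing on pages whose lead content has some other type.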
Very nice work here. Thanks to both @dirkf and @bashonly .
This https://player.bfi.org.uk/free/film/watch-education-of-the-deaf-1946-online is an example of an "unrated" film that requires a click-through on the website, but I can't see how the Brightcove data is being modified when that happens. Otherwise "ERROR: The policy key provided does not permit this account or video, or the requested resource is inactive.".
Region
United Kingdom
Provide a description that is worded well enough to be understood
BFI Player now uses Brightcove (sigh) and so the plugin needs updating.
I've been trying to work through it myself to see if it's similar to one of the other sites that use the BrightcoveNewIE extractor but haven't got far yet.
Inspecting the network traffic with Firefox, I can get the link to the m3u8 and then play it just fine in VLC and even rip it with ffmpeg, so it should be possible to extract. However, I'm not sure if this uses a time-limited token which complicates matters....
Example site: https://player.bfi.org.uk/free/film/watch-the-beatles-1963-online
Provide verbose output that clearly demonstrates the problem
Complete Verbose Output