Open dimitris1962 opened 2 years ago
Geo-restricted
@dirkf how could we help with that? I do have a nordvpn account which I could share and I could also get the source code or anything from an ertflix page if that helps.
As a start, could you get the plain web page code that's downloaded in the browser with JS disabled (otherwise the page will get transformed in ways that yt-dl wouldn't see).
In Mozilla browsers, the developer tools Network tab has a context menu item for each URL, Copy>Copy as cURL, which puts a curl command in the clipboard that replicates the selected request, so you can run that with -o page.html to generate it; or Copy>Copy Response should give you the returned page code that you could paste into a file. Either way, attach the file to your response. Ideally also capture and attach the request and response headers, e.g. using the Save All as HAR option.
Thanks! Ertflix depends on JS to load the mpd/m3u8 file, it does not exist in the original source code.
After the "view" button is clicked, an index.mpd file gets loaded and it points to a title.m3u8 file. Both the .mpd and the .m3u8 are compatible with youtube-dl.
Apparently it is a two-step task for ertflix, not a big deal though.
ps: I did not attach the results from curl (thanks for the hints) because it is just 140kb of JS.
Unless the video URL is somehow deducible from the original URL combined with stuff extracted from the non-JS page, we have to reverse engineer what the JS is doing and implement that in the extractor.
Further clues could be resources of type XHR fetched before the video URL when looking at the network trace in dev tools with JS enabled, especially where the response actually contains eg JSON that includes the video URL.
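One way to hunt for such clues is to save the network trace with Save All as HAR and scan the response bodies for manifest URLs. The following is only a minimal sketch: the function name, the regex and the sample data are made up for illustration, and it assumes the HAR stores response bodies as plain text (bodies saved base64-encoded would need decoding first).

```python
import json
import re

# Matches manifest or media file URLs inside a response body.
MEDIA_URL_RE = re.compile(r'https?://[^\s"\'\\]+\.(?:mpd|m3u8|mp4)\b')


def find_media_requests(har):
    """Yield (request_url, media_urls) for HAR entries whose body mentions a media URL."""
    for entry in har.get('log', {}).get('entries', []):
        body = entry.get('response', {}).get('content', {}).get('text') or ''
        media_urls = sorted(set(MEDIA_URL_RE.findall(body)))
        if media_urls:
            yield entry['request']['url'], media_urls


# Tiny made-up HAR fragment, for illustration only.
sample_har = {'log': {'entries': [
    {'request': {'url': 'https://example.invalid/app.js'},
     'response': {'content': {'text': 'var x = 1;'}}},
    {'request': {'url': 'https://example.invalid/api/content?codename=demo'},
     'response': {'content': {
         'text': json.dumps({'Url': 'https://example.invalid/demo/index.mpd'})}}},
]}}

hits = list(find_media_requests(sample_har))
```

With a real capture, the dict would come from something like json.load() on the saved .har file; the requests listed in hits are then the candidates to replicate in the extractor.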
Also see #24336 and the resurrected discussion in this issue https://github.com/ytdl-org/youtube-dl/issues/15960#issuecomment-964633552. The taxidi-sto-potami video is giving me 404, but the video from the linked comment works OK.
So here's a simple proof of concept ertflix.py (it needs to be in the extractor directory and imported in extractors.py):
# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor


class ERTFlixIE(InfoExtractor):
    _VALID_URL = r'https?://www\.ertflix\.gr/series/ser\.(?P<num_id>\d+)-(?P<id>[\w-]+)'
    _TESTS = [{
        'url': 'https://www.ertflix.gr/series/ser.3448-monogramma',
        'md5': '82e0734bba8aa7ef526c9dd00cf35a05',
        'info_dict': {
            'id': 'monogramma-giannakopoulos',
            'ext': 'mp4',
            'title': 'md5:6b4c42bac7662390e4013b3cb1166bd3',
            'description': 'md5:1a56a4d271d3de911cb083dae14e7aea',
            'thumbnail': 're:https?://.+\.jpg',
        },
        'params': {
            'format': 'bestvideo',
        }
    },
    ]

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        video_id = self._search_regex(r'https://files\.app\.ertflix\.gr/files/synentefxeis/%s/([\w-]+)' % (video_id, ), webpage, video_id)
        title = self._og_search_title(webpage)
        # instead of this magic knowledge we could use different magic knowledge to call self._download_json() on
        # 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=%s' % (video_id, ))
        # and parse the result
        formats = self._extract_mpd_formats(
            'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(video_id)s/%(video_id)s/index.mpd' % locals(),
            video_id, mpd_id='dash')
        return {
            'id': video_id,
            'formats': formats,
            'title': title,
            'description': self._og_search_description(webpage),
            'thumbnail': self._og_search_thumbnail(webpage),
        }
Thanks for the code, I'll try to understand it. Ertflix has changed their UI a couple of times over the last year (it wasn't always JS) but there might be something we can catch in the DOM.
So, I'm fooling around with https://www.ertflix.gr/vod/vod.173258-aoratoi-ergates. aoratoi-ergates is Greek for ghost-workers.
The original DOM (attached at bottom of comment) doesn't have much, but might have just enough: "codenameToId":{"aoratoi-ergates":"vod.173258"}. I believe it is good news that it only appears once.
Running curl 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=aoratoi-ergates' you'll get a response like the following. Pay attention to the codename in the curl URL, which is identical to the key in codenameToId.
{
"MediaFiles": [
{
"$type": "Insys.Video.Contracts.Messages.MediaFileResult, Insys.Video.Contracts",
"Id": 77579,
"MediaContentMediaFileId": 77678,
"RoleCodename": "main",
"RoleName": "main",
"Formats": [
{
"Id": 96983,
"Url": "https://mediaserve.ert.gr/bpk-vod/vodext/default/ghost_workers/ghost_workers/index.m3u8",
"Type": 2,
"Protection": 0,
"AudioBitrate": "0",
"VideoBitrate": "0",
"MediaTracks": [
{
"Type": "Audio",
"Index": 0,
"Name": "audio",
"DisplayName": "audio",
"ManifestId": "audio",
"IsVisible": true
},
{
"Type": "Audio",
"Index": 1,
"Name": "audio",
"DisplayName": "audio",
"ManifestId": "audio",
"IsVisible": true
},
{
"Type": "Audio",
"Index": 2,
"Name": "audio",
"DisplayName": "audio",
"ManifestId": "audio",
"IsVisible": true
},
{
"Type": "Audio",
"Index": 3,
"Name": "audio",
"DisplayName": "audio",
"ManifestId": "audio",
"IsVisible": true
},
{
"Type": "Audio",
"Index": 4,
"Name": "audio",
"DisplayName": "audio",
"ManifestId": "audio",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2,avc1.4D401F",
"Bitrate": 643000,
"Index": 5,
"Name": "360p",
"DisplayName": "360p",
"ManifestId": "360p",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2,avc1.4D401F",
"Bitrate": 1233000,
"Index": 6,
"Name": "432p",
"DisplayName": "432p",
"ManifestId": "432p",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2,avc1.640029",
"Bitrate": 2400000,
"Index": 7,
"Name": "720p",
"DisplayName": "720p",
"ManifestId": "720p",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2,avc1.640029",
"Bitrate": 3515000,
"Index": 8,
"Name": "1080p",
"DisplayName": "1080p",
"ManifestId": "1080p",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2,avc1.640029",
"Bitrate": 4702000,
"Index": 9,
"Name": "1080p",
"DisplayName": "1080p",
"ManifestId": "1080p",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2",
"Bitrate": 99000,
"Index": 10,
"Name": "99000",
"DisplayName": "99000",
"ManifestId": "99000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2",
"Bitrate": 133000,
"Index": 11,
"Name": "133000",
"DisplayName": "133000",
"ManifestId": "133000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2",
"Bitrate": 201000,
"Index": 12,
"Name": "201000",
"DisplayName": "201000",
"ManifestId": "201000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2",
"Bitrate": 269000,
"Index": 13,
"Name": "269000",
"DisplayName": "269000",
"ManifestId": "269000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "mp4a.40.2",
"Bitrate": 337000,
"Index": 14,
"Name": "337000",
"DisplayName": "337000",
"ManifestId": "337000",
"IsVisible": true
}
]
},
{
"Id": 96984,
"Url": "https://mediaserve.ert.gr/bpk-vod/vodext/default/ghost_workers/ghost_workers/index.mpd",
"Type": 9,
"Protection": 0,
"AudioBitrate": "0",
"VideoBitrate": "0",
"MediaTracks": [
{
"Type": "Audio",
"Bitrate": 93374,
"Index": 0,
"MimeType": "audio/mp4",
"ManifestId": "audio=93374",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "avc1.4D401F",
"Bitrate": 513000,
"Index": 1,
"Name": "360p",
"DisplayName": "360p",
"MimeType": "video/mp4",
"ManifestId": "video=513000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "avc1.4D401F",
"Bitrate": 1037000,
"Index": 2,
"Name": "432p",
"DisplayName": "432p",
"MimeType": "video/mp4",
"ManifestId": "video=1037000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "avc1.640029",
"Bitrate": 2074000,
"Index": 3,
"Name": "720p",
"DisplayName": "720p",
"MimeType": "video/mp4",
"ManifestId": "video=2074000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "avc1.640029",
"Bitrate": 3062000,
"Index": 4,
"Name": "1080p",
"DisplayName": "1080p",
"MimeType": "video/mp4",
"ManifestId": "video=3062000",
"IsVisible": true
},
{
"Type": "Video",
"Codec": "avc1.640029",
"Bitrate": 4118000,
"Index": 5,
"Name": "1080p",
"DisplayName": "1080p",
"MimeType": "video/mp4",
"ManifestId": "video=4118000",
"IsVisible": true
}
]
},
{
"Id": 96985,
"Url": "https://www.ertflix.gr/_MP4\\2021\\DOCUMENTARIES\\ghost_workers\\ghost_workers_4mbps.mp4",
"Type": 3,
"Protection": 0,
"AudioBitrate": "0",
"VideoBitrate": "0",
"MediaTracks": []
}
],
"Duration": "00:52:46",
"DurationSeconds": 3166,
"OrderNumber": 1,
"IsPaid": true
}
],
"Cap": {
"SessionId": "a96069e0-686c-4e72-824a-a32fb1efe696",
"SessionTimeoutSeconds": 91,
"CAPPublicUrl": "https://capc2.app.ertflix.gr/api/player/Ping",
"CAPIntervalSeconds": 30,
"CAPRetryCount": 1,
"CAPRetrySeconds": 1,
"CAPConnectionTimeoutSeconds": 30
},
"Signature": "27AF874A0ED4DF4C889DCA34F917CC2E0C54035B",
"AdditionalInfo": {
"age_rating": 8,
"StartWithAudioLangauge": null,
"StartWithAudioSubtype": "",
"StartWithSubtitlesEnabled": false,
"AdProtectionMarkers": null,
"GetAdverts": {
"$type": "Insys.Video.Adverts.Messages.GetAdvertsResponse, Insys.Video.Adverts.Contracts",
"AdvertItems": [
{
"Url": "https://ert.adman.gr/gbanner/?1623069914|3469/1x1&264335:=1623069914@1920x1200x32?/vast=3/testcookie1",
"AdvertType": "Preroll"
},
{
"Url": "https://ert.adman.gr/gbanner/?1623069899|3473/1x1&264335:=1623069899@1920x1200x32?/vast=3/testcookie1",
"AdvertType": "Midroll",
"TimePercentage": 51
}
],
"Result": {
"Success": true,
"Code": 0
}
},
"VmapUrl": "https://api.app.ertflix.gr/V1/Adverts/GetVmap?platformCodename=www&contentCodename=aoratoi-ergates&token=",
"ManifestManipulator": "https://api.app.ertflix.gr/v1/Player/ManipulateManifest"
},
"Result": {
"Success": true,
"Code": 0
}
}
MediaFiles -> Formats -> Url are the m3u8, mpd and mp4. Any of them can be used in youtube-dl (I tested the m3u8 file).
At the moment I do get a response with the simple curl above, but the actual browser request is the following, which has 2 more parameters: deviceKey and t.
curl 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&deviceKey=6d1482a2c35b555cc1cb8ed665b38dfd&codename=aoratoi-ergates&t=1641850505671' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0' -H 'Accept: application/json' -H 'Accept-Language: el,en-US;q=0.7,en;q=0.3' --compressed -H 'Referer: https://www.ertflix.gr/' -H 'Origin: https://www.ertflix.gr' -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: same-site' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' -H 'TE: trailers'
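The MediaFiles -> Formats -> Url path in the response above could be walked roughly as follows. This is a sketch, not extractor code: the function name and the 'hls'/'dash' labels are my own, the sample is a trimmed copy of the response, and the URLs are classified by file extension because the meaning of the numeric Type field is not documented anywhere in this thread.

```python
def collect_format_urls(response):
    """Return (kind, url) pairs found under MediaFiles -> Formats -> Url."""
    found = []
    for media_file in response.get('MediaFiles') or []:
        for fmt in media_file.get('Formats') or []:
            url = fmt.get('Url')
            if not url:
                continue
            # Classify by extension; the "Type" codes are left uninterpreted.
            ext = url.rsplit('.', 1)[-1].lower()
            kind = {'m3u8': 'hls', 'mpd': 'dash'}.get(ext, ext)
            found.append((kind, url))
    return found


# Trimmed copy of the response shown above.
sample = {
    'MediaFiles': [{
        'RoleCodename': 'main',
        'Formats': [
            {'Type': 2, 'Url': 'https://mediaserve.ert.gr/bpk-vod/vodext/default/'
                               'ghost_workers/ghost_workers/index.m3u8'},
            {'Type': 9, 'Url': 'https://mediaserve.ert.gr/bpk-vod/vodext/default/'
                               'ghost_workers/ghost_workers/index.mpd'},
        ],
    }],
}

format_urls = collect_format_urls(sample)
```

Each pair could then be handed to the matching manifest parser; a missing or empty MediaFiles list simply yields no formats.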
codenametoid
Hope this points in the right direction.
PS: The above will work for single video content (e.g. a movie). TV series add -s1-ep1 to the codename.
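To make the codename step concrete, here is a small sketch that pulls the codename out of the codenameToId object in the raw page and appends an episode suffix. Both helper names are hypothetical, the series codename in the usage note is a placeholder, and the -s<season>-ep<episode> pattern is generalised from the single -s1-ep1 example, so it may not hold for every series.

```python
import json
import re


def find_codename(webpage):
    """Return the first key of the "codenameToId" object in the page, or None."""
    match = re.search(r'"codenameToId"\s*:\s*(\{[^{}]*\})', webpage)
    if not match:
        return None
    mapping = json.loads(match.group(1))
    return next(iter(mapping), None)


def episode_codename(series_codename, season, episode):
    """Append the episode suffix (pattern assumed from the -s1-ep1 example)."""
    return '%s-s%d-ep%d' % (series_codename, season, episode)


# Page fragment quoted earlier in this thread.
page = '<script>{"codenameToId":{"aoratoi-ergates":"vod.173258"}}</script>'
codename = find_codename(page)
```

For a series, episode_codename('some-series', 1, 1) would give 'some-series-s1-ep1', which is the value to pass as codename to the API.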
Doubtless t is just the request time and probably deviceKey is a GUID that the site has invented to tag your client type as analysed by the site JS.
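If those guesses are right, the browser-style query could be rebuilt along these lines. Treat it purely as an illustration: that t is a millisecond Unix timestamp and that any random 32-digit hex string is accepted as deviceKey are assumptions, and build_acquire_url is a hypothetical helper (Python 3 only, unlike yt-dl itself).

```python
import time
import uuid
from urllib.parse import urlencode

API_URL = 'https://api.app.ertflix.gr/v1/Player/AcquireContent'


def build_acquire_url(codename, device_key=None, now_ms=None):
    """Build the AcquireContent URL with the two extra browser parameters."""
    query = [
        ('platformCodename', 'www'),
        # Assumption: a random 32-hex-digit value works as deviceKey.
        ('deviceKey', device_key or uuid.uuid4().hex),
        ('codename', codename),
        # Assumption: t is the request time in milliseconds since the epoch.
        ('t', int(time.time() * 1000) if now_ms is None else now_ms),
    ]
    return API_URL + '?' + urlencode(query)


# Reproduces the URL of the browser request quoted above.
example_url = build_acquire_url(
    'aoratoi-ergates',
    device_key='6d1482a2c35b555cc1cb8ed665b38dfd',
    now_ms=1641850505671)
```

Since the plain request without these two parameters already returns a response, they may well be optional.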
The Αόρατοι Εργάτες page (not how it was said 2500 years ago when I studied the language) shows that it is necessary to call the API, unless the codename just has to be transformed with '-' -> '_' for the m3u8 URL.
Currently my issue is that the m3u8 playlist gives '400 Bad Request' even though it can be fetched in the browser.
Then maybe we should proceed with the mpd or the mp4 file. In my experience, I never had an issue with the mpd file.
I am not sure I got that:
> The Αόρατοι Εργάτες page (not how it was said 2500 years ago when I studied the language) shows that it is necessary to call the API, unless the codename just has to be transformed with '-' -> '_' for the m3u8 URL.
The call to the API (without _, just as it is from the codenameToId) returns the correct links, with the English title, or whatever title they decided to use.
Albeit, we can't do without the API.
For instance this algorithm works on a sample of two.
Get the codename:
{"role":"photo","url":"https://files.app.ertflix.gr/files/xena-docs/ghost-workers/ghost-workers-ertflix-img.jpg","isMain":true}
-ertflix-img is at the end of the codename; strip it to get the codename and change any - to _ to get the codename variant to be used for constructing media URLs.
Then, construct the m3u8 URL from the codename variant and try to extract from it; construct the mpd URL and try that.
Like this (NB the regex tweaked to not match a 2-digit penultimate path component and match only {... "isMain":true ...}):
# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor


class ERTFlixIE(InfoExtractor):
    _VALID_URL = r'https?://www\.ertflix\.gr/(?:series/ser|vod/vod)\.(?P<num_id>\d+)-(?P<id>[\w-]+)'
    _TESTS = [{
        'url': 'https://www.ertflix.gr/series/ser.3448-monogramma',
        'md5': '9e87e3cba1ed955c23c73173d1df4867',
        'info_dict': {
            'id': 'monogramma-giannakopoulos',
            'ext': 'mp4',
            'title': 'md5:6b4c42bac7662390e4013b3cb1166bd3',
            'description': 'md5:1a56a4d271d3de911cb083dae14e7aea',
            'thumbnail': 're:https?://.+\.jpg',
        },
    },
    ]

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        video_id = self._search_regex(
            r'(?=\{[^}]*?"isMain"\s*:\s*true\b)[^}]+?"url"\s*:\s*"https?://files\.app\.ertflix\.gr/files/[\w-]+/[\w-]{3,}/([\w-]+)\.jpg"',
            webpage, video_id, default=False) or video_id
        if video_id.endswith('-ertflix-img'):
            video_id = video_id[:-len('-ertflix-img')]
            video_url_id = video_id.replace('-', '_')
        else:
            video_url_id = video_id
        title = self._og_search_title(webpage)
        # instead of this magic knowledge we could use different magic knowledge to call self._download_json() on
        # 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=%s' % (video_id, ))
        # and parse the result
        formats = self._extract_m3u8_formats(
            'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(video_url_id)s/%(video_url_id)s/index.m3u8' % locals(),
            video_id, m3u8_id='hls', ext='mp4', entry_protocol='m3u8_native', fatal=False)
        formats.extend(self._extract_mpd_formats(
            'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(video_url_id)s/%(video_url_id)s/index.mpd' % locals(),
            video_id, mpd_id='dash', fatal=False))
        self._sort_formats(formats)
        return {
            'id': video_id,
            'formats': formats,
            'title': title,
            'description': self._og_search_description(webpage),
            'thumbnail': self._og_search_thumbnail(webpage),
        }
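As a standalone check of the derivation used in the extractor above, the same regex and string transformation can be run outside yt-dl against the photo JSON quoted earlier. derive_media_urls and the constant names are hypothetical; only the regex, the -ertflix-img suffix and the URL template come from the code above.

```python
import re

IMAGE_RE = re.compile(
    r'(?=\{[^}]*?"isMain"\s*:\s*true\b)[^}]+?"url"\s*:\s*'
    r'"https?://files\.app\.ertflix\.gr/files/[\w-]+/[\w-]{3,}/([\w-]+)\.jpg"')
SUFFIX = '-ertflix-img'
MEDIA_URL = ('https://mediaserve.ert.gr/bpk-vod/vodext/default/'
             '%(name)s/%(name)s/index.%(ext)s')


def derive_media_urls(webpage):
    """Return candidate [m3u8, mpd] URLs from the main image entry, or None."""
    match = IMAGE_RE.search(webpage)
    if not match:
        return None
    name = match.group(1)
    if name.endswith(SUFFIX):
        # Strip the image suffix, then switch '-' to '_' for the media path.
        name = name[:-len(SUFFIX)].replace('-', '_')
    return [MEDIA_URL % {'name': name, 'ext': ext} for ext in ('m3u8', 'mpd')]


snippet = ('{"role":"photo","url":"https://files.app.ertflix.gr/files/xena-docs/'
           'ghost-workers/ghost-workers-ertflix-img.jpg","isMain":true}')
urls = derive_media_urls(snippet)
```

For this sample the derived URLs agree with the m3u8 and mpd URLs in the API response for aoratoi-ergates, which is the "sample of two" behaviour described above rather than a guarantee for other titles.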
The snippet works in some cases, thanks for that :)
Wouldn't it be better to make a call to the API and then dissect the JSON rather than constructing the links to mpd/m3u8/mp4?
Something like:
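import json
import urllib.request

codename = self._match_id(url)
api_url = 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=' + codename
with urllib.request.urlopen(api_url) as response:  # separate name, so the page url is not shadowed
    data = json.loads(response.read().decode())
print(data)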
This successfully gets the JSON, with the proper playlist links.
PS: I have no idea about Python, I am living in the PHP world, so excuse my ignorance of Python-related stuff.
Yes, but it depends whether the non-working cases can be handled easily or not.
One could take the view that it's difficult for the site to change its media URLs but easy to change how those are embedded in the JSON.
Anyhow it appears that geo-restriction isn't such an issue as was feared.
In yt-dl there is a pre-defined method that can be used to get the JSON, which we can wrap like this:
def _call_api(self, video_id, **params):
    json = self._download_json(
        'https://api.app.ertflix.gr/v1/Player/AcquireContent',
        video_id, fatal=False, query=params)
    return json if isinstance(json, dict) else None
Also the API is a bit more complex.
For a series (e.g. Μονόγραμμα in the test), the non-API hack gets the featured episode from the page.
With the API we have to get a playlist for the series by calling the Tile/GetSeriesDetails endpoint to get JSON whose episodeGroups member can be extracted as a dict, each of whose values includes an episodes list, each of whose values is a metadata dict with a codename value that can then be extracted with the Player/AcquireContent endpoint (better).
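The traversal just described might look like the sketch below. It assumes the Tile/GetSeriesDetails JSON has exactly the shape stated above (episodeGroups dict -> episodes list -> codename); the function name, the group keys and the sample codenames are invented for illustration and are not taken from a real response.

```python
def iter_episode_codenames(series_details):
    """Yield each episode codename found under episodeGroups -> episodes."""
    groups = series_details.get('episodeGroups') or {}
    for group in groups.values():
        for episode in group.get('episodes') or []:
            codename = episode.get('codename')
            if codename:
                yield codename


# Made-up data shaped like the description above.
sample_details = {
    'episodeGroups': {
        'group-1': {'episodes': [
            {'codename': 'demo-series-s1-ep1'},
            {'codename': 'demo-series-s1-ep2'},
        ]},
        'group-2': {'episodes': [
            {'codename': 'demo-series-s2-ep1'},
            {'title': 'entry without a codename'},
        ]},
    },
}

codenames = list(iter_episode_codenames(sample_details))
```

Each codename would then go through the Player/AcquireContent call, so a series page becomes a playlist of per-episode extractions.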