ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

www.ertflix.gr #30070

Open dimitris1962 opened 2 years ago

dimitris1962 commented 2 years ago

Checklist

Example URLs

Description

WRITE DESCRIPTION HERE

dirkf commented 2 years ago

Geo-restricted

bserem commented 2 years ago

@dirkf how could we help with that? I do have a nordvpn account which I could share and I could also get the source code or anything from an ertflix page if that helps.

dirkf commented 2 years ago

To start, could you get the plain web page code that the browser downloads with JS disabled? Otherwise the page gets transformed in ways that yt-dl wouldn't see.

In Mozilla browsers, the developer tools Network tab has a context menu item for each URL to Copy>Copy as cURL, which puts a curl command in the clipboard that replicates the selected request, so you can run that with a -o page.html to generate it; or Copy>Copy Response should give you the returned page code that you could paste into a file. Either way, attach the file to your response. Ideally also capture and attach the request and response headers, eg using the Save All as HAR option.

bserem commented 2 years ago

Thanks! Ertflix depends on JS to load the mpd/m3u8 file, it does not exist in the original source code.

After the "view" button is clicked, an index.mpd file gets loaded, and it points to a title.m3u8 file. Both the .mpd and the .m3u8 are compatible with youtube-dl.

Apparently it is a two-step task for ertflix, not a big deal though.

ps: I did not attach the results from curl (thanks for the hints) because it is just 140kb of JS.

dirkf commented 2 years ago

Unless the video URL is somehow deducible from the original URL combined with stuff extracted from the non-JS page, we have to reverse engineer what the JS is doing and implement that in the extractor.

Further clues could be resources of type XHR fetched before the video URL when looking at the network trace in dev tools with JS enabled, especially where the response actually contains eg JSON that includes the video URL.
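To make that search less tedious, here is a minimal standalone sketch (not yt-dl code) that scans a trace saved with Save All as HAR for responses whose body mentions a manifest. It assumes the standard HAR layout (log > entries > request/response); the function names and the sample URLs are made up for illustration, and bodies stored base64-encoded in the HAR would need decoding first.

```python
import json

MEDIA_HINTS = ('.mpd', '.m3u8')  # manifest extensions mentioned in this thread


def find_media_responses(har):
    """Return the URLs of requests whose response body mentions a manifest.

    `har` is the parsed content of a HAR file: a dict with a top-level
    'log' object holding an 'entries' list.
    """
    hits = []
    for entry in har.get('log', {}).get('entries', []):
        content = entry.get('response', {}).get('content', {})
        body = content.get('text') or ''
        if any(hint in body for hint in MEDIA_HINTS):
            hits.append(entry['request']['url'])
    return hits


def load_har(path):
    """Read a HAR file saved from the browser's Network tab."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Tiny in-memory HAR with made-up URLs, so the sketch runs without a file
sample_har = {'log': {'entries': [
    {'request': {'url': 'https://example.invalid/page.js'},
     'response': {'content': {'mimeType': 'application/javascript',
                              'text': 'var a = 1;'}}},
    {'request': {'url': 'https://example.invalid/api/content'},
     'response': {'content': {'mimeType': 'application/json',
                              'text': '{"url": "https://example.invalid/v/index.mpd"}'}}},
]}}
print(find_media_responses(sample_har))
```

With a real capture, `find_media_responses(load_har('ertflix.har'))` would list the candidate API calls to look at first; the file name is just an example.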

dirkf commented 2 years ago

Also see #24336 and the resurrected discussion in this issue https://github.com/ytdl-org/youtube-dl/issues/15960#issuecomment-964633552. The taxidi-sto-potami video is giving me 404, but the video from the linked comment works OK.

So here's a simple proof of concept ertflix.py (it needs to be in the extractor directory and imported in extractors.py):

# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor

class ERTFlixIE(InfoExtractor):
    _VALID_URL = r'https?://www\.ertflix\.gr/series/ser\.(?P<num_id>\d+)-(?P<id>[\w-]+)'
    _TESTS = [{
        'url': 'https://www.ertflix.gr/series/ser.3448-monogramma',
        'md5': '82e0734bba8aa7ef526c9dd00cf35a05',
        'info_dict': {
            'id': 'monogramma-giannakopoulos',
            'ext': 'mp4',
            'title': 'md5:6b4c42bac7662390e4013b3cb1166bd3',
            'description': 'md5:1a56a4d271d3de911cb083dae14e7aea',
            'thumbnail': r're:https?://.+\.jpg',
        },
        'params': {
            'format': 'bestvideo',
        }
    },
    ]

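    def _real_extract(self, url):
        # start from the slug in the page URL, e.g. 'monogramma'
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        # the real media ID is the path component after the slug in a
        # files.app.ertflix.gr URL embedded in the page
        video_id = self._search_regex(
            r'https://files\.app\.ertflix\.gr/files/synentefxeis/%s/([\w-]+)' % (video_id, ),
            webpage, video_id)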
        title = self._og_search_title(webpage)
        # instead of this magic knowledge we could use different magic knowledge to call self._download_json() on
        # 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=%s' % (video_id, ))
        # and parse the result
        formats = self._extract_mpd_formats(
            'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(video_id)s/%(video_id)s/index.mpd' % locals(),
            video_id, mpd_id='dash')
        return {
            'id': video_id,
            'formats': formats,
            'title': title,
            'description': self._og_search_description(webpage),
            'thumbnail': self._og_search_thumbnail(webpage),
        }

bserem commented 2 years ago

Thanks for the code, I'll try to understand it. Ertflix has changed their UI a couple of times over the last year (it wasn't always JS) but there might be something we can catch in the DOM.

bserem commented 2 years ago

So, I'm fooling around with https://www.ertflix.gr/vod/vod.173258-aoratoi-ergates. aoratoi-ergates is Greek for ghost-workers. The original DOM (attached at the bottom of this comment) doesn't have much, but it might have just enough.

Findings

Notes

At the moment I do get a response with the simple curl above, but the actual browser request is the following, which has 2 more parameters: deviceKey and t.

curl 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&deviceKey=6d1482a2c35b555cc1cb8ed665b38dfd&codename=aoratoi-ergates&t=1641850505671' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0' -H 'Accept: application/json' -H 'Accept-Language: el,en-US;q=0.7,en;q=0.3' --compressed -H 'Referer: https://www.ertflix.gr/' -H 'Origin: https://www.ertflix.gr' -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Sec-Fetch-Dest: empty' -H 'Sec-Fetch-Mode: cors' -H 'Sec-Fetch-Site: same-site' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' -H 'TE: trailers'

Todo to get ertflix in youtube-dl

Attachments

ertflix_vod_source.html.txt

Hope this points in the right direction.

bserem commented 2 years ago

PS: The above will work for single video content (eg: a movie). TV series add -s1-ep1 to the codename.

dirkf commented 2 years ago

Doubtless t is just the request time and probably deviceKey is a GUID that the site has invented to tag your client type as analysed by the site JS.
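As a quick check of the first guess, here is a small standalone sketch that reads t as milliseconds since the Unix epoch and rebuilds the request URL with the optional parameters. The millisecond interpretation is an assumption, and `decode_t` and `build_acquire_url` are illustrative names, not part of yt-dl or of the site's API.

```python
import time
from datetime import datetime, timezone
from urllib.parse import urlencode

API_URL = 'https://api.app.ertflix.gr/v1/Player/AcquireContent'


def decode_t(t_ms):
    """Interpret `t` as milliseconds since the Unix epoch (an assumption)."""
    return datetime.fromtimestamp(t_ms / 1000, tz=timezone.utc)


def build_acquire_url(codename, device_key=None, now_ms=None):
    """Build the request URL; deviceKey and t are only added when supplied."""
    params = {'platformCodename': 'www'}
    if device_key is not None:
        params['deviceKey'] = device_key
    params['codename'] = codename
    if now_ms is not None:
        params['t'] = now_ms
    return API_URL + '?' + urlencode(params)


# the value of t taken from the browser request quoted earlier in the thread
print(decode_t(1641850505671).date())
print(build_acquire_url('aoratoi-ergates', now_ms=int(time.time() * 1000)))
```

Under that reading, the t from the captured request falls on 10 January 2022 (UTC), which is plausible for a request time.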

The Αόρατοι Εργάτες page (not how it was said 2500 years ago when I studied the language) shows that it is necessary to call the API, unless the codename just has to be transformed with '-' -> '_' for the m3u8 URL.

Currently my issue is that the m3u8 playlist gives '400 Bad Request' even though it can be fetched in the browser.

vensires commented 2 years ago

Then maybe we should proceed with the mpd or the mp4 file. From my experience, I never had an issue with the mpd file.

bserem commented 2 years ago

I am not sure I got that:

The Αόρατοι Εργάτες page (not how it was said 2500 years ago when I studied the language) shows that it is necessary to call the API, unless the codename just has to be transformed with '-' -> '_' for the m3u8 URL.

The call to the API (without _, just using the codename as it comes from the codenameToId) returns the correct links, with the English title, or whatever title they decided to use.

Either way, we can't do without the API.

dirkf commented 2 years ago

For instance this algorithm works on a sample of two.

Get the codename:

Then, construct the m3u8 URL from the codename variant and try to extract from it; construct the mpd URL and try that.

Like this (NB the regex tweaked to not match a 2-digit penultimate path component and match only {... "isMain":true ...}):

# coding: utf-8
from __future__ import unicode_literals

from .common import InfoExtractor

class ERTFlixIE(InfoExtractor):
    _VALID_URL = r'https?://www\.ertflix\.gr/(?:series/ser|vod/vod)\.(?P<num_id>\d+)-(?P<id>[\w-]+)'
    _TESTS = [{
        'url': 'https://www.ertflix.gr/series/ser.3448-monogramma',
        'md5': '9e87e3cba1ed955c23c73173d1df4867',
        'info_dict': {
            'id': 'monogramma-giannakopoulos',
            'ext': 'mp4',
            'title': 'md5:6b4c42bac7662390e4013b3cb1166bd3',
            'description': 'md5:1a56a4d271d3de911cb083dae14e7aea',
            'thumbnail': r're:https?://.+\.jpg',
        },
    },
    ]

    def _real_extract(self, url):
        video_id = self._match_id(url)
        webpage = self._download_webpage(url, video_id)
        video_id = self._search_regex(
            r'(?=\{[^}]*?"isMain"\s*:\s*true\b)[^}]+?"url"\s*:\s*"https?://files\.app\.ertflix\.gr/files/[\w-]+/[\w-]{3,}/([\w-]+)\.jpg"',
            webpage, video_id, default=False) or video_id
        if video_id.endswith('-ertflix-img'):
            video_id = video_id[:-len('-ertflix-img')]
            video_url_id = video_id.replace('-', '_')
        else:
            video_url_id = video_id

        title = self._og_search_title(webpage)

        # instead of this magic knowledge we could use different magic knowledge to call self._download_json() on
        # 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=%s' % (video_id, ))
        # and parse the result
        formats = self._extract_m3u8_formats(
                'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(video_url_id)s/%(video_url_id)s/index.m3u8' % locals(),
                video_id, m3u8_id='hls', ext='mp4', entry_protocol='m3u8_native', fatal=False)

        formats.extend(self._extract_mpd_formats(
                'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(video_url_id)s/%(video_url_id)s/index.mpd' % locals(),
                video_id, mpd_id='dash', fatal=False))

        self._sort_formats(formats)

        return {
            'id': video_id,
            'formats': formats,
            'title': title,
            'description': self._og_search_description(webpage),
            'thumbnail': self._og_search_thumbnail(webpage),
        }
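
For trying the ID handling outside yt-dl, the same steps can be sketched as a standalone function: find the main image URL, strip the -ertflix-img suffix, swap '-' for '_' and build both manifest URLs. The name `manifest_urls` and the sample page fragment are made up for illustration; only the regex and the URL template come from the extractor above.

```python
import re

# Same pattern as in the extractor: the image entry flagged "isMain": true
IMG_RE = re.compile(
    r'(?=\{[^}]*?"isMain"\s*:\s*true\b)[^}]+?"url"\s*:\s*'
    r'"https?://files\.app\.ertflix\.gr/files/[\w-]+/[\w-]{3,}/([\w-]+)\.jpg"')
IMG_SUFFIX = '-ertflix-img'
TEMPLATE = 'https://mediaserve.ert.gr/bpk-vod/vodext/default/%(id)s/%(id)s/index.%(ext)s'


def manifest_urls(page, fallback_id):
    """Return (video_id, [m3u8_url, mpd_url]) derived from the page source."""
    match = IMG_RE.search(page)
    video_id = match.group(1) if match else fallback_id
    if video_id.endswith(IMG_SUFFIX):
        # image-derived IDs need the suffix removed and '-' turned into '_'
        video_id = video_id[:-len(IMG_SUFFIX)]
        url_id = video_id.replace('-', '_')
    else:
        url_id = video_id
    urls = [TEMPLATE % {'id': url_id, 'ext': ext} for ext in ('m3u8', 'mpd')]
    return video_id, urls


# Hypothetical page fragment; the real markup may differ
sample_page = ('{"isMain":true,"url":"https://files.app.ertflix.gr/files/'
               'images/abc/aoratoi-ergates-ertflix-img.jpg"}')
print(manifest_urls(sample_page, 'aoratoi-ergates'))
```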

bserem commented 2 years ago

The snippet works in some cases, thanks for that :)

Wouldn't it be better to make a call to the API and then dissect the JSON rather than constructing the links to mpd/m3u8/mp4?

Something like:

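        # requires `import json` and `import urllib.request` at the top of the module (Python 3)
        codename = self._match_id(url)
        api_url = 'https://api.app.ertflix.gr/v1/Player/AcquireContent?platformCodename=www&codename=' + codename
        # use a separate name for the response so the `url` argument is not shadowed
        with urllib.request.urlopen(api_url) as response:
            data = json.loads(response.read().decode())
        print(data)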

This successfully gets the JSON, with the proper playlist links.

ps: I have no idea about Python; I live in the PHP world, so excuse my ignorance of Python-related stuff.

dirkf commented 2 years ago

Yes, but it depends whether the non-working cases can be handled easily or not.

One could take the view that it's difficult for the site to change its media URLs but easy to change how those are embedded in the JSON.

Anyhow it appears that geo-restriction isn't such an issue as was feared.

In yt-dl there is a pre-defined method that can be used to get the JSON, which we can wrap like this:

    def _call_api(self, video_id, **params):
        json = self._download_json(
            'https://api.app.ertflix.gr/v1/Player/AcquireContent',
            video_id, fatal=False, query=params)
        return json if isinstance(json, dict) else None
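
Since the layout of the returned JSON could change, one hedged option is to avoid depending on it and simply walk the whole structure looking for manifest URLs. This is a standalone Python 3 sketch, not yt-dl code; `iter_media_urls` and every key in the sample response are invented for illustration, as the thread does not show the real response.

```python
def iter_media_urls(node, suffixes=('.mpd', '.m3u8', '.mp4')):
    """Yield every http(s) URL string in nested JSON data whose path
    (ignoring any query string) ends with one of `suffixes`."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_media_urls(value, suffixes)
    elif isinstance(node, list):
        for item in node:
            yield from iter_media_urls(item, suffixes)
    elif isinstance(node, str):
        path = node.split('?', 1)[0]
        if path.startswith(('http://', 'https://')) and path.endswith(suffixes):
            yield node


# Made-up response shape; the real AcquireContent JSON may look different
sample_response = {
    'MediaFiles': [{'Formats': [
        {'Url': 'https://example.invalid/a/index.mpd'},
        {'Url': 'https://example.invalid/a/index.m3u8?token=x'},
    ]}],
    'Title': 'index.mpd is only mentioned here',
}
print(list(iter_media_urls(sample_response)))
```

The trade-off is that a blind walk may also pick up URLs that are not the main video, such as trailers, so a real extractor would probably still want to check which branch each URL came from.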

dirkf commented 2 years ago

Also, the API is a bit more complex.

For a series (eg Μονόγραμμα in the test), the non-API hack gets the featured episode from the page.

With the API, we have to build a playlist for the series: call the Tile/GetSeriesDetails endpoint to get JSON whose episodeGroups member can be extracted as a dict. Each of its values includes an episodes list, and each item in that list is a metadata dict with a codename value, which can then be extracted with the Player/AcquireContent endpoint (the better option).
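
The traversal described above can be sketched as a standalone function. It assumes exactly the shape given in this comment (episodeGroups as a dict, each value holding an episodes list of dicts with a codename); the function name and the sample data, including the codenames, are hypothetical.

```python
def iter_episode_codenames(series_details):
    """Yield the codename of each episode found in a GetSeriesDetails-style
    dict, skipping anything that does not have the expected shape."""
    groups = series_details.get('episodeGroups')
    if not isinstance(groups, dict):
        return
    for group in groups.values():
        if not isinstance(group, dict):
            continue
        for episode in group.get('episodes') or []:
            codename = episode.get('codename') if isinstance(episode, dict) else None
            if codename:
                yield codename


# Hypothetical sample mirroring the structure described in the comment
sample_series = {'episodeGroups': {
    'season-1': {'episodes': [
        {'codename': 'monogramma-s1-ep1'},
        {'codename': 'monogramma-s1-ep2'},
        {'title': 'entry without a codename'},
    ]},
}}
print(list(iter_episode_codenames(sample_series)))
```

Each yielded codename would then be passed to the Player/AcquireContent call, and the results collected into a playlist.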