ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

pr0gramm.com #31433

Closed Leonetienne closed 1 year ago

Leonetienne commented 1 year ago

Example URLs

Single video: https://pr0gramm.com/new/video/5466437

Description

The site is kinda like reddit or 9gag. Building an extractor shouldn't be that complicated, since the videos are just mp4 files which are linked to right in the html source. In this case, it's https://vid.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4.

See:

<video class="item-image-actual" draggable="true" src="//vid.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4" type="video/mp4" loop="" autoplay="" preload="auto" style="width: 920px; height: 517px;">
</video>

dirkf commented 1 year ago

The video element is added dynamically with JS, so yt-dl doesn't see it.

For the problem URL, instead use https://pr0gramm.com/static/5466437:

$ youtube-dl -F -v 'https://pr0gramm.com/static/5466437'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-F', '-v', 'https://pr0gramm.com/static/5466437']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.5.2 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] 5466437: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 5466437: Downloading webpage
[generic] 5466437: Extracting information
[download] Downloading playlist: pr0gramm.com
[generic] playlist pr0gramm.com: Collected 1 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[info] Available formats for 5466437:
format code  extension  resolution note
0            mp4        unknown    
[download] Finished downloading playlist: pr0gramm.com
$ youtube-dl -g 'https://pr0gramm.com/static/5466437'
WARNING: Falling back on generic information extractor.
http://img.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4
$

Although the host name is different, this appears to be the same as the video at vid.pr0gramm.com (length 1:05).

Leonetienne commented 1 year ago

I had a go at implementing an extractor. Turns out, it's a piece of cake. Piece of difficult cake.

The whole DOM gets constructed via JavaScript, and just curling the URL won't return any information about the video (e.g. the engine finds no video formats). The included JavaScript would actually have to be run, like in a browser.

Any suggestions?
EDIT: I only saw the reply above just after submitting this comment.

Leonetienne commented 1 year ago

Thanks, it's great that they provide a static alternative. Is it possible to "reroute" the content page in the info extractor? E.g. to f'https://pr0gramm.com/static/{video_id}'?

dirkf commented 1 year ago

Yes, though use 'https://pr0gramm.com/static/' + video_id for yt-dl.

The basic logic will be like this:

from .common import InfoExtractor
from .generic import GenericIE
...
class Pr0grammIE(InfoExtractor):
    # are all videos /new/?
    _VALID_URL = r'https?://(?:www\.)?pr0gramm\.com/new/video/(?P<id>\d+)'
    # _TESTS = [...] !

    def _real_extract(self, url):
        video_id = self._match_id(url)
        # hand the static page over to the generic extractor
        return self.url_result('https://pr0gramm.com/static/' + video_id, video_id=video_id, ie=GenericIE.ie_key())
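
The rerouting step can also be tried on its own, outside yt-dl. This is only a minimal sketch: the helper name `to_static_url` is hypothetical, and the pattern is the `_VALID_URL` proposed above, which assumes every video lives under `/new/video/`.

```python
import re

# Same pattern as the proposed _VALID_URL; assumes videos live under /new/video/.
VALID_URL = r'https?://(?:www\.)?pr0gramm\.com/new/video/(?P<id>\d+)'


def to_static_url(url):
    """Map a content-page URL to its static counterpart (hypothetical helper)."""
    match = re.match(VALID_URL, url)
    if match is None:
        raise ValueError('not a recognised pr0gramm video URL: ' + url)
    # Plain concatenation, as advised above for yt-dl (no f-strings).
    return 'https://pr0gramm.com/static/' + match.group('id')


print(to_static_url('https://pr0gramm.com/new/video/5466437'))
```

Raising on a non-matching URL is a choice made for this sketch; inside an extractor, `_match_id` already handles that case.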

As the only available metadata is in the text under the video in the static page, you could extract that, so instead of the return statement and the generic import:

        # get the corresponding static page
        webpage = self._download_webpage('https://pr0gramm.com/static/' + video_id, video_id)
        formats = self._parse_html5_media_entries(url, webpage, video_id)
        # this raises if there are no formats
        self._sort_formats(formats)

        details = self._html_search_regex(
            r'</video>\s*<div\b[^>]*>([^<]+)</div>',
            webpage, 'video details', fatal=False) or ''
        # mung details to get timestamp and uploader
        # * ensure at least a list of length 2 (also: import re) 
        details = re.split(r'\s+by\s+', details, 1) + [None]
        # * get a date-time string of known (I hope) format
        details[0] = re.sub(r'\s+-\s+', ' ', details[0])

        return {
            'id': video_id,
            'title': 'pr0gramm video ' + video_id,
            'formats': formats,
            'timestamp': unified_timestamp(details[0]), # import from ..utils.py
            'uploader': details[1],
        }
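
The details munging above can be checked in isolation. A minimal sketch, assuming the text under the video looks roughly like "date - time by uploader"; the sample string and the helper name `split_details` are made up for illustration, and the real wording on the static page may differ.

```python
import re


def split_details(details):
    """Split a details line into [date-time string, uploader or None]."""
    # maxsplit=1 so only the first " by " separates the two fields;
    # appending [None] guarantees index 1 exists even when there is no " by ".
    parts = re.split(r'\s+by\s+', details or '', maxsplit=1) + [None]
    # Collapse the " - " between date and time into a single space.
    parts[0] = re.sub(r'\s+-\s+', ' ', parts[0])
    return parts[:2]


# Hypothetical sample text, not copied from the site.
print(split_details('2022-12-21 - 18:30:00 by SomeUser'))
print(split_details(''))
```

The `+ [None]` trick is what keeps `details[1]` safe when the regex search fails and `details` falls back to the empty string.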

... it's great that they provide a static alternative

According to the webpage comments this is for "shitty browsers" like yt-dl!

Leonetienne commented 1 year ago

@dirkf Thank you so much! I think I have gotten it to work, with a lot of fiddling.

Though _parse_html5_media_entries does not do the trick because of ...reasons. The video tag implementation is kinda strange. Anyway, I am just grabbing the media url myself and supplying it via 'url'.

I will do a bit more testing and other steps required by the guide, and will prepare a PR.
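
The thread does not show the code used to grab the media URL by hand, so the following is only one possible sketch using the Python 3 standard library. It assumes the static page carries a `<video>` tag shaped like the one quoted at the top of this issue; the class and function names are hypothetical, and an actual yt-dl extractor would use the project's own helpers and compat layer instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class VideoSrcParser(HTMLParser):
    """Collect the src attribute of every <video> and <source> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ('video', 'source'):
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def first_video_url(page_url, html):
    """Return the first media URL found, resolved against the page URL."""
    parser = VideoSrcParser()
    parser.feed(html)
    if not parser.sources:
        return None
    # urljoin also resolves protocol-relative URLs such as //vid.pr0gramm.com/...
    return urljoin(page_url, parser.sources[0])


# Markup modelled on the <video> element quoted in the first comment.
SAMPLE = ('<video class="item-image-actual" '
          'src="//vid.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4" '
          'type="video/mp4" loop="" autoplay=""></video>')
print(first_video_url('https://pr0gramm.com/static/5466437', SAMPLE))
```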

october262 commented 1 year ago

why not just right click on the video and click "save as" to download the video ??

Leonetienne commented 1 year ago

why not just right click on the video and click "save as" to download the video ??

yt-dl clearly supports downloading videos which are just monolithic video files playable with a standard HTML5 video player, be it explicitly like here, or by default.
Why don't you put this question to whoever first implemented this functionality? I am just extending it.

dirkf commented 1 year ago

... Though _parse_html5_media_entries does not do the trick because of ...reasons.

No, because it returns a list of entries like a playlist whereas the manifest extractors (_extract_xxx_formats()) return a list of formats (bah!). The entries are half-baked as they never have keys beyond 'formats', 'subtitles', 'thumbnail'. So:


        # get the corresponding static page
        webpage = self._download_webpage('https://pr0gramm.com/static/' + video_id, video_id)
        entries = self._parse_html5_media_entries(url, webpage, video_id)
        media_info = entries[0]
        # this raises if there are no formats
        self._sort_formats(media_info.get('formats') or [])

        details = self._html_search_regex(
            r'</video>\s*<div\b[^>]*>([^<]+)</div>',
            webpage, 'video details', fatal=False) or ''
        # mung details to get timestamp and uploader
        # * ensure at least a list of length 2 (also: import re) 
        details = re.split(r'\s+by\s+', details, 1) + [None]
        # * get a date-time string of known (I hope) format
        details[0] = re.sub(r'\s+-\s+', ' ', details[0])

        return merge_dicts({ # import from ..utils.py
            'id': video_id,
            'title': 'pr0gramm video ' + video_id,
            'timestamp': unified_timestamp(details[0]), # import from ..utils.py
            'uploader': details[1],
        }, media_info)
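
To see why the merge works, here is a simplified stand-in for `merge_dicts`. It is not the real yt-dl code: as far as I understand it, the real function additionally lets a non-empty string replace an empty one, which this sketch leaves out. The entry below is hypothetical and only mimics the keys mentioned above ('formats', 'subtitles', 'thumbnail').

```python
def merge_dicts_sketch(*dicts):
    """Simplified stand-in for yt-dl's merge_dicts (not the real code)."""
    merged = {}
    for a_dict in dicts:
        for key, value in a_dict.items():
            if value is None:
                continue  # None never populates or overrides a key
            if key not in merged:
                merged[key] = value  # the first non-None value wins
    return merged


# Hypothetical result of _parse_html5_media_entries: a list of entries,
# each of which wraps its own list of formats.
entries = [{
    'formats': [{'url': 'http://img.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4'}],
    'subtitles': {},
    'thumbnail': None,
}]
media_info = entries[0]

info = merge_dicts_sketch({
    'id': '5466437',
    'title': 'pr0gramm video 5466437',
    'timestamp': None,  # e.g. the details line could not be parsed
    'uploader': None,
}, media_info)
print(sorted(info))
```

Because earlier dicts take priority, the hand-built 'id' and 'title' survive, while 'formats' is filled in from the parsed entry.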

Leonetienne commented 1 year ago

@dirkf Thanks, this is working flawlessly!

I've added test values, run them on a few Python versions, checked linting with flake8 and against the contribution guide, and submitted a PR.