Closed Leonetienne closed 1 year ago
The video element is added dynamically with JS, so yt-dl doesn't see it.
For the problem URL, instead use https://pr0gramm.com/static/5466437:
$ youtube-dl -F -v 'https://pr0gramm.com/static/5466437'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-F', '-v', 'https://pr0gramm.com/static/5466437']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.5.2 (CPython) - Linux-4.4.0-210-generic-i686-with-Ubuntu-16.04-xenial
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] 5466437: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 5466437: Downloading webpage
[generic] 5466437: Extracting information
[download] Downloading playlist: pr0gramm.com
[generic] playlist pr0gramm.com: Collected 1 video ids (downloading 1 of them)
[download] Downloading video 1 of 1
[info] Available formats for 5466437:
format code extension resolution note
0 mp4 unknown
[download] Finished downloading playlist: pr0gramm.com
$ youtube-dl -g 'https://pr0gramm.com/static/5466437'
WARNING: Falling back on generic information extractor.
http://img.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4
$
Although the host name is different, this appears to be the same as the video at vid.pr0gramm.com (length 1:05).
I had a go at implementing an extractor. Turns out, it's a piece of cake. Piece of difficult cake.
The whole DOM gets constructed via JavaScript, and just curling the URL won't return any information about the video (i.e. the engine finds no video formats). The included JavaScript would actually have to be run, like in a browser.
Any suggestions?
EDIT: did see the reply above just after submitting this comment
Thanks, it's great that they provide a static alternative. Is it possible to "reroute" the content page in the info extractor? E.g. to f'https://pr0gramm.com/static/{video_id}'?
Yes, though use 'https://pr0gramm.com/static/' + video_id for yt-dl.
The basic logic will be like this:
from .common import InfoExtractor
from .generic import GenericIE
...
class Pr0grammIE(InfoExtractor):
    # are all videos /new/?
    _VALID_URL = r'https?://(?:www\.)?pr0gramm\.com/new/video/(?P<id>\d+)'
    # _TESTS = [...] !

    def _real_extract(self, url):
        video_id = self._match_id(url)
        # hand the static page over to the generic extractor
        return self.url_result(
            'https://pr0gramm.com/static/' + video_id,
            video_id=video_id, ie=GenericIE.ie_key())
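The rerouting step can be tried outside yt-dl with nothing but the standard library. This is only a sketch of the URL mapping: `static_url` is a hypothetical helper name, not part of yt-dl, and it assumes that every video page lives under /new/video/, which the comment in the code above still leaves open.

```python
import re

# Same pattern as _VALID_URL in the extractor sketch above.
VALID_URL = r'https?://(?:www\.)?pr0gramm\.com/new/video/(?P<id>\d+)'


def static_url(url):
    """Map a content-page URL to its static counterpart, or None if it does not match."""
    m = re.match(VALID_URL, url)
    if m is None:
        return None
    # Plain concatenation rather than an f-string, as suggested for yt-dl.
    return 'https://pr0gramm.com/static/' + m.group('id')


print(static_url('https://pr0gramm.com/new/video/5466437'))
```

A URL outside /new/video/ returns None here, so any other path prefixes the site uses would need the pattern to be widened.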
As the only available metadata is in the text under the video in the static page, you could extract that, so instead of the return statement and the generic import:
# get the corresponding static page
webpage = self._download_webpage('https://pr0gramm.com/static/' + video_id, video_id)
formats = self._parse_html5_media_entries(url, webpage, video_id)
# this raises if there are no formats
self._sort_formats(formats)
details = self._html_search_regex(
    r'</video>\s*<div\b[^>]*>([^<]+)</div>',
    webpage, 'video details', fatal=False) or ''
# mung details to get timestamp and uploader
# * ensure at least a list of length 2 (also: import re)
details = re.split(r'\s+by\s+', details, 1) + [None]
# * get a date-time string of known (I hope) format
details[0] = re.sub(r'\s+-\s+', ' ', details[0])
return {
    'id': video_id,
    'title': 'pr0gramm video ' + video_id,
    'formats': formats,
    'timestamp': unified_timestamp(details[0]),  # import from ..utils.py
    'uploader': details[1],
}
... it's great that they provide a static alternative
According to the webpage comments this is for "shitty browsers" like yt-dl!
@dirkf Thank you so much! I think I have gotten it to work, with a lot of fiddling.
Though _parse_html5_media_entries does not do the trick because of ...reasons. The video tag implementation is kinda strange. Anyway, I am just grabbing the media URL myself and supplying it via 'url'.
I will do a bit more testing and other steps required by the guide, and will prepare a PR.
why not just right click on the video and click "save as" to download the video ??
yt-dl clearly supports downloading videos which are just monolithic video files playable with a standard HTML5 video player, be it explicitly like here, or by default.
Why don't you put this question to whoever first implemented this functionality? I am just extending it.
... Though _parse_html5_media_entries does not do the trick because of ...reasons.
No, because it returns a list of entries like a playlist whereas the manifest extractors (_extract_xxx_formats()) return a list of formats (bah!). The entries are half-baked as they never have keys beyond 'formats', 'subtitles', 'thumbnail'. So:
# get the corresponding static page
webpage = self._download_webpage('https://pr0gramm.com/static/' + video_id, video_id)
entries = self._parse_html5_media_entries(url, webpage, video_id)
media_info = entries[0]
# this raises if there are no formats
self._sort_formats(media_info.get('formats') or [])
details = self._html_search_regex(
    r'</video>\s*<div\b[^>]*>([^<]+)</div>',
    webpage, 'video details', fatal=False) or ''
# mung details to get timestamp and uploader
# * ensure at least a list of length 2 (also: import re)
details = re.split(r'\s+by\s+', details, 1) + [None]
# * get a date-time string of known (I hope) format
details[0] = re.sub(r'\s+-\s+', ' ', details[0])
return merge_dicts({  # import from ..utils.py
    'id': video_id,
    'title': 'pr0gramm video ' + video_id,
    'timestamp': unified_timestamp(details[0]),  # import from ..utils.py
    'uploader': details[1],
}, media_info)
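The details munging and the final merge can be checked in isolation. The sketch below makes several assumptions: the sample details string is invented (the real text under the video may be formatted differently), `parse_details` is a hypothetical helper name, and `simple_merge` is a simplified stand-in written for this example, not yt-dl's own merge_dicts.

```python
import re


def parse_details(details):
    """Split a details string into (date-time string, uploader or None)."""
    # Appending None guarantees index 1 exists even when ' by ' is absent.
    parts = re.split(r'\s+by\s+', details, maxsplit=1) + [None]
    # Collapse the ' - ' separator between date and time into one space.
    return re.sub(r'\s+-\s+', ' ', parts[0]), parts[1]


def simple_merge(*dicts):
    """Simplified stand-in for a dict merge: earlier values win, None is skipped."""
    merged = {}
    for d in dicts:
        for key, value in d.items():
            if value is not None and key not in merged:
                merged[key] = value
    return merged


# Hypothetical sample values, for illustration only.
when, uploader = parse_details('2022-12-21 - 18:03:12 by some_user')
media_info = {'formats': [{'url': 'http://example.invalid/video.mp4'}]}
info = simple_merge({'id': '5466437', 'uploader': uploader}, media_info)
print(when, uploader, sorted(info))
```

With an empty details string, `parse_details` returns an empty date-time string and None for the uploader, which is why the extractor code appends [None] before indexing.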
@dirkf Thanks, this is working flawlessly!
I've added test values, run them in a few Python versions, checked linting with flake8 and the contrib-guide, and submitted a PR.
Example URLs
Single video: https://pr0gramm.com/new/video/5466437
Description
The site is kinda like reddit or 9gag. Building an extractor shouldn't be that complicated, since the videos are just mp4 files which are linked to right in the html source. In this case, it's https://vid.pr0gramm.com/2022/12/21/62ae8aa5e2da0ebf.mp4.
See: