pornhub gif (actually short webm video) download from (https://www.pornhub.com/gif/)

mo-han commented 2 years ago

Checklist

[x] I'm reporting a site feature request
[x] I've verified that I'm running youtube-dl version 2021.12.17
[x] I've searched the bugtracker for similar site feature requests including closed ones

Description

youtube-dl treats a /gif/*** URL as a playlist and tries to download the "playlist", but nothing is downloaded.

dirkf commented 2 years ago

Please provide:

an example URL
a verbose log.

mo-han commented 2 years ago

youtube-dl -vv https://www.pornhub.com/gif/38435321
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-vv', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.6.9 (CPython) - Linux-4.15.0-188-generic-x86_64-with-Ubuntu-18.04-bionic
[debug] exe versions: ffmpeg 3.4.11, ffprobe 3.4.11
[debug] Proxy map: {}
[download] Downloading playlist: gif/38435321
[PornHubPagedVideoList] gif/38435321: Downloading page 1
[PornHubPagedVideoList] playlist gif/38435321: Downloading 0 videos
[download] Finished downloading playlist: gif/38435321

dirkf commented 2 years ago

The page seen by yt-dl has these video elements:

...
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm">
        <meta name="twitter:player:stream:content_type" content="video/webm">
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4">
        <meta name="twitter:player:stream:content_type" content="video/mp4">
      <meta name="twitter:player:width" content="1280">
      <meta name="twitter:player:height" content="720">
...
    <script type="application/ld+json">
            {
                "@context": "http://schema.org/",
                "@type": "VideoObject",
                "name": "leolulu intro 1",
                "description": "Check out leolulu intro 1 porn gif with Leolulu&comma; Threesome from video We were just trying to shoot a morning sex scene in the kitchen&period;&period;&period; Amateur Couple LeoLulu on Pornhub&period;com",
                "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
                "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
                "uploadDate": "2021-11-22"
            }
...
            <div
                id="js-gifToWebm"
                class="centerImage notModal"
                data-gif="https://dl.phncdn.com/gif/38435321.gif"
                data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
                data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
                data-gif-title="leolulu intro 1"
                data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
            >

That's 3 instances of the .mp4 (one of them as data-fallback), 3 of the target .webm, and 2 of the .gif.
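
As a rough illustration of how the media URLs in the last group (the js-gifToWebm div) could be read, here is a minimal stdlib sketch. It is not yt-dl code: GifDivParser, extract_formats and the shape of the returned dicts are hypothetical names chosen for the example, and PAGE is just the div quoted above.

```python
from html.parser import HTMLParser

# The div quoted above, copied as-is.
PAGE = '''
<div
    id="js-gifToWebm"
    class="centerImage notModal"
    data-gif="https://dl.phncdn.com/gif/38435321.gif"
    data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
    data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
    data-gif-title="leolulu intro 1"
    data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
>
'''


class GifDivParser(HTMLParser):
    """Collect the data-* attributes of the element with id="js-gifToWebm"."""

    def __init__(self):
        super().__init__()
        self.data = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get('id') == 'js-gifToWebm':
            # Strip the "data-" prefix: data-mp4 -> mp4, data-webm -> webm, ...
            self.data = {k[5:]: v for k, v in attrs.items() if k.startswith('data-')}


def extract_formats(webpage):
    parser = GifDivParser()
    parser.feed(webpage)
    # Hypothetical format dicts, loosely modelled on a "formats" list.
    return [
        {'format_id': ext, 'ext': ext, 'url': parser.data[ext]}
        for ext in ('webm', 'mp4')
        if parser.data.get(ext)
    ]


FORMATS = extract_formats(PAGE)
```

Reading the div rather than the twitter meta tags gives both the .webm and the .mp4 from a single element.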

First, we need to stop the wrong extractor from running by changing the URL pattern at line 636 of extractor/pornhub.py:

 class PornHubPagedVideoListIE(PornHubPagedPlaylistBaseIE):
-    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
+    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
     _TESTS = [{
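
To check the effect of the added negative lookahead in isolation, here is a small sketch. HOST_RE is a simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE (an assumption, since the real value isn't shown here), and OTHER_URL is a made-up listing path.

```python
import re

# Assumption: simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE.
HOST_RE = r'pornhub\.com'

OLD_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE
NEW_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE


def matches(pattern, url):
    """True if the URL pattern accepts the URL (anchored at the start)."""
    return re.match(pattern, url) is not None


GIF_URL = 'https://www.pornhub.com/gif/38435321'
OTHER_URL = 'https://www.pornhub.com/some/listing'  # hypothetical path

for url in (GIF_URL, OTHER_URL):
    print(url, matches(OLD_VALID_URL, url), matches(NEW_VALID_URL, url))
```

The old pattern accepts the gif URL with id gif/38435321, which is the "playlist" name in the first log. The new pattern rejects it, so the URL falls through to another extractor.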

Then the problem page is handled by the generic extractor, which finds the .webm, presumably from the second group (the ld+json script element):

$ python3.9 -m youtube_dl -v -F 'https://www.pornhub.com/gif/38435321'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 46b8ae2f5
[debug] Python version 3.9.13 (CPython) - Linux-4.4.0-210-generic-i686-with-glibc2.23
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] 38435321: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 38435321: Downloading webpage
[generic] 38435321: Extracting information
[info] Available formats for 38435321:
format code  extension  resolution note
0            webm       unknown    
$

This also finds a reasonable set of metadata:

{
  ...
  "title": "leolulu intro 1",
  "description": "Check out leolulu intro 1 porn gif with Leolulu, Threesome from video We were just trying to shoot a morning sex scene in the kitchen... Amateur Couple LeoLulu on Pornhub.com",
  "thumbnail": "https://dl.phncdn.com/gif/38435321.gif",
  "timestamp": 1637539200,
  "id": "38435321",
  "age_limit": 0,
  ...
  }
}
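
These fields line up with the ld+json block quoted earlier. The following stdlib sketch shows how they could be derived from it. It is not the generic extractor's own JSON-LD code; parse_video_object is a hypothetical helper, the description is abridged, and the closing script tag is assumed because the excerpt above elides it.

```python
import calendar
import html
import json
import re
from datetime import datetime

# Abridged copy of the ld+json block; the closing </script> is an assumption.
PAGE = '''
<script type="application/ld+json">
    {
        "@context": "http://schema.org/",
        "@type": "VideoObject",
        "name": "leolulu intro 1",
        "description": "Check out leolulu intro 1 &period;&period;&period; on Pornhub&period;com",
        "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
        "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
        "uploadDate": "2021-11-22"
    }
</script>
'''


def parse_video_object(webpage):
    """Return a metadata dict from the first ld+json VideoObject, or None."""
    m = re.search(
        r'(?s)<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>', webpage)
    if not m:
        return None
    data = json.loads(m.group(1))
    if data.get('@type') != 'VideoObject':
        return None
    upload = datetime.strptime(data['uploadDate'], '%Y-%m-%d')
    return {
        'title': html.unescape(data['name']),
        # &comma; and &period; are HTML5 named entities, decoded here.
        'description': html.unescape(data['description']),
        'url': data['contentUrl'],
        'thumbnail': data['thumbnailUrl'],
        # Midnight UTC on the upload date.
        'timestamp': calendar.timegm(upload.timetuple()),
    }


INFO = parse_video_object(PAGE)
```

For uploadDate 2021-11-22 this yields timestamp 1637539200, the value in the metadata above.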

Here the age_limit is wrong. PH claims to respect the RTA labelling scheme but adds the label via script, so the static page that yt-dl sees doesn't contain the text that the RTA check looks for.
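
To make the failure mode concrete: as far as I understand the RTA scheme, the label is a fixed string in a meta rating tag, so a check along these lines only works when the tag is in the served HTML. This is a sketch, not the actual _rta_search code; rta_age_limit and both sample pages are made up for illustration.

```python
import re

# The RTA label string (assumption: matched in a <meta name="rating"> tag).
RTA_LABEL = 'RTA-5042-1996-1400-1577-RTA'


def rta_age_limit(webpage):
    """Return 18 if the static HTML carries the RTA meta label, else 0."""
    pattern = r'(?i)<meta\s+name="rating"\s+content="%s"' % re.escape(RTA_LABEL)
    return 18 if re.search(pattern, webpage) else 0


# Hypothetical pages: one with the label in the markup, one that would
# only add it after running a script.
STATIC_PAGE = '<head><meta name="rating" content="RTA-5042-1996-1400-1577-RTA"></head>'
SCRIPTED_PAGE = '<head><script src="/add-rta-label.js"></script></head>'

print(rta_age_limit(STATIC_PAGE), rta_age_limit(SCRIPTED_PAGE))
```

A page of the second kind gets age_limit 0 from a static check, which matches what the metadata above shows.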

Some options:

make a special extractor for this URL pattern, which could also extract the mp4 format
prepare a list of "adult" domains by extracting the maximum age_limit for each domain from the extractor test cases
extend the list AGE_MARKERS in the generic extractor.
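
For the second option, the aggregation could look roughly like this. The sketch runs on a hand-written list rather than on the real extractor test cases; max_age_limit_by_domain, the sample entries and the example domains are all hypothetical.

```python
from urllib.parse import urlparse


def max_age_limit_by_domain(testcases):
    """Map each host (without a leading "www.") to the highest age_limit seen."""
    limits = {}
    for case in testcases:
        host = urlparse(case['url']).hostname or ''
        if host.startswith('www.'):
            host = host[4:]
        age = case.get('info_dict', {}).get('age_limit') or 0
        limits[host] = max(limits.get(host, 0), age)
    return limits


# Hypothetical test cases in the usual url/info_dict shape.
SAMPLE_TESTCASES = [
    {'url': 'https://www.adult.example/video/1', 'info_dict': {'age_limit': 18}},
    {'url': 'https://adult.example/video/2', 'info_dict': {}},
    {'url': 'https://news.example/clip/3', 'info_dict': {'age_limit': 0}},
]

ADULT_DOMAINS = sorted(
    host for host, age in max_age_limit_by_domain(SAMPLE_TESTCASES).items()
    if age >= 18)
```

Taking the maximum per domain means one test case with age_limit 18 is enough to flag the whole domain.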

Taking the last option, the page contains a link with id="RTAImage" and a link with text 2257 (18 U.S.C. §2257 is the US law requiring that the ages of porn performers be recorded).

This change catches both, but maybe the 2257 pattern will give too many false positives:

--- old/youtube_dl/extractor/generic.py
+++ new/youtube_dl/extractor/generic.py
@@ -2538,9 +2538,11 @@ class GenericIE(InfoExtractor):
         age_limit = self._rta_search(webpage)
         # And then there are the jokers who advertise that they use RTA,
         # but actually don't.
-        AGE_LIMIT_MARKERS = [
-            r'Proudly Labeled <a href="http://www\.rtalabel\.org/" title="Restricted to Adults">RTA</a>',
-        ]
+        AGE_LIMIT_MARKERS = (
+            r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
+            r'''<img\b[^>]+\b(?:id\s*=\s*["']RTAImage|alt\s*=\s*["']RTA)\b''',
+            r'(?:>\s*(?:(?:18\s+)?(?:U\.S\.C\.|USC)\s+)?§?|/)2257\b',
+        )
         if any(re.search(marker, webpage) for marker in AGE_LIMIT_MARKERS):
             age_limit = 18
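
To get a feel for the false-positive risk, the markers can be run against a few snippets outside yt-dl. In this sketch the dots in U.S.C. are escaped and the id branch allows whitespace after the equals sign. All the sample snippets are invented for illustration, not taken from the real page.

```python
import re

AGE_LIMIT_MARKERS = (
    r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
    r'''<img\b[^>]+\b(?:id\s*=\s*["']RTAImage|alt\s*=\s*["']RTA)\b''',
    r'(?:>\s*(?:(?:18\s+)?(?:U\.S\.C\.|USC)\s+)?§?|/)2257\b',
)


def looks_adult(webpage):
    """True if any of the markers is found in the page text."""
    return any(re.search(marker, webpage) for marker in AGE_LIMIT_MARKERS)


# Hypothetical snippets.
SAMPLES = {
    'rta_link': '<a href="http://www.rtalabel.org/" title="Restricted to Adults">RTA</a>',
    'rta_image': '<img id="RTAImage" src="/rta.png">',
    '2257_link': '<a href="/information#btn-2257">2257</a>',
    'statute': '<p>18 U.S.C. §2257 compliance</p>',
    # Unrelated page that happens to have 2257 as a path component.
    'false_positive': '<a href="/products/2257">Widget</a>',
    'harmless': '<p>Order 22570 shipped</p>',
}

RESULTS = {name: looks_adult(snippet) for name, snippet in SAMPLES.items()}
print(RESULTS)
```

The false_positive sample is flagged only because of the /2257 alternative, which is the kind of match the concern above is about.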

ytdl-org / youtube-dl

pornhub gif (actually short webm video) download from (https://www.pornhub.com/gif/) #31176
