ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense
pornhub gif (actually short webm video) download from (https://www.pornhub.com/gif/) #31176

Open mo-han opened 2 years ago

mo-han commented 2 years ago

Description

youtube-dl treats a /gif/*** URL as a playlist and tries to download the "playlist", but nothing is downloaded.

dirkf commented 2 years ago

Please:

mo-han commented 2 years ago

youtube-dl -vv https://www.pornhub.com/gif/38435321
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-vv', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.6.9 (CPython) - Linux-4.15.0-188-generic-x86_64-with-Ubuntu-18.04-bionic
[debug] exe versions: ffmpeg 3.4.11, ffprobe 3.4.11
[debug] Proxy map: {}
[download] Downloading playlist: gif/38435321
[PornHubPagedVideoList] gif/38435321: Downloading page 1
[PornHubPagedVideoList] playlist gif/38435321: Downloading 0 videos
[download] Finished downloading playlist: gif/38435321

dirkf commented 2 years ago

The page seen by yt-dl has these video elements:

...
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm">
        <meta name="twitter:player:stream:content_type" content="video/webm">
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4">
        <meta name="twitter:player:stream:content_type" content="video/mp4">
      <meta name="twitter:player:width" content="1280">
      <meta name="twitter:player:height" content="720">
...
    <script type="application/ld+json">
            {
                "@context": "http://schema.org/",
                "@type": "VideoObject",
                "name": "leolulu intro 1",
                "description": "Check out leolulu intro 1 porn gif with Leolulu&comma; Threesome from video We were just trying to shoot a morning sex scene in the kitchen&period;&period;&period; Amateur Couple LeoLulu on Pornhub&period;com",
                "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
                "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
                "uploadDate": "2021-11-22"
            }
...
            <div
                id="js-gifToWebm"
                class="centerImage notModal"
                data-gif="https://dl.phncdn.com/gif/38435321.gif"
                data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
                data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
                data-gif-title="leolulu intro 1"
                data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
            >

That's 3 instances of the .mp4 (counting the data-fallback attribute), 3 of the target .webm, and 2 of the .gif.

First we need to prevent the wrong extractor from running by changing the URL pattern at line 636 of extractor/pornhub.py:

 class PornHubPagedVideoListIE(PornHubPagedPlaylistBaseIE):
-    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
+    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
     _TESTS = [{
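
To see what the added negative lookahead does, here is a minimal sketch that compares the old and new patterns on the URL from this issue. HOST_RE is an assumption: a simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE, whose real value isn't shown in this thread, and matched_id is a hypothetical helper, not part of yt-dl.

```python
import re

# Assumption: simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE.
HOST_RE = r'pornhub\.com'

# Patterns copied from the diff above, with the stand-in host substituted.
OLD_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE
NEW_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE


def matched_id(pattern, url):
    """Return the captured id if the pattern matches the URL, else None."""
    m = re.match(pattern, url)
    return m.group('id') if m else None


gif_url = 'https://www.pornhub.com/gif/38435321'

# The old pattern claims the URL, with the id seen in the verbose log.
print(matched_id(OLD_VALID_URL, gif_url))
# The new pattern rejects it, so another extractor gets a chance.
print(matched_id(NEW_VALID_URL, gif_url))
```

With the stand-in host, the old pattern yields the id gif/38435321 (the "playlist" name in the log above), while the new one does not match at all.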

The problem page is then handled by the generic extractor, which finds the .webm, presumably from the second group (the ld+json script element):

$ python3.9 -m youtube_dl -v -F 'https://www.pornhub.com/gif/38435321'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 46b8ae2f5
[debug] Python version 3.9.13 (CPython) - Linux-4.4.0-210-generic-i686-with-glibc2.23
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] 38435321: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 38435321: Downloading webpage
[generic] 38435321: Extracting information
[info] Available formats for 38435321:
format code  extension  resolution note
0            webm       unknown    
$

This also finds a reasonable set of metadata:

{
  ...
  "title": "leolulu intro 1",
  "description": "Check out leolulu intro 1 porn gif with Leolulu, Threesome from video We were just trying to shoot a morning sex scene in the kitchen... Amateur Couple LeoLulu on Pornhub.com",
  "thumbnail": "https://dl.phncdn.com/gif/38435321.gif",
  "timestamp": 1637539200,
  "id": "38435321",
  "age_limit": 0,
  ...
  }
}
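
For illustration, the fields above can be derived from the ld+json element quoted earlier. The following is only a sketch of that mapping, not yt-dl's own JSON-LD handling: extract_video_object is a hypothetical helper, the description in PAGE is shortened, and the upload date is assumed to mean midnight UTC.

```python
import html
import json
import re
from datetime import datetime, timezone

# Abridged copy of the ld+json element from the page (description shortened).
PAGE = '''
<script type="application/ld+json">
{
    "@context": "http://schema.org/",
    "@type": "VideoObject",
    "name": "leolulu intro 1",
    "description": "leolulu intro 1 with Leolulu&comma; Threesome&period;&period;&period;",
    "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
    "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
    "uploadDate": "2021-11-22"
}
</script>
'''


def extract_video_object(webpage):
    """Hypothetical helper: map the first ld+json VideoObject to info fields."""
    m = re.search(
        r'(?s)<script[^>]+type="application/ld\+json"[^>]*>(.+?)</script>',
        webpage)
    if not m:
        return None
    data = json.loads(m.group(1))
    if data.get('@type') != 'VideoObject':
        return None
    # Assumption: treat the date-only uploadDate as midnight UTC.
    upload = datetime.strptime(data['uploadDate'], '%Y-%m-%d').replace(
        tzinfo=timezone.utc)
    return {
        'title': html.unescape(data['name']),
        # Decodes entities such as &comma; and &period; in the raw text.
        'description': html.unescape(data['description']),
        'url': data['contentUrl'],
        'thumbnail': data['thumbnailUrl'],
        'timestamp': int(upload.timestamp()),
    }


info = extract_video_object(PAGE)
print(info['timestamp'])  # 1637539200
```

The timestamp agrees with the value in the metadata above, and the entity decoding accounts for the plain commas and periods in the extracted description.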

Here the age_limit is wrong. PH claims to respect the RTA labelling scheme but adds the label with a script, so the page that yt-dl sees doesn't actually contain the text that yt-dl looks for according to the RTA scheme.

Some options:

Taking the last option, the page contains a link with id="RTAImage" and a link with text 2257 (18 U.S.C. §2257 is the US law requiring that porn performers' ages be recorded).

This change catches both, but maybe the 2257 pattern will give too many false positives:

--- old/youtube_dl/extractor/generic.py
+++ new/youtube_dl/extractor/generic.py
@@ -2538,9 +2538,11 @@ class GenericIE(InfoExtractor):
         age_limit = self._rta_search(webpage)
         # And then there are the jokers who advertise that they use RTA,
         # but actually don't.
-        AGE_LIMIT_MARKERS = [
-            r'Proudly Labeled <a href="http://www\.rtalabel\.org/" title="Restricted to Adults">RTA</a>',
-        ]
+        AGE_LIMIT_MARKERS = (
+            r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
+            r'''<img\b[^>]+\b(?:id\s*=["']RTAImage|alt\s*=\s*["']RTA)\b''',
+            r'(?:>\s*(?:(?:18\s+)?(?:U.S.C.|USC)\s+)?§?|/)2257\b',
+        )
         if any(re.search(marker, webpage) for marker in AGE_LIMIT_MARKERS):
             age_limit = 18
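
A quick way to probe the false-positive concern is to run the three markers over a few snippets. The patterns below are copied verbatim from the diff; the sample HTML strings and the looks_adult helper are hypothetical and are not taken from the actual page or from yt-dl.

```python
import re

# Patterns copied verbatim from the proposed diff.
AGE_LIMIT_MARKERS = (
    r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
    r'''<img\b[^>]+\b(?:id\s*=["']RTAImage|alt\s*=\s*["']RTA)\b''',
    r'(?:>\s*(?:(?:18\s+)?(?:U.S.C.|USC)\s+)?§?|/)2257\b',
)


def looks_adult(webpage):
    """Same test as in the diff: true if any marker matches the page."""
    return any(re.search(marker, webpage) for marker in AGE_LIMIT_MARKERS)


# Hypothetical snippets, invented for this sketch.
SAMPLES = {
    'rta_link': '<a href="http://www.rtalabel.org/" title="Restricted to Adults">RTA</a>',
    'rta_image': '<img id="RTAImage" src="/rta.png">',
    'law_link': '<a href="/information#btn-2257">2257</a>',
    'usc_text': '<p>18 U.S.C. §2257 Record-Keeping</p>',
    # An unrelated page whose URL path happens to end in /2257.
    'unrelated_path': '<a href="/products/2257">Widget</a>',
    'plain_number': '<p>Order number 12257 shipped</p>',
}

results = {name: looks_adult(snippet) for name, snippet in SAMPLES.items()}
for name, hit in results.items():
    print(name, hit)
```

In this sketch the unrelated_path sample also matches, via the /2257 alternative of the third pattern, which is the kind of false positive mentioned above.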