pornhub gif (actually short webm video) download from (https://www.pornhub.com/gif/) #31176

mo-han · 2022-08-17T05:55:13Z

Checklist

I'm reporting a site feature request
I've verified that I'm running youtube-dl version 2021.12.17
I've searched the bugtracker for similar site feature requests including closed ones

Description

youtube-dl treats a /gif/*** URL as a playlist and tries to download the "playlist", but nothing is downloaded.

dirkf · 2022-08-17T12:49:08Z

Please:

example URL
verbose log.

mo-han · 2022-08-18T01:08:26Z

youtube-dl -vv https://www.pornhub.com/gif/38435321
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-vv', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.6.9 (CPython) - Linux-4.15.0-188-generic-x86_64-with-Ubuntu-18.04-bionic
[debug] exe versions: ffmpeg 3.4.11, ffprobe 3.4.11
[debug] Proxy map: {}
[download] Downloading playlist: gif/38435321
[PornHubPagedVideoList] gif/38435321: Downloading page 1
[PornHubPagedVideoList] playlist gif/38435321: Downloading 0 videos
[download] Finished downloading playlist: gif/38435321

dirkf · 2022-08-19T18:00:13Z

The page seen by yt-dl has these video elements:

...
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm">
        <meta name="twitter:player:stream:content_type" content="video/webm">
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4">
        <meta name="twitter:player:stream:content_type" content="video/mp4">
      <meta name="twitter:player:width" content="1280">
      <meta name="twitter:player:height" content="720">
...
    <script type="application/ld+json">
            {
                "@context": "http://schema.org/",
                "@type": "VideoObject",
                "name": "leolulu intro 1",
                "description": "Check out leolulu intro 1 porn gif with Leolulu&comma; Threesome from video We were just trying to shoot a morning sex scene in the kitchen&period;&period;&period; Amateur Couple LeoLulu on Pornhub&period;com",
                "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
                "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
                "uploadDate": "2021-11-22"
            }
...
            <div
                id="js-gifToWebm"
                class="centerImage notModal"
                data-gif="https://dl.phncdn.com/gif/38435321.gif"
                data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
                data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
                data-gif-title="leolulu intro 1"
                data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
            >

In the excerpt above, that's 3 instances of the .mp4 (counting the data-fallback attribute), 3 of the target .webm, and 2 of the .gif.

First we need to prevent the wrong extractor from running by changing the URL pattern at l.636 of extractor/pornhub.py:

 class PornHubPagedVideoListIE(PornHubPagedPlaylistBaseIE):
-    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
+    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
     _TESTS = [{
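
The effect of the added negative lookahead can be checked in isolation. This is a minimal sketch, not code from the extractor: HOST_RE is a simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE, and LIST_URL is a made-up listing URL used only for illustration.

```python
import re

# Assumption: simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE.
HOST_RE = r'pornhub\.com'

# The two patterns from the diff above, with the stand-in host expression.
OLD_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE
NEW_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE


def matched_id(pattern, url):
    """Return the 'id' group if the pattern matches the URL, else None."""
    m = re.match(pattern, url)
    return m.group('id') if m else None


GIF_URL = 'https://www.pornhub.com/gif/38435321'
LIST_URL = 'https://www.pornhub.com/model/example/videos'  # hypothetical URL

if __name__ == '__main__':
    print(matched_id(OLD_VALID_URL, GIF_URL))   # old pattern claims the gif page
    print(matched_id(NEW_VALID_URL, GIF_URL))   # new pattern rejects it
    print(matched_id(NEW_VALID_URL, LIST_URL))  # other listing paths still match
```

With the old pattern the gif page is claimed with the id gif/38435321, which is the "playlist" name in the log above; with the new pattern it falls through to whichever extractor matches next.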

Then the problem page is handled by the generic extractor which finds the .webm, presumably from the second (ld+json script element) group:

$ python3.9 -m youtube_dl -v -F 'https://www.pornhub.com/gif/38435321'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 46b8ae2f5
[debug] Python version 3.9.13 (CPython) - Linux-4.4.0-210-generic-i686-with-glibc2.23
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] 38435321: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 38435321: Downloading webpage
[generic] 38435321: Extracting information
[info] Available formats for 38435321:
format code  extension  resolution note
0            webm       unknown    
$

This also finds a reasonable set of metadata:

{
  ...
  "title": "leolulu intro 1",
  "description": "Check out leolulu intro 1 porn gif with Leolulu, Threesome from video We were just trying to shoot a morning sex scene in the kitchen... Amateur Couple LeoLulu on Pornhub.com",
  "thumbnail": "https://dl.phncdn.com/gif/38435321.gif",
  "timestamp": 1637539200,
  "id": "38435321",
  "age_limit": 0,
  ...
  }
}
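
For illustration, the ld+json block shown earlier is enough to produce those fields. This is a rough sketch of the mapping using only the standard library, not the generic extractor's actual code; ld_json_to_info is a hypothetical helper and the description is shortened here.

```python
import calendar
import datetime
import html
import json

# The VideoObject from the page excerpt (description shortened for this sketch).
LD_JSON = '''{
    "@context": "http://schema.org/",
    "@type": "VideoObject",
    "name": "leolulu intro 1",
    "description": "Check out leolulu intro 1 &period;&period;&period; on Pornhub&period;com",
    "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
    "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
    "uploadDate": "2021-11-22"
}'''


def ld_json_to_info(ld_json_text):
    """Hypothetical helper: map a schema.org VideoObject to info-dict style fields."""
    data = json.loads(ld_json_text)
    # uploadDate carries no time of day, so midnight UTC is assumed.
    day = datetime.datetime.strptime(data['uploadDate'], '%Y-%m-%d')
    return {
        'title': data['name'],
        # Decode HTML character references such as &period; and &comma;
        'description': html.unescape(data['description']),
        'url': data['contentUrl'],
        'thumbnail': data['thumbnailUrl'],
        'timestamp': calendar.timegm(day.timetuple()),
    }


info = ld_json_to_info(LD_JSON)

if __name__ == '__main__':
    print(info['timestamp'])
    print(info['description'])
```

Nothing in the ld+json block carries an age rating, so a value for age_limit has to come from somewhere else on the page.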

In this metadata the age_limit is wrong: it should be 18, not 0. PH claims to respect the RTA labelling scheme but adds the label with script, so the page yt-dl sees doesn't actually contain the text that the RTA scheme specifies and that yt-dl looks for.

Some options:

make a special extractor for this URL pattern, which could also extract the mp4 format
prepare a list of "adult" domains by extracting the maximum age_limit for each domain from the extractor test cases
extend the list AGE_MARKERS in the generic extractor.
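
As a rough idea of what the first option could involve, both formats can be read from the js-gifToWebm div shown earlier. This is only a sketch under that assumption about the page markup; extract_gif_formats is a hypothetical helper, not an existing yt-dl extractor method.

```python
import re

# The div from the page excerpt above.
PAGE = '''
<div
    id="js-gifToWebm"
    class="centerImage notModal"
    data-gif="https://dl.phncdn.com/gif/38435321.gif"
    data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
    data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
    data-gif-title="leolulu intro 1"
    data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
>
'''


def extract_gif_formats(webpage):
    """Hypothetical helper: collect mp4/webm URLs from the js-gifToWebm div."""
    div = re.search(r'<div\b[^>]*\bid="js-gifToWebm"[^>]*>', webpage)
    if not div:
        return None
    # Map each data-* attribute name (without the prefix) to its value.
    attrs = dict(re.findall(r'\bdata-([\w-]+)="([^"]*)"', div.group(0)))
    formats = [
        {'format_id': ext, 'ext': ext, 'url': attrs[ext]}
        for ext in ('mp4', 'webm') if attrs.get(ext)
    ]
    return {
        'title': attrs.get('gif-title'),
        'thumbnail': attrs.get('gif'),
        'formats': formats,
    }


result = extract_gif_formats(PAGE)

if __name__ == '__main__':
    for fmt in result['formats']:
        print(fmt['format_id'], fmt['url'])
```

A dedicated extractor along these lines could also set age_limit to 18 directly, since it only handles this site.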

Taking the last option, the page contains a link with id="RTAImage" and a link with the text 2257 (18 U.S.C. §2257 is the US law requiring that records of porn performers' ages be kept).

This change catches both, but maybe the 2257 pattern will give too many false positives:

--- old/youtube_dl/extractor/generic.py
+++ new/youtube_dl/extractor/generic.py
@@ -2538,9 +2538,11 @@ class GenericIE(InfoExtractor):
         age_limit = self._rta_search(webpage)
         # And then there are the jokers who advertise that they use RTA,
         # but actually don't.
-        AGE_LIMIT_MARKERS = [
-            r'Proudly Labeled <a href="http://www\.rtalabel\.org/" title="Restricted to Adults">RTA</a>',
-        ]
+        AGE_LIMIT_MARKERS = (
+            r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
+            r'''<img\b[^>]+\b(?:id\s*=["']RTAImage|alt\s*=\s*["']RTA)\b''',
+            r'(?:>\s*(?:(?:18\s+)?(?:U\.S\.C\.|USC)\s+)?§?|/)2257\b',
+        )
         if any(re.search(marker, webpage) for marker in AGE_LIMIT_MARKERS):
             age_limit = 18
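
The false-positive worry can be probed outside yt-dl. In this sketch the patterns are adapted from the diff above, with the dots in U.S.C. treated as literal; guess_age_limit is a hypothetical stand-in for the check in GenericIE, and all the sample snippets are invented for illustration rather than taken from real pages.

```python
import re

# Patterns adapted from the diff above.
AGE_LIMIT_MARKERS = (
    r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
    r'''<img\b[^>]+\b(?:id\s*=["']RTAImage|alt\s*=\s*["']RTA)\b''',
    r'(?:>\s*(?:(?:18\s+)?(?:U\.S\.C\.|USC)\s+)?§?|/)2257\b',
)


def guess_age_limit(webpage):
    """Hypothetical stand-in: 18 if any marker is found in the page, else 0."""
    return 18 if any(re.search(m, webpage) for m in AGE_LIMIT_MARKERS) else 0


# Invented sample snippets (assumptions, not real page content).
SAMPLES = {
    'rta link': 'Proudly Labeled <a href="http://www.rtalabel.org/" title="Restricted to Adults">RTA</a>',
    'rta image': '<img id="RTAImage" src="/rta.png">',
    '2257 link': '<a href="/information/2257">2257</a>',
    'statute text': '<p>18 U.S.C. §2257 record-keeping</p>',
    'catalogue path': '<a href="/catalogue/2257">Part no. 2257</a>',
    'invoice number': '<p>Invoice 12257 paid</p>',
}

if __name__ == '__main__':
    for name, snippet in SAMPLES.items():
        print(name, guess_age_limit(snippet))
```

The catalogue path sample is the kind of false positive in question: any URL path segment that is exactly 2257 trips the last pattern, whatever the site is about.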

dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 20, 2022

[PornHub] Also block gif/ URLs from PornHubPagedVideoListIE

026da96

* resolves ytdl-org#31176

dirkf added the broken-IE label ("problem with existing site extraction") Aug 20, 2022

dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 20, 2022

[PornHub] Also block gif/ URLs from PornHubPagedVideoListIE

45b00c3

* resolves ytdl-org#31176

dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 20, 2022

[PornHub] Also block gif/ URLs from PornHubPagedVideoListIE

d2344de

* resolves ytdl-org#31176

dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 24, 2022

[PornHub] Also block gif/ URLs from PornHubPagedVideoListIE

d9e0243

* resolves ytdl-org#31176

spirillen mentioned this issue Feb 6, 2025

m-pornhub.com mypdns/matrix#76419

Open
