pornhub gif (actually short webm video) download from (https://www.pornhub.com/gif/) #31176

Open
3 tasks done
mo-han opened this issue Aug 17, 2022 · 3 comments
Labels
broken-IE problem with existing site extraction

Comments

@mo-han

mo-han commented Aug 17, 2022

Checklist

  • I'm reporting a site feature request
  • I've verified that I'm running youtube-dl version 2021.12.17
  • I've searched the bugtracker for similar site feature requests including closed ones

Description

youtube-dl treats a /gif/*** URL as a playlist and tries to download the "playlist", but nothing is downloaded.

@dirkf
Contributor

dirkf commented Aug 17, 2022

Please provide:

  • an example URL
  • a verbose log.

@mo-han
Author

mo-han commented Aug 18, 2022

$ youtube-dl -vv https://www.pornhub.com/gif/38435321
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-vv', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Python version 3.6.9 (CPython) - Linux-4.15.0-188-generic-x86_64-with-Ubuntu-18.04-bionic
[debug] exe versions: ffmpeg 3.4.11, ffprobe 3.4.11
[debug] Proxy map: {}
[download] Downloading playlist: gif/38435321
[PornHubPagedVideoList] gif/38435321: Downloading page 1
[PornHubPagedVideoList] playlist gif/38435321: Downloading 0 videos
[download] Finished downloading playlist: gif/38435321

@dirkf
Contributor

dirkf commented Aug 19, 2022

The page seen by yt-dl has these video elements:

...
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm">
        <meta name="twitter:player:stream:content_type" content="video/webm">
      <meta name="twitter:player:stream" content="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4">
        <meta name="twitter:player:stream:content_type" content="video/mp4">
      <meta name="twitter:player:width" content="1280">
      <meta name="twitter:player:height" content="720">
...
    <script type="application/ld+json">
            {
                "@context": "http://schema.org/",
                "@type": "VideoObject",
                "name": "leolulu intro 1",
                "description": "Check out leolulu intro 1 porn gif with Leolulu&comma; Threesome from video We were just trying to shoot a morning sex scene in the kitchen&period;&period;&period; Amateur Couple LeoLulu on Pornhub&period;com",
                "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
                "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
                "uploadDate": "2021-11-22"
            }
...
            <div
                id="js-gifToWebm"
                class="centerImage notModal"
                data-gif="https://dl.phncdn.com/gif/38435321.gif"
                data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
                data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
                data-gif-title="leolulu intro 1"
                data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
            >

That's 3 instances of the .mp4 (counting the data-fallback attribute), 3 of the target .webm, and 2 of the .gif.

First we need to stop the wrong extractor from claiming the URL, by changing the URL pattern at line 636 of extractor/pornhub.py:

 class PornHubPagedVideoListIE(PornHubPagedPlaylistBaseIE):
-    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
+    _VALID_URL = r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)' % PornHubBaseIE._PORNHUB_HOST_RE
     _TESTS = [{
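
The effect of the added negative lookahead can be checked in isolation. This is only a sketch: HOST_RE is a simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE, and paged_list_id and the second URL are hypothetical names used for illustration.

```python
import re

# Assumption: a simplified stand-in for PornHubBaseIE._PORNHUB_HOST_RE.
HOST_RE = r'pornhub\.com'

# Patterns copied from the diff above, with the stand-in host substituted.
OLD_VALID_URL = r'https?://(?:[^/]+\.)?%s/(?P<id>(?:[^/]+/)*[^/?#&]+)' % HOST_RE
NEW_VALID_URL = (
    r'https?://(?:[^/]+\.)?%s/(?!playlist/|gif/)(?P<id>(?:[^/]+/)*[^/?#&]+)'
    % HOST_RE)


def paged_list_id(valid_url, url):
    """Return the captured id if the pattern claims the URL, else None."""
    m = re.match(valid_url, url)
    return m.group('id') if m else None


gif_url = 'https://www.pornhub.com/gif/38435321'
other_url = 'https://www.pornhub.com/some/listing'  # hypothetical listing URL

print(paged_list_id(OLD_VALID_URL, gif_url))    # gif/38435321 (wrongly claimed)
print(paged_list_id(NEW_VALID_URL, gif_url))    # None (left for another extractor)
print(paged_list_id(NEW_VALID_URL, other_url))  # some/listing (still claimed)
```

With the lookahead in place the paged-list pattern no longer matches /gif/ URLs, while other paths are matched as before.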

The problem page is then handled by the generic extractor, which finds the .webm, presumably from the second group above (the ld+json script element):

$ python3.9 -m youtube_dl -v -F 'https://www.pornhub.com/gif/38435321'
[debug] System config: ['--prefer-ffmpeg']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', '-F', 'https://www.pornhub.com/gif/38435321']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 46b8ae2f5
[debug] Python version 3.9.13 (CPython) - Linux-4.4.0-210-generic-i686-with-glibc2.23
[debug] exe versions: avconv 4.3, avprobe 4.3, ffmpeg 4.3, ffprobe 4.3
[debug] Proxy map: {}
[generic] 38435321: Requesting header
WARNING: Falling back on generic information extractor.
[generic] 38435321: Downloading webpage
[generic] 38435321: Extracting information
[info] Available formats for 38435321:
format code  extension  resolution note
0            webm       unknown    
$

This also finds a reasonable set of metadata:

{
  ...
  "title": "leolulu intro 1",
  "description": "Check out leolulu intro 1 porn gif with Leolulu, Threesome from video We were just trying to shoot a morning sex scene in the kitchen... Amateur Couple LeoLulu on Pornhub.com",
  "thumbnail": "https://dl.phncdn.com/gif/38435321.gif",
  "timestamp": 1637539200,
  "id": "38435321",
  "age_limit": 0,
  ...
  }
}
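
For illustration, most of this metadata can be recovered from the ld+json block alone. The sketch below is not the generic extractor's actual code: parse_video_object is a hypothetical helper, the description is abridged, and the closing script tag is assumed, since the excerpt above elides it.

```python
import calendar
import html
import json
import re
import time

# Abridged copy of the ld+json element quoted earlier; the closing
# </script> tag is an assumption, as the excerpt elides it.
SAMPLE = '''
<script type="application/ld+json">
    {
        "@context": "http://schema.org/",
        "@type": "VideoObject",
        "name": "leolulu intro 1",
        "description": "Check out leolulu intro 1 &period;&period;&period; on Pornhub&period;com",
        "contentUrl": "https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm",
        "thumbnailUrl": "https://dl.phncdn.com/gif/38435321.gif",
        "uploadDate": "2021-11-22"
    }
</script>
'''


def parse_video_object(webpage):
    """Hypothetical helper: build an info dict from a VideoObject block."""
    m = re.search(
        r'(?s)<script\b[^>]*\btype\s*=\s*"application/ld\+json"[^>]*>(.*?)</script>',
        webpage)
    if not m:
        return None
    data = json.loads(m.group(1))
    if data.get('@type') != 'VideoObject':
        return None
    # uploadDate carries no time of day, so treat it as midnight UTC.
    upload = time.strptime(data['uploadDate'], '%Y-%m-%d')
    return {
        'title': html.unescape(data['name']),
        # Entities such as &period; are decoded to plain characters.
        'description': html.unescape(data['description']),
        'url': data['contentUrl'],
        'thumbnail': data['thumbnailUrl'],
        'timestamp': calendar.timegm(upload),
    }


info = parse_video_object(SAMPLE)
print(info['timestamp'])  # 1637539200
```

The uploadDate of 2021-11-22 corresponds to the timestamp 1637539200 shown in the extracted metadata.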

In this metadata the age_limit of 0 is wrong: it should be 18. PH claims to respect the RTA labelling scheme but adds the label with a script, so the page that yt-dl sees doesn't contain the text that it looks for under the RTA scheme.
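
As a rough sketch of why the check comes up empty: assuming the check is a regex search for the RTA scheme's usual rating meta tag in the static HTML, a page that only adds the label from a script has nothing to match. rta_age_limit and both sample pages are hypothetical, and this is not yt-dl's own helper.

```python
import re

# The label string used by the RTA scheme in its rating meta tag.
RTA_LABEL = 'RTA-5042-1996-1400-1577-RTA'


def rta_age_limit(webpage):
    """Hypothetical check: 18 if the RTA meta tag is in the static HTML."""
    if re.search(
            r'(?i)<meta\s+name="rating"\s+content="%s"' % re.escape(RTA_LABEL),
            webpage):
        return 18
    return 0


# Hypothetical pages: one carries the tag in its markup, the other only
# loads a script that would add the label in a browser.
static_page = '<head><meta name="RATING" content="RTA-5042-1996-1400-1577-RTA"></head>'
scripted_page = '<head><script src="/js/rta.js"></script></head>'

print(rta_age_limit(static_page))    # 18
print(rta_age_limit(scripted_page))  # 0
```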

Some options:

  • make a special extractor for this URL pattern, which could also extract the mp4 format
  • prepare a list of "adult" domains by extracting the maximum age_limit for each domain from the extractor test cases
  • extend the list AGE_MARKERS in the generic extractor.
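
To give an idea of the first option, a dedicated extractor could read both formats from the js-gifToWebm element quoted earlier. This is only a sketch in plain Python, not written against the InfoExtractor API: extract_gif_formats is a hypothetical name, and the sample is trimmed from the markup above.

```python
import re

# Trimmed copy of the player element quoted earlier in this thread.
SAMPLE = '''
<div
    id="js-gifToWebm"
    class="centerImage notModal"
    data-gif="https://dl.phncdn.com/gif/38435321.gif"
    data-mp4="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
    data-webm="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.webm"
    data-fallback="https://dl.phncdn.com/pics/gifs/038/435/321/38435321a.mp4"
>
'''


def extract_gif_formats(webpage):
    """Hypothetical helper: list webm and mp4 formats from the player div."""
    tag = re.search(r'<div\b[^>]*\bid\s*=\s*"js-gifToWebm"[^>]*>', webpage)
    if not tag:
        return []
    formats = []
    for ext in ('webm', 'mp4'):
        m = re.search(r'\bdata-%s\s*=\s*"([^"]+)"' % ext, tag.group(0))
        if m:
            formats.append({'format_id': ext, 'ext': ext, 'url': m.group(1)})
    return formats


for fmt in extract_gif_formats(SAMPLE):
    print(fmt['format_id'], fmt['url'])
```

Unlike the generic extractor's result, this would offer the mp4 as a second format alongside the webm.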

Taking the last option, the page contains a link with id="RTAImage" and a link with the text 2257 (18 U.S.C. §2257 is the US law requiring that porn performers' ages be recorded).

This change catches both, although the 2257 pattern may give too many false positives:

--- old/youtube_dl/extractor/generic.py
+++ new/youtube_dl/extractor/generic.py
@@ -2538,9 +2538,11 @@ class GenericIE(InfoExtractor):
         age_limit = self._rta_search(webpage)
         # And then there are the jokers who advertise that they use RTA,
         # but actually don't.
-        AGE_LIMIT_MARKERS = [
-            r'Proudly Labeled <a href="http://www\.rtalabel\.org/" title="Restricted to Adults">RTA</a>',
-        ]
+        AGE_LIMIT_MARKERS = (
+            r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
+            r'''<img\b[^>]+\b(?:id\s*=["']RTAImage|alt\s*=\s*["']RTA)\b''',
+            r'(?:>\s*(?:(?:18\s+)?(?:U\.S\.C\.|USC)\s+)?§?|/)2257\b',
+        )
         if any(re.search(marker, webpage) for marker in AGE_LIMIT_MARKERS):
             age_limit = 18
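
The markers can be tried against a few snippets to get a feel for the false-positive risk. This is a sketch under stated assumptions: the patterns are taken from the patch (with the dots in U.S.C. escaped), marker_age_limit is a hypothetical stand-in that skips the preceding _rta_search step, and all the sample snippets are made up.

```python
import re

# Patterns taken from the patch above, with the dots in "U.S.C." escaped.
AGE_LIMIT_MARKERS = (
    r'<a\b[^>]+\bhref\s*=\s*"http://www\.rtalabel\.org/"[^>]+?(?:\btitle\s*=\s*"Restricted to Adults\b|>\s*RTA\b)',
    r'''<img\b[^>]+\b(?:id\s*=["']RTAImage|alt\s*=\s*["']RTA)\b''',
    r'(?:>\s*(?:(?:18\s+)?(?:U\.S\.C\.|USC)\s+)?§?|/)2257\b',
)


def marker_age_limit(webpage):
    """Hypothetical stand-in for the marker check: 18 if any marker matches."""
    return 18 if any(re.search(m, webpage) for m in AGE_LIMIT_MARKERS) else 0


# Made-up snippets, not taken from any real page.
samples = {
    'rta_link': '<a href="http://www.rtalabel.org/" title="Restricted to Adults">RTA</a>',
    'rta_image': '<img id="RTAImage" src="/rta.gif">',
    '2257_link': '<a href="/information#2257">18 U.S.C. §2257</a>',
    'plain_page': '<p>Order number 22570 has shipped.</p>',
    # An unrelated URL path ending in /2257 also trips the third marker.
    'false_positive': '<a href="/products/2257">Widget</a>',
}

for name, snippet in samples.items():
    print(name, marker_age_limit(snippet))
```

The last sample shows the kind of false positive mentioned above: any link whose path segment is exactly 2257 is treated as an adult marker.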

dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 20, 2022
@dirkf dirkf added the broken-IE problem with existing site extraction label Aug 20, 2022
dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 20, 2022
dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 20, 2022
dirkf added a commit to dirkf/youtube-dl that referenced this issue Aug 24, 2022