Reddit images+links (recursive) dump #15

Closed
Bfgeshka opened this issue May 20, 2017 · 22 comments

@Bfgeshka

What is your opinion on this, and can you do it? I'm not requesting it directly, just asking, since it can be tricky.

Plenty of subreddits are used as galleries. The comment section is no less important, so it deserves grabbing too.

The key point is that other modules of gallery-dl could be used for link processing.

@mikf
Owner

mikf commented May 20, 2017

I've looked a bit into the reddit API and

  • fetching all threads/posts of a subreddit
  • getting their respective comments
  • scanning these for external links/images

seems quite doable, so I will be working on implementing something like this.
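A rough sketch of the first step, just to illustrate the idea (reddit serves any listing as JSON when .json is appended to the URL; a descriptive User-Agent is needed because reddit throttles the default ones, and r/pics is only an example subreddit):

# fetch the first page of a subreddit listing and pull out the submission URLs
$ curl -s -A "gallery-dl-test/1.0" "https://www.reddit.com/r/pics/.json?limit=25" \
    | grep -o '"url": *"[^"]*"' | cut -d '"' -f 4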

If this isn't what you had in mind, then please correct me and/or provide a concrete example of what you want. Also, what exactly do you mean by a recursive dump?

@Bfgeshka
Author

Bfgeshka commented May 20, 2017

fetching all threads/posts of a subreddit
getting their respective comments
scanning these for external links/images

Yes, that's about what I have in mind.

Also what exactly do you mean by recursive dump?

I mean something like this: if a comment or post includes links to other reddit posts (not subreddits, that would be an endless mess), even in a different subreddit, those can be parsed as well. As optional behavior, of course.

@Hrxn
Contributor

Hrxn commented May 21, 2017

And links to other reddit posts inside those linked reddit posts? 😉

What I'm trying to say is that this alone would already end up being an endless mess, even without going recursively from subreddit to subreddit.

So be careful with that. And in some way, the depth must be limited when doing anything recursively.
So I strongly suggest starting simple and straightforward instead.

But in general, I think supporting reddit is a really good idea.

I'm not really a heavy user of reddit, but I know my way around the site, so just @-mention me here and I'll help with testing and stuff.

There are fundamentally only two types of posts that can be submitted to reddit:

  • Text only (which is probably not interesting at all for our case here)
  • Link/URL referring to an external resource.

But the comment thread that is part of a text-only submission can obviously include links to external resources such as direct links to images, etc. This also applies to the other type of post, of course.

In this second case, the vast majority of subreddits dealing with images feature something like this:

  • The link post is a direct link to an image hosted on Imgur.
  • The picture is hosted by reddit (Domain: i.redd.it). Not directly linked from the subreddit itself unfortunately.
  • The link is to an album on Imgur (http(s)://imgur.com/a/<ID>, or imgur.com/gallery/<ID>), the ID being a 5-character alphanumeric code.
  • Idiots linking to the Imgur page of a single image when they could use the direct link instead.

Many subreddits restrict hosting to only these domains, so that would be the easy part.
But other subreddits, other rules, so this doesn't apply everywhere.

The other easy variant is a post with a direct link to an image as the URL, hosted god knows where. But this is just a simple HTTP GET away, not really a big deal.
Sometimes, this is a direct link to a file hosted on Flickr. (xy.staticflickr.com/?????....) but sometimes it is the photo page on Flickr.
Also very popular on reddit: GIFs (huge surprise). But the real .gif image file format is on a steady decline, vastly outnumbered by "GIFV" on Imgur and links to Gfycat. In these cases, what we get is actually an MP4 file (or WebM, which is usually also available).
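To make these variants concrete, here is a hedged sketch of how the cases could be told apart with grep, assuming the candidate URLs sit one per line in a hypothetical urls.txt:

$ grep -E 'https?://i\.(imgur\.com|redd\.it)/' urls.txt            # direct images on Imgur / i.redd.it
$ grep -E 'https?://imgur\.com/(a|gallery)/[A-Za-z0-9]+' urls.txt  # Imgur albums and galleries
$ grep -E '\.(jpe?g|png|gifv?|webm|mp4)([?#]|$)' urls.txt          # other direct file links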

My point is that supporting all the different variants of posts on reddit is a pretty big deal. So maybe we should use this opportunity to reach a consensus on what would be a viable start, and a constructive way forward from there. 😄

A link with more than enough subreddits of images for the interested:
https://www.reddit.com/r/sfwpornnetwork/wiki/network

mikf added a commit that referenced this issue May 23, 2017
- these extractors scan submissions and their comments for
  (external) URLs and defer them to other extractors
- (#15)
mikf added a commit that referenced this issue May 23, 2017
- filter or complete some URLs
- remove the 'nofollow:' scheme before printing URLs
- (#15)
@Hrxn
Contributor

Hrxn commented May 24, 2017

So, what is supported so far? Direct external links to images and i.redd.it, as well as URLs gallery-dl already recognizes?

@mikf
Owner

mikf commented May 24, 2017

First of all: thank you for that huge amount of information you provided. It has been quite helpful.

So far I've only been working on those three points I posted above, which should be working by now. The next thing to implement would be recursion and all the different modules for gfycat, flickr and so on.

The only URLs that are supported right now are the ones that are already recognized, i.e. imgur albums. I thought I had a solution for direct links, which should be easy, but it affects the recursive extractor and makes it quite a bit less useful.

It is currently not possible to just enable/disable extractors on the fly, which would solve this issue, so that is another thing to work on.

edit: Direct links are now supported as well.

@Bfgeshka
Author

How are you going to handle generic links (articles, sites and so on)? They shouldn't be left behind either. I suggest optionally saving those links to a text file.

@mikf
Owner

mikf commented May 24, 2017

I hadn't originally planned to do anything special with unsupported links. They are (obviously) ignored when downloading, but you can still get them by using the -g option.

The issue here is that there is currently no distinction between supported and unsupported URLs when listing them with -g. I could either implement a filter to print supported/unsupported ones only, or prefix unsupported URLs with a '#'. Saving them to a file could then be done by

# filter between supported/unsupported
$ gallery-dl -g --unsupported [URL] > file

# prefix with '#'
$ gallery-dl -g [URL] | grep "^#" | cut -d "#" -f 2- > file

The second variant seems a bit too complex and wouldn't really be possible for Windows users either, so it is either the first one or, as you suggested, an option to simply write them to a file. I'm not quite sure which of the two I prefer.
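For what it's worth, the second variant could be made cross-platform by letting Python do the filtering instead of grep/cut (a sketch, assuming the hypothetical '#'-prefix output format):

$ gallery-dl -g [URL] | python -c "import sys; sys.stdout.writelines(l[1:] for l in sys.stdin if l.startswith('#'))" > file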

@Bfgeshka
Author

Bfgeshka commented May 25, 2017

$ gallery-dl -g [URL] | grep "^#" | cut -d "#" -f 2- > file

Too complicated, yeah.

$ gallery-dl -g --unsupported [URL] > file

This one is pretty good, actually.

Well, as long as all links get listed one way or another, it's fine.

mikf added a commit that referenced this issue May 26, 2017
reddit extractors now recursively visit other submissions/posts
linked to in the initial set of submissions.
This behaviour can be configured via the 'extractor.reddit.recursion'
key in the configuration file or by `-o recursion=<value>`.

Example:
{"extractor": {
  "reddit": {
   "recursion": <value>
}}}

Possible values:
* -1 - infinite recursion (don't do this)
*  0 - recursion is disabled (default)
*  1 and higher - maximum recursion level
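
For example, following links to other submissions one level deep:

$ gallery-dl -o recursion=1 https://www.reddit.com/r/<subreddit>/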
@Hrxn
Contributor

Hrxn commented May 26, 2017

1.
For subreddits like https://www.reddit.com/r/gifs/, usage seems to be something like 90% Imgur, 10% Gfycat and 1% redd.it.

Imgur extraction seems to work here, getting real .gif image files. Nothing wrong with that, but as you can see from the usage on reddit itself, most if not all "gifs" are using GIFV on Imgur, the URLs being like this: http://i.imgur.com/0gybAXR.gifv

And by the way, I just noticed this from directlink.py:
pattern = [r"https?://[^?&#]+\.(?:jpe?g|png|gifv?|webm|mp4)"]

Matching gifv won't really be of any use, I think. Because, as far as I know, it's not a real file format and isn't used anywhere. What you actually get is this: https://gist.github.com/anonymous/3d6330be06578a6ab2bb31074b8df321
This can be seen by fetching the URL (http://i.imgur.com/0gybAXR.gifv) with curl, for example.

HTML, with a bit of JavaScript and an embedded 'video' element. The link to the 'real' file (MP4) can be seen there. (Or, in the browser: Right click > Copy video address).

So far, I know of Imgur, but I've also seen that reddit itself uses this "trick" for GIFs hosted on its i.redd.it domain.

The thing is, there is not just a difference in visual quality (MP4 being superior to GIF with its palette limitation), but there is also a huge difference in the resulting file sizes. So, maybe it would be a good idea to optionally get MP4s instead of 'real' GIFs from Imgur.
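
To illustrate (assuming the GIFV page still embeds the MP4 path in its video 'source' tag; in practice the direct file usually lives at the same path with the extension swapped):

# pull the MP4 URL out of the GIFV page's HTML
$ curl -sL http://i.imgur.com/0gybAXR.gifv | grep -o 'i\.imgur\.com/[^"]*\.mp4' | head -n 1

# or just swap the extension and download directly
$ curl -sLO http://i.imgur.com/0gybAXR.mp4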

2.
Not sure what gallery-dl is doing right now when running against a subreddit; I assume it uses whatever the reddit API returns as its predefined default. If possible, adding support for the different sorting options on reddit might also be a good idea, i.e. hot, new, rising, controversial, top, gilded, especially top with its different sub-options (all time, past hour, past 24 hours, etc.).

mikf added a commit that referenced this issue May 27, 2017
@mikf
Owner

mikf commented May 27, 2017

I kind of forgot about imgur's use of gifv when rewriting the imgur extractor, but it should be fixed with bf452a8. Imgur thankfully provides the prefer_video flag in its metadata, which tells you if something advertised as GIF or GIFV also has an MP4 version. gallery-dl currently prefers MP4 over GIF whenever possible, but I might add an option for that.
I've also removed the gifv extension from the direct-link regex.

Regarding the reddit API: I'm using https://reddit.com/r/<subreddit>/.json to get a listing of submissions in a specific subreddit, which refers to the hot sorting order and is just .json appended to the end of the normal URL. Supporting all the other sorting options and sub-options shouldn't be too difficult if they follow the same "put .json at the end" rule.
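If they do, the other listings should be reachable the same way, e.g. (a sketch; the t parameter selects top's time range: hour, day, week, month, year, all):

$ curl -s -A "gallery-dl-test/1.0" "https://www.reddit.com/r/<subreddit>/new/.json"
$ curl -s -A "gallery-dl-test/1.0" "https://www.reddit.com/r/<subreddit>/top/.json?t=all"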

Also: recursion is implemented (99b7213) and URLs that gallery-dl can't deal with can be written to a file via --write-unsupported <filename> (25bcdc8)

mikf added a commit that referenced this issue May 29, 2017
@Bfgeshka
Author

Bfgeshka commented Jun 2, 2017

Additional request: what about an option for grabbing only posts that are older/newer than a certain provided date or post? Actually, I think it could be provided for some other modules too.

This feature can significantly decrease useless work in some cases, especially for reddit.

I'm not creating a new issue because it is a case-by-case feature and depends on the API.

======

I found out that grabbed subreddits aren't complete, so I tried to dig into the code.

And, according to this, you don't need to limit the number of returned comments (it is 100 now). Without this limit gallery-dl gathers more links in long threads, so that works fine for me.

mikf added a commit that referenced this issue Jun 3, 2017
- Added the 'extractor.reddit.date-min' and '….date-max'
  config options. These values should be UTC timestamps.
- All submissions not posted in date-min <= T <= date-max
  will be ignored.

- Fixed the limit parameter for submission comments by setting
  it to its apparent max value (500).
@mikf
Owner

mikf commented Jun 3, 2017

Filtering of posts by creation-time implemented. It's not pretty, as it requires raw timestamps, but I could always extend this by adding a parser for human-readable dates.
As an example: gallery-dl reddit.com/r/<subreddit> -o date-min=1476000000 gets you all posts between "1476000000" and now.
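For reference, a raw UTC timestamp can be produced with standard tools, e.g. GNU date (BSD/macOS date needs different flags):

$ date -u -d "2016-10-09" +%s
1475971200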

Getting all posts newer/older than one specific one should be possible via the after and before parameters mentioned in the reddit API docs, so that is what I'm going to be working on next.

Found out that grabbed subreddits aren't complete

It should get all the posts/submissions; the issue here is the comments.
You are correct that I shouldn't have limited the number of returned comments to 100; I had applied the same limitation of a subreddit listing to a comment listing.
I experimented a bit with this, and it turns out that the default limit for comments is 200 and the maximum is 500. One could load even more via the morechildren API method, if that is something you want.

@Hrxn
Contributor

Hrxn commented Jun 5, 2017

Yeah, the comment threads on reddit are a story of their own. FYI, it works the same way in a browser: each comment thread is by default (i.e. on initial display/loading of the page) limited to 200 comments. If you use reddit with an account, it's still the same, although you can increase the default limit in your account settings. Probably 500 as the max as well; I somehow had 800 in the back of my head, but this could easily have changed in the last couple of years, I'm not sure.
The ordering and sorting options for subreddits are more or less the same for every single comment thread. So far the API behaviour always seems to match the usual experience in a browser, so I'd guess it's the same here, and the user will get the default sorting by top/best. (Which, by the way, isn't sorted by the absolute number or sum of votes; reddit instead uses its own special little algorithm for this and takes stuff like the upvote/downvote ratio into account, etc.)

But yeah, maybe @Bfgeshka could explain a bit what the plan was for comment threads, or maybe give an example or something.

A comment thread with a somewhat popular submission on one of the bigger subreddits easily has 2000-4000 comments, sometimes even more. But depending on the type, these are mostly text-only comments, the usual reddit folklore like puns and reply chains etc.

@Bfgeshka
Author

Bfgeshka commented Jun 5, 2017

Well, yes. There's still a way to grab more comments.

The /api/morechildren method gives the ability to load additional comments by their IDs, and the /comments/article method can accept an optional value in the comment field, so it can focus tree rendering on a certain comment branch.

The descriptions are vague, so this needs to be checked first.
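
An untested sketch of what such a call might look like, going by the API docs (link_id is the submission's fullname, children a comma-separated list of comment IDs; the placeholders are hypothetical):

$ curl -s -A "gallery-dl-test/1.0" \
    "https://www.reddit.com/api/morechildren?api_type=json&link_id=t3_<id>&children=<id1>,<id2>"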

@Bfgeshka
Author

Bfgeshka commented Jun 5, 2017

So, it appears that reddit requires you to be logged in (Exception: Unauthorized) for this limit of 500. It should now be dropped by default unless the user is logged in (which is not implemented yet, I guess).

@mikf
Owner

mikf commented Jun 5, 2017

The comment-limit should have nothing to do with whether or not you are authorized to view certain subreddits/submissions, some are just private. fbfc8d0 silently ignores these and just skips them (which I should have done to begin with).
I'm going to implement user authentication via OAuth2, which should allow gallery-dl to issue requests on your account's behalf and is basically the same as being logged in, but that requires some prep-work.

Also, if you want to experiment a bit: you can now set the comment limit via the 'extractor.reddit.comments' value. The default value is now 200 (previously 500), but you can test for yourself if a higher (or lower) value helps.
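For quick experiments it can presumably also be set on the command line like the other options:

$ gallery-dl -o comments=300 https://www.reddit.com/r/<subreddit>/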

mikf added a commit that referenced this issue Jun 8, 2017
Call '$ gallery-dl oauth:reddit' to get a refresh_token
for your account.
@mikf
Owner

mikf commented Jun 8, 2017

User authentication is implemented. Please test this and tell me if everything works and the process is understandable enough.

The main part is getting a refresh_token, which then gets used to get an access_token to make API calls on your account's behalf to access, for example, private subreddits.
You start by calling gallery-dl with oauth:reddit as argument:

$ gallery-dl oauth:reddit
Waiting for response. (Cancel with Ctrl+c)

This opens a new browser tab on reddit, asking whether you are OK with granting gallery-dl permission to make requests on your behalf. Click 'Accept' and you should see a message telling you your Refresh Token. Add this to your config file and the reddit extractors will use it when accessing reddit.
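Once you have the token, the extractors would presumably pick it up like any other extractor.reddit option; the key name used here (refresh-token) is an assumption, so check the commit for the exact spelling:

# hypothetical key name; the config-file entry works the same way
$ gallery-dl -o refresh-token=<your_refresh_token> https://www.reddit.com/r/<private_subreddit>/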

There are two issues with this:

  • Rate limits. If you are using the refresh_token, your requests to the reddit API are being rate limited at 600 requests every 10 minutes/600 seconds. This might be enough when you are downloading all the images you find, but it really isn't when you are using the -g option. In my tests I usually needed only 200-300 seconds to make all 600 requests.
  • gallery-dl needs to listen on port 6414 during the refresh_token retrieval, so there is a problem if some other process is using this port. This is probably never an issue, but you never know.

@Bfgeshka
Author

Bfgeshka commented Jun 8, 2017

This opens a new browser-tab on reddit, asking you if you are ok with granting permission to gallery-dl to make requests on your behalf.

It does not, really. It doesn't work for me; the browser just won't open. I thought it used environment variables of some sort, so I looked into the commit, but I only found an invocation of webbrowser.open(url) and I have no idea how it is supposed to figure out which browser to use.

How about also printing the URL in the terminal, just in case?

@mikf
Owner

mikf commented Jun 8, 2017

OK, try again (3ee77a0).

I've decided to use webbrowser.open, instead of printing the URL and asking the user to visit it, because it is a nightmare to select and copy text from a terminal on Windows.
Also the documentation says this:

If the environment variable BROWSER exists, it is interpreted as the os.pathsep-separated list of browsers to try ahead of the platform defaults

@Bfgeshka
Author

Bfgeshka commented Jun 8, 2017

Well, with the link it went much better; I've got my token and I'm going to check it out.

Also the documentation says this

It is odd, because this variable is definitely set. $BROWSER is available and valid, and I can invoke echo $BROWSER successfully in the same shell I'm running gallery-dl from.

mikf added a commit that referenced this issue Jun 13, 2017
The 'extractor.reddit.morecomments' option enables the use of
the '/api/morechildren' API endpoint (1) to load even more
comments than the usual submission-request provides.
Possible values are the booleans 'true' and 'false' (default).

Note: this feature comes at the cost of 1 extra API call towards
the rate limit for every 100 extra comments.

(1) https://www.reddit.com/dev/api/#GET_api_morechildren
@mikf
Owner

mikf commented Jul 8, 2017

Are there any more unresolved problems that haven't been addressed yet or features that don't work as intended? Otherwise I kind of want to close this.
(Site-support requests for stuff that gets linked to from reddit should go into another issue)

Just a feature recap:

  • subreddit and submission extractors that scan posts and comments for URLs and defer them to other extractors
  • optional recursion into linked submissions ('extractor.reddit.recursion')
  • date filtering via 'extractor.reddit.date-min' / 'date-max'
  • comment limit ('extractor.reddit.comments') and the 'morecomments' option
  • unsupported URLs written to a file via --write-unsupported
  • user authentication via OAuth2 ($ gallery-dl oauth:reddit)

@Bfgeshka
Author

Bfgeshka commented Jul 8, 2017

Works well so far.

@mikf mikf closed this as completed Jul 8, 2017