Reddit images+links (recursive) dump #15

Closed
Bfgeshka opened this issue May 20, 2017 · 22 comments

@Bfgeshka

What is your opinion on this, and can you do it? I'm not requesting it directly, just asking, since it can be tricky.

Plenty of subreddits are used as galleries. The comment section is no less important, so it deserves grabbing too.

The key point is that other modules of gallery-dl could be used for link processing.

@mikf
Owner

mikf commented May 20, 2017

I've looked a bit into the reddit API and

  • fetching all threads/posts of a subreddit
  • getting their respective comments
  • scanning these for external links/images

seems quite doable, so I will be working on implementing something like this.
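A rough sketch of the first step, just to illustrate the idea (reddit serves any listing as JSON when .json is appended to the URL; a descriptive User-Agent is needed because reddit throttles the default ones, and r/pics is only an example subreddit):

# fetch the first page of a subreddit listing and pull out the submission URLs
$ curl -s -A "gallery-dl-test/1.0" "https://www.reddit.com/r/pics/.json?limit=25" \
    | grep -o '"url": *"[^"]*"' | cut -d '"' -f 4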

If this isn't what you had in mind, then please correct me and/or provide a concrete example of what you want. Also, what exactly do you mean by a recursive dump?

@Bfgeshka
Author

Bfgeshka commented May 20, 2017

fetching all threads/posts of a subreddit
getting their respective comments
scanning these for external links/images

Yes, that's about what I have in mind.

Also what exactly do you mean by recursive dump?

I mean something like this: if a comment or post includes links to other reddit posts (not subreddits, that would be an endless mess), even in a different subreddit, those can be parsed as well. As optional behavior, of course.

@Hrxn
Contributor

Hrxn commented May 21, 2017

And links to other reddit posts inside those linked reddit posts? 😉

What I'm trying to say is that this alone would already end up being an endless mess, even without going recursively from subreddit to subreddit.

So be careful with that. And in some way, the depth must be limited when doing anything recursively.
So I strongly suggest starting simple and straightforward instead.

But in general, I think supporting reddit is a really good idea.

I'm not really a heavy user of reddit, but I know my way around the site, so just @-mention me here and I'll help with testing and stuff.

There are fundamentally only two types of posts that can be submitted to reddit:

  • Text only (which is probably not interesting at all for our case here)
  • Link/URL referring to an external resource.

But the comment thread that is part of a text-only submission can obviously include links to external resources such as direct links to images, etc. This also applies to the other type of post, of course.

In this second case, the vast majority of subreddits dealing with images feature something like this:

  • The link post is a direct link to an image hosted on Imgur.
  • The picture is hosted by reddit (Domain: i.redd.it). Not directly linked from the subreddit itself unfortunately.
  • The link is to an album on Imgur (http(s)://imgur.com/a/<ID>, or imgur.com/gallery/<ID>), the ID being a 5-character alphanumeric code.
  • Idiots linking to the Imgur page of a single image when they could use the direct link instead.

Many subreddits restrict hosting to only these domains, so that would be the easy part.
But other subreddits, other rules, so this doesn't apply everywhere.

The other easy variant is a post with a direct link to an image as the URL, hosted god knows where. But this is just a simple HTTP GET away, not really a big deal.
Sometimes, this is a direct link to a file hosted on Flickr. (xy.staticflickr.com/?????....) but sometimes it is the photo page on Flickr.
Also very popular on reddit: GIFs (huge surprise). But the real .gif image file format is on a steady decline, vastly outnumbered by "GIFV" on Imgur and links to Gfycat. In these cases, what we get is actually an MP4 file (or WebM, which is usually also available).
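To make these variants concrete, here is a hedged sketch of how the cases could be told apart with grep, assuming the candidate URLs sit one per line in a hypothetical urls.txt:

$ grep -E 'https?://i\.(imgur\.com|redd\.it)/' urls.txt            # direct images on Imgur / i.redd.it
$ grep -E 'https?://imgur\.com/(a|gallery)/[A-Za-z0-9]+' urls.txt  # Imgur albums and galleries
$ grep -E '\.(jpe?g|png|gifv?|webm|mp4)([?#]|$)' urls.txt          # other direct file links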

My point is that supporting all the different variants of posts on reddit is a pretty big deal. So maybe we should use this opportunity to reach a consensus on what would be a viable start, and a constructive way forward from there. 😄

A link with more than enough subreddits of images for the interested:
https://www.reddit.com/r/sfwpornnetwork/wiki/network

mikf added a commit that referenced this issue May 23, 2017
- these extractors scan submissions and their comments for
  (external) URLs and defer them to other extractors
- (#15)
mikf added a commit that referenced this issue May 23, 2017
- filter or complete some URLs
- remove the 'nofollow:' scheme before printing URLs
- (#15)
@Hrxn
Contributor

Hrxn commented May 24, 2017

So, what is supported so far? Direct external links to images and i.redd.it, as well as URLs gallery-dl already recognizes?

@mikf
Owner

mikf commented May 24, 2017

First of all: thank you for that huge amount of information you provided. It has been quite helpful.

So far I've only been working on those three points I posted above, which should be working by now. The next thing to implement would be recursion and all the different modules for gfycat, flickr and so on.

The only URLs that are supported right now are the ones that are already recognized, i.e. imgur albums. I thought I had a solution for direct links, which should be easy, but it affects the recursive extractor and makes it quite a bit less useful.

It is currently not possible to just enable/disable extractors on the fly, which would solve this issue, so that is another thing to work on.

edit: Direct links are now supported as well.

@Bfgeshka
Author

How are you going to handle generic links (articles, sites and so on)? They shouldn't be left behind either. I suggest optionally saving those links to a text file.

@mikf
Owner

mikf commented May 24, 2017

I hadn't originally planned to do anything special with unsupported links. They are (obviously) ignored when downloading, but you can still get them by using the -g option.

The issue here is that there is currently no distinction between supported and unsupported URLs when listing them with -g. I could either implement a filter to print supported/unsupported ones only, or prefix unsupported URLs with a '#'. Saving them to a file could then be done by

# filter between supported/unsupported
$ gallery-dl -g --unsupported [URL] > file

# prefix with '#'
$ gallery-dl -g [URL] | grep "^#" | cut -d "#" -f 2- > file

The second variant seems a bit too complex and wouldn't really be possible for Windows users either, so it is either the first one or, as you suggested, an option to simply write them to a file. I'm not quite sure which of the two I prefer.
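For what it's worth, the second variant could be made cross-platform by letting Python do the filtering instead of grep/cut (a sketch, assuming the hypothetical '#'-prefix output format):

$ gallery-dl -g [URL] | python -c "import sys; sys.stdout.writelines(l[1:] for l in sys.stdin if l.startswith('#'))" > file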

@Bfgeshka
Author

Bfgeshka commented May 25, 2017

$ gallery-dl -g [URL] | grep "^#" | cut -d "#" -f 2- > file

Too complicated, yeah.

$ gallery-dl -g --unsupported [URL] > file

This one is pretty good, actually.

Well, as long as all links get listed one way or another, it's fine.

mikf added a commit that referenced this issue May 26, 2017
reddit extractors now recursively visit other submissions/posts
linked to in the initial set of submissions.
This behaviour can be configured via the 'extractor.reddit.recursion'
key in the configuration file or by `-o recursion=<value>`.

Example:
{"extractor": {
  "reddit": {
   "recursion": <value>
}}}

Possible values:
* -1 - infinite recursion (don't do this)
*  0 - recursion is disabled (default)
*  1 and higher - maximum recursion level
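
For example, following links to other submissions one level deep:

$ gallery-dl -o recursion=1 https://www.reddit.com/r/<subreddit>/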
@Hrxn
Contributor

Hrxn commented May 26, 2017

1.
For subreddits like https://www.reddit.com/r/gifs/, usage seems to be something like 90% Imgur, 10% Gfycat and 1% redd.it.

Imgur extraction seems to work here, getting real .gif image files. Nothing wrong with that, but as you can see from the usage on reddit itself, most if not all "gifs" are using GIFV on Imgur, the URLs being like this: http://i.imgur.com/0gybAXR.gifv

And by the way, I just noticed this from directlink.py:
pattern = [r"https?://[^?&#]+\.(?:jpe?g|png|gifv?|webm|mp4)"]

Matching gifv won't really be of any use, I think. Because, as far as I know, it's not a real file format and isn't used anywhere. What you actually get is this: https://gist.github.com/anonymous/3d6330be06578a6ab2bb31074b8df321
This can be seen by fetching the URL (http://i.imgur.com/0gybAXR.gifv) with curl, for example.

HTML, with a bit of JavaScript and an embedded 'video' element. The link to the 'real' file (MP4) can be seen there. (Or, in the browser: Right click > Copy video address).

So far, I know of Imgur, but I've also seen that reddit itself uses this "trick" for GIFs hosted on its i.redd.it domain.

The thing is, there is not just a difference in visual quality (MP4 being superior to GIF with its palette limitation), but there is also a huge difference in the resulting file sizes. So, maybe it would be a good idea to optionally get MP4s instead of 'real' GIFs from Imgur.
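
To illustrate (assuming the GIFV page still embeds the MP4 path in its video 'source' tag; in practice the direct file usually lives at the same path with the extension swapped):

# pull the MP4 URL out of the GIFV page's HTML
$ curl -sL http://i.imgur.com/0gybAXR.gifv | grep -o 'i\.imgur\.com/[^"]*\.mp4' | head -n 1

# or just swap the extension and download directly
$ curl -sLO http://i.imgur.com/0gybAXR.mp4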

2.
Not sure what gallery-dl is doing right now when running against a subreddit; I assume it uses whatever the reddit API returns as its predefined default. If possible, adding support for the different sorting options on reddit might also be a good idea, i.e. hot, new, rising, controversial, top, gilded, especially top with its different sub-options (all time, past hour, past 24 hours, etc.).

mikf added a commit that referenced this issue May 27, 2017
@mikf
Owner

mikf commented May 27, 2017

I kind of forgot about imgur's use of gifv when rewriting the imgur extractor, but it should be fixed with bf452a8. Imgur thankfully provides the prefer_video flag in its metadata, which tells you if something advertised as GIF or GIFV also has an MP4 version. gallery-dl currently prefers MP4 over GIF whenever possible, but I might add an option for that.
I've also removed the gifv extension from the direct-link regex.

Regarding the reddit API: I'm using https://reddit.com/r/<subreddit>/.json to get a listing of submissions in a specific subreddit, which refers to the hot sorting order and is just .json appended to the end of the normal URL. Supporting all the other sorting options and sub-options shouldn't be too difficult if they follow the same "put .json at the end" rule.
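If they do, the other listings should be reachable the same way, e.g. (a sketch; the t parameter selects top's time range: hour, day, week, month, year, all):

$ curl -s -A "gallery-dl-test/1.0" "https://www.reddit.com/r/<subreddit>/new/.json"
$ curl -s -A "gallery-dl-test/1.0" "https://www.reddit.com/r/<subreddit>/top/.json?t=all"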

Also: recursion is implemented (99b7213) and URLs that gallery-dl can't deal with can be written to a file via --write-unsupported <filename> (25bcdc8)

mikf added a commit that referenced this issue May 29, 2017
@Bfgeshka
Author

Bfgeshka commented Jun 2, 2017

Additional request: what about an option for grabbing only posts that are older/newer than a certain provided date or post? Actually, I think it could be provided for some other modules too.

This feature can significantly decrease useless work in some cases, especially for reddit.

I'm not creating a new issue because it is a case-by-case feature and depends on the API.

======

I found out that grabbed subreddits aren't complete, so I tried to dig into the code.

And, according to this, you don't need to limit the number of returned comments (it is 100 now). Without this limit gallery-dl gathers more links in long threads, so that works fine for me.

mikf added a commit that referenced this issue Jun 3, 2017
- Added the 'extractor.reddit.date-min' and '….date-max'
  config options. These values should be UTC timestamps.
- All submissions not posted in date-min <= T <= date-max
  will be ignored.

- Fixed the limit parameter for submission comments by setting
  it to its apparent max value (500).
@mikf
Owner

mikf commented Jun 3, 2017

Filtering of posts by creation-time implemented. It's not pretty, as it requires raw timestamps, but I could always extend this by adding a parser for human-readable dates.
As an example: gallery-dl reddit.com/r/<subreddit> -o date-min=1476000000 gets you all posts between "1476000000" and now.
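For reference, a raw UTC timestamp can be produced with standard tools, e.g. GNU date (BSD/macOS date needs different flags):

$ date -u -d "2016-10-09" +%s
1475971200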

Getting all posts newer/older than one specific one should be possible via the after and before parameters mentioned in the reddit API docs, so that is what I'm going to be working on next.

Found out that grabbed subreddits aren't complete

It should get all the posts/submissions; the issue here is the comments.
You are correct that I shouldn't have limited the number of returned comments to 100; I had applied the same limitation of a subreddit listing to a comment listing.
I experimented a bit with this, and it turns out that the default limit for comments is 200 and the maximum is 500. One could load even more via the morechildren API method, if that is something you want.

@Hrxn
Contributor

Hrxn commented Jun 5, 2017

Yeah, the comment threads on reddit are a story of their own. FYI, it works the same way in a browser: each comment thread is by default (i.e. on initial display/loading of the page) limited to 200 comments. If you use reddit with an account, it's still the same, although you can increase the default limit in your account settings. Probably 500 as the max as well; I somehow had 800 in the back of my head, but this could easily have changed in the last couple of years, I'm not sure.
The ordering and sorting options for subreddits are more or less the same for every single comment thread. So far the API behaviour always seems to match the usual experience in a browser, so I'd guess it's the same here, and the user will get the default sorting by top/best. (Which, by the way, isn't sorted by the absolute number or sum of votes; reddit instead uses its own special little algorithm for this and takes stuff like the upvote/downvote ratio into account, etc.)

But yeah, maybe @Bfgeshka could explain a bit what the plan was for comment threads, or maybe give an example or something.

A comment thread with a somewhat popular submission on one of the bigger subreddits easily has 2000-4000 comments, sometimes even more. But depending on the type, these are mostly text-only comments, the usual reddit folklore like puns and reply chains etc.

@Bfgeshka
Author

Bfgeshka commented Jun 5, 2017

Well, yes. There's still a way to grab more comments.

The /api/morechildren method gives the ability to load additional comments by their IDs, and the /comments/article method can accept an optional value in the comment field, so it can focus tree rendering on a certain comment branch.

The descriptions are vague, so this needs to be checked first.
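
An untested sketch of what such a call might look like, going by the API docs (link_id is the submission's fullname, children a comma-separated list of comment IDs; the placeholders are hypothetical):

$ curl -s -A "gallery-dl-test/1.0" \
    "https://www.reddit.com/api/morechildren?api_type=json&link_id=t3_<id>&children=<id1>,<id2>"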

@Bfgeshka
Author

Bfgeshka commented Jun 5, 2017

So, it appears that reddit requires you to be logged in (Exception: Unauthorized) for this limit of 500. It should now be dropped by default unless the user is logged in (which is not implemented yet, I guess).

@mikf
Owner

mikf commented Jun 5, 2017

The comment-limit should have nothing to do with whether or not you are authorized to view certain subreddits/submissions, some are just private. fbfc8d0 silently ignores these and just skips them (which I should have done to begin with).
I'm going to implement user authentication via OAuth2, which should allow gallery-dl to issue requests on your account's behalf and is basically the same as being logged in, but that requires some prep-work.

Also, if you want to experiment a bit: you can now set the comment limit via the 'extractor.reddit.comments' value. The default value is now 200 (previously 500), but you can test for yourself if a higher (or lower) value helps.
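For quick experiments it can presumably also be set on the command line like the other options:

$ gallery-dl -o comments=300 https://www.reddit.com/r/<subreddit>/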

mikf added a commit that referenced this issue Jun 8, 2017
Call '$ gallery-dl oauth:reddit' to get a refresh_token
for your account.
@mikf
Owner

mikf commented Jun 8, 2017

User authentication is implemented. Please test this and tell me if everything works and the process is understandable enough.

The main part is getting a refresh_token, which then gets used to get an access_token to make API calls on your account's behalf to access, for example, private subreddits.
You start by calling gallery-dl with oauth:reddit as argument:

$ gallery-dl oauth:reddit
Waiting for response. (Cancel with Ctrl+c)

This opens a new browser tab on reddit, asking whether you are OK with granting gallery-dl permission to make requests on your behalf. Click 'Accept' and you should see a message telling you your Refresh Token. Add this to your config file and the reddit extractors will use it when accessing reddit.
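Once you have the token, the extractors would presumably pick it up like any other extractor.reddit option; the key name used here (refresh-token) is an assumption, so check the commit for the exact spelling:

# hypothetical key name; the config-file entry works the same way
$ gallery-dl -o refresh-token=<your_refresh_token> https://www.reddit.com/r/<private_subreddit>/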

There are two issues with this:

  • Rate limits. If you are using the refresh_token, your requests to the reddit API are being rate limited at 600 requests every 10 minutes/600 seconds. This might be enough when you are downloading all the images you find, but it really isn't when you are using the -g option. In my tests I usually needed only 200-300 seconds to make all 600 requests.
  • gallery-dl needs to listen on port 6414 during the refresh_token retrieval, so there is a problem if some other process is using this port. This is probably never an issue, but you never know.

@Bfgeshka
Author

Bfgeshka commented Jun 8, 2017

This opens a new browser-tab on reddit, asking you if you are ok with granting permission to gallery-dl to make requests on your behalf.

It does not, really. It doesn't work for me; the browser just won't open. I thought it used environment variables of some sort, so I looked into the commit, but I only found an invocation of webbrowser.open(url) and I have no idea how it is supposed to figure out which browser to use.

How about also printing the URL in the terminal, just in case?

@mikf
Owner

mikf commented Jun 8, 2017

OK, try again (3ee77a0).

I've decided to use webbrowser.open, instead of printing the URL and asking the user to visit it, because it is a nightmare to select and copy text from a terminal on Windows.
Also the documentation says this:

If the environment variable BROWSER exists, it is interpreted as the os.pathsep-separated list of browsers to try ahead of the platform defaults

@Bfgeshka
Author

Bfgeshka commented Jun 8, 2017

Well, with the link it went much better; I've got my token and I'm going to check it out.

Also the documentation says this

It is odd, because this variable is definitely set. $BROWSER is available and valid, and I can invoke echo $BROWSER successfully in the same shell I'm running gallery-dl from.

mikf added a commit that referenced this issue Jun 13, 2017
The 'extractor.reddit.morecomments' option enables the use of
the '/api/morechildren' API endpoint (1) to load even more
comments than the usual submission-request provides.
Possible values are the booleans 'true' and 'false' (default).

Note: this feature comes at the cost of 1 extra API call towards
the rate limit for every 100 extra comments.

(1) https://www.reddit.com/dev/api/#GET_api_morechildren
@mikf
Owner

mikf commented Jul 8, 2017

Are there any more unresolved problems that haven't been addressed yet or features that don't work as intended? Otherwise I kind of want to close this.
(Site-support requests for stuff that gets linked to from reddit should go into another issue)

Just a feature recap:

  • subreddit and submission extractors that scan posts and comments for URLs and defer them to other extractors
  • optional recursion into linked submissions ('extractor.reddit.recursion')
  • date filtering via 'extractor.reddit.date-min' / 'date-max'
  • comment limit ('extractor.reddit.comments') and the 'morecomments' option
  • unsupported URLs written to a file via --write-unsupported
  • user authentication via OAuth2 ($ gallery-dl oauth:reddit)

@Bfgeshka
Author

Bfgeshka commented Jul 8, 2017

Works well so far.

@mikf mikf closed this as completed Jul 8, 2017