Reddit images+links (recursive) dump #15
Comments
I've looked a bit into the reddit API and it seems quite doable, so I will be working on implementing something like this. If this isn't what you had in mind, please correct me and/or provide a concrete example of what you want. Also, what exactly do you mean by a recursive dump? |
Yes, that's about what I have in mind.
I do mean something like: if a comment or post includes links to other reddit posts in a different subreddit (not to subreddits themselves, that would be an endless mess), they can be parsed as well. As optional behavior, of course. |
And links to other reddit posts inside those linked posts? 😉 What I'm trying to say is that this alone would already end up being an endless mess, even without going recursively from subreddit to subreddit. So be careful with that, and in some way the depth must be limited when doing anything recursively. But in general, I think supporting reddit is a really good idea. I'm not a heavy user of reddit, but I know my way around the site, so just @-mention me here and I'll help with testing and such. There are fundamentally only two types of posts that can be submitted to reddit:
But the comment thread that is part of a text-only submission can obviously include links to external resources, such as direct links to images. This also applies to the other type of post, of course. In this second case, the vast majority of subreddits dealing with images feature something like this:
Many subreddits restrict hosting to only these domains, so that would be the easy part. The other easy variant is a post whose URL is a direct link to an image, hosted who knows where. That is just a simple HTTP GET away, not really a big deal. My point is that supporting all the different variants of posts on reddit is a pretty big deal. So maybe we should use this space and opportunity, here and now, to try to reach a consensus on what would be a viable start, and a constructive way forward from there. 😄 A link with more than enough image subreddits for the interested: |
- these extractors scan submissions and their comments for (external) URLs and defer them to other extractors - (#15)
- filter or complete some URLs - remove the 'nofollow:' scheme before printing URLs - (#15)
So, what is supported so far? Direct external links to images and |
First of all: thank you for the huge amount of information you provided; it has been quite helpful. So far I've only been working on the three points I posted above, which should be working by now. The next thing to implement would be recursion and all the different modules for gfycat, flickr, and so on. The only URLs supported by now are the ones that are already recognized, i.e. imgur albums. I thought I had a solution for direct links, which should be easy, but that affects the
edit: Direct links are now supported as well. |
How are you going to handle generic links (articles, sites and so on)? They shouldn't be left behind as well. I do suggest optional saving of links in text file. |
I hadn't originally planned to do anything special with unsupported links. They are (obviously) ignored when downloading, but you can still get them when listing URLs. The issue here is that there is currently no distinction between supported and unsupported URLs when listing them:

```shell
# filter between supported/unsupported
$ gallery-dl -g --unsupported [URL] > file

# prefix with '#'
$ gallery-dl -g [URL] | grep "^#" | cut -d "#" -f 2- > file
```

The second variant seems a bit too complex and wouldn't really be possible for Windows users either, so it is either the first one or, as you suggested, an option to simply write them to a file. I'm not quite sure which of the two I prefer. |
Too complicated, yeah.
This one is pretty good, actually. Well, as long as all links get listed one way or another, it's fine. |
reddit extractors now recursively visit other submissions/posts linked to in the initial set of submissions. This behaviour can be configured via the 'extractor.reddit.recursion' key in the configuration file or by `-o recursion=<value>`.

Example: `{"extractor": {"reddit": {"recursion": <value>}}}`

Possible values:
* `-1` - infinite recursion (don't do this)
* `0` - recursion is disabled (default)
* `1` and higher - maximum recursion level
1. Imgur extraction seems to work here, getting real And by the way, I just noticed this in directlink.py: matching gifv won't really be of any use, I think, because, as far as I know, it's not a real file format and isn't used anywhere. What you actually get is this: https://gist.github.com/anonymous/3d6330be06578a6ab2bb31074b8df321 HTML, with a bit of JavaScript and an embedded 'video' element. The link to the 'real' file (MP4) can be seen there (or, in the browser: right click > Copy video address). So far I know of Imgur, but I've also seen that reddit itself uses this "trick" for GIFs hosted on its The thing is, there is not just a difference in visual quality (MP4 being superior to GIF with its palette limitation), there is also a huge difference in the resulting file sizes. So maybe it would be a good idea to optionally get MP4s instead of 'real' GIFs from Imgur. 2. |
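One possible approach, sketched here as an assumption rather than as what gallery-dl actually does: since Imgur's .gifv pages are just HTML wrappers around an MP4 file on the same host, the direct MP4 URL can often be derived by rewriting the extension. The function name and URL scheme below are hypothetical.

```python
def gifv_to_mp4(url):
    """Rewrite an i.imgur.com '.gifv' URL to the underlying '.mp4' file.

    Assumes Imgur serves the MP4 under the same path with only the
    extension changed, matching the embedded <video> element seen in
    the gist linked above.
    """
    if url.endswith(".gifv"):
        return url[: -len(".gifv")] + ".mp4"
    return url

# Example:
print(gifv_to_mp4("https://i.imgur.com/abc123.gifv"))
# https://i.imgur.com/abc123.mp4
```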
I kind of forgot about imgur's use of gifv when rewriting the imgur extractor, but it should be fixed with bf452a8. Imgur thankfully provides the Regarding the reddit API: I'm using Also: recursion is implemented (99b7213), and URLs that gallery-dl can't deal with can be written to a file via |
Example: https://www.reddit.com/r/<subreddit>/top/?sort=top&t=month (the 'sort=top' parameter is irrelevant and can be omitted)
Additional request: what about support for an option to grab only posts that are older/newer than a certain provided date or post? I think it could actually be provided for some other modules too. This feature can significantly decrease useless work in some cases, especially for reddit. Not creating a new issue because it is a case-by-case feature and depends on the API.

======

Found out that grabbed subreddits aren't complete, so I tried to dig into the code. And, according to this, you have no need to limit the number of returned comments (now it is 100). Without this limit gallery-dl gathers more links in long threads, so it works fine for me. |
- Added the 'extractor.reddit.date-min' and '….date-max' config options. These values should be UTC timestamps.
- All submissions not posted within date-min <= T <= date-max will be ignored.
- Fixed the limit parameter for submission comments by setting it to its apparent max value (500).
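For illustration, a config fragment using these options might look like the following; the two timestamps are example values (2017-01-01 and 2017-06-01 UTC), not defaults.

```json
{
    "extractor": {
        "reddit": {
            "date-min": 1483228800,
            "date-max": 1496275200
        }
    }
}
```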
Filtering of posts by creation-time implemented. It's not pretty, as it requires raw timestamps, but I could always extend this by adding a parser for human-readable dates. Getting all posts newer/older than one specific one should be possible via the
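As a sketch of such a parser (a hypothetical helper, not part of gallery-dl), a human-readable UTC date can be turned into the raw timestamp these options expect using only the standard library:

```python
from datetime import datetime, timezone

def to_timestamp(date_string):
    """Parse a 'YYYY-MM-DD' date as UTC and return a raw Unix timestamp."""
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_timestamp("2017-06-01"))  # 1496275200
```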
It should get all the posts/submissions; the issue here is the comments. |
Yeah, the comment threads on reddit, a story of their own. FYI, it works the same way in the browser: each comment thread is by default (i.e. on the initial display/loading of the page) limited to 200 comments. If you use reddit with an account, it's still the same, although you can increase the default limit in your account settings. Probably 500 as the max as well; I somehow had 800 in the back of my head, but that could easily have changed in the last couple of years, I'm not sure. But yeah, maybe @Bfgeshka could explain a bit what the plan was for comment threads, or maybe give an example or something. A comment thread of a somewhat popular submission on one of the bigger subreddits easily has 2000-4000 comments, sometimes even more. But depending on the type, these are mostly text-only comments, the usual reddit folklore like puns and reply chains etc. |
Well, yes. There's still a way to grab more comments. The /api/morechildren method provides the ability to load additional comments by their IDs, and the /comments/article method can accept an optional value in the field The descriptions are vague, so it needs to be checked first. |
So, it appears that reddit requires you to be logged in (
The comment limit should have nothing to do with whether or not you are authorized to view certain subreddits/submissions; some are just private. fbfc8d0 silently ignores these and just skips them (which I should have done to begin with). Also, if you want to experiment a bit: you can now set the comment limit via the 'extractor.reddit.comments' value. The default value is |
Call '$ gallery-dl oauth:reddit' to get a refresh_token for your account.
User authentication is implemented. Please test this and tell me if everything works and whether the process is understandable enough. The main part is getting a
This opens a new browser tab on reddit, asking you if you are OK with granting gallery-dl permission to make requests on your behalf. Click 'Accept' and you should see a message telling you your Refresh Token. Add this to your config file and the reddit extractors will use it when accessing reddit. There are two issues with this:
|
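For reference, the refresh token would be placed in the configuration file roughly like this; the 'refresh-token' key name and the placeholder value are assumptions for illustration.

```json
{
    "extractor": {
        "reddit": {
            "refresh-token": "<your refresh token>"
        }
    }
}
```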
It is not, really. It does not work for me; the browser just won't open. I thought it used environment variables of some sort, so I've looked into the commit, but I've found only an invocation of How about printing the URL in the terminal, just in case? |
OK, try again (3ee77a0). I've decided to use webbrowser.open, instead of printing the URL and asking the user to visit it, because it is a nightmare to select and copy text from a terminal on Windows.
|
Well, with the link it went much better; I've got my token and I'm going to check it out.
It is odd, because this variable is definitely set. $BROWSER is available and valid, so I can invoke
The 'extractor.reddit.morecomments' option enables the use of the '/api/morechildren' API endpoint (1) to load even more comments than the usual submission request provides. Possible values are the booleans 'true' and 'false' (default).

Note: this feature comes at the cost of one extra API call towards the rate limit for every 100 extra comments.

(1) https://www.reddit.com/dev/api/#GET_api_morechildren
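In config-file form, enabling this option would look something like the following (an example fragment only, using the key named above):

```json
{
    "extractor": {
        "reddit": {
            "morecomments": true
        }
    }
}
```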
Are there any more unresolved problems that haven't been addressed, or features that don't work as intended? Otherwise I'd kind of like to close this. Just a feature recap: |
Works well so far |
What is your opinion on this, can you do it? Not requesting it directly, just asking because it can be tricky.
Plenty of subreddits are used as galleries. The comment section is no less important, so it deserves grabbing too.
The key point is that other modules of gallery-dl can be used for link processing.