Reddit Feed Fetcher

Consume Reddit Feed and Download Imgur Albums

Motive

I wanted a script that could watch my saved Reddit posts and, whenever a saved post links to an imgur album, download and store that album for future use or backup.

Reddit provides personal RSS feeds, including an RSS/Atom feed of my saved posts. By polling this endpoint periodically, I can pick up newly saved posts and run some rudimentary checks before downloading an album (e.g. whether the post links to imgur, whether it comes from a particular sub). From there, I can parse out the information needed to call the imgur API and fetch the raw image links.
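As a rough illustration of the "is this an imgur album" check, here is a minimal sketch. The function name extractImgurAlbumId is hypothetical and not taken from the project source, and the sketch assumes album links look like https://imgur.com/a/<id>; other imgur link shapes (direct images, galleries) are simply rejected.

```javascript
// Hypothetical helper: returns the album id for links such as
// https://imgur.com/a/<id>, or null for anything else.
function extractImgurAlbumId(link) {
  let url;
  try {
    url = new URL(link);
  } catch (err) {
    return null; // not a parseable URL
  }

  // Accept imgur.com and subdomains such as www.imgur.com or m.imgur.com.
  const host = url.hostname.toLowerCase();
  if (host !== "imgur.com" && !host.endsWith(".imgur.com")) {
    return null;
  }

  // Assumption: album paths look like /a/<id> with an alphanumeric id.
  const match = /^\/a\/([A-Za-z0-9]+)\/?$/.exec(url.pathname);
  return match ? match[1] : null;
}
```

The returned id is the piece that would then be handed to the imgur API to list the album's images.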

Notes

This project is still pretty buggy and inefficient. I should probably use a connection pool and have some way to throttle connections when processing starts. I should also include some way to check whether I have already downloaded something, to avoid wasting bandwidth. All in all, I hacked this together quickly one night.

TODO:

Connection Pooling
Fetch past recent items on rss feed
Add ability to skip processed items on rate limit
Avoid fetching existing images based on path
Add rate limiting mitigations/throttling
Add option to only process recent items
Fix occasional hiccups with undefined path args and timeouts (x number of retries?)

Structure

REDDIT_SAVED_RSS_FEED="link to reddit rss feed"
IMGUR_CLIENT_SECRET="imgur client secret"

ENABLE_LOG_SUMMARY="true or false value to enable more robust logging"
DESTINATION="where to drop off image and pdf"
CONNECTIONS=integer value of sockets to use for requests
START_AFTER="reddit id to start after"
SINGLE_BATCH="if defined, only one batch will execute"

The variables above need to be defined in a .env file.
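To make the expected shape of these settings concrete, here is a minimal validation sketch. The function readConfig and the returned field names are hypothetical, and two choices are assumptions rather than documented behavior: which variables are treated as required, and the fallback to one connection when CONNECTIONS is unset.

```javascript
// Hypothetical config reader. Pass process.env (after the .env file has
// been loaded) or any plain object with the same keys.
function readConfig(env) {
  // Assumption: these three are mandatory, the rest are optional.
  const required = ["REDDIT_SAVED_RSS_FEED", "IMGUR_CLIENT_SECRET", "DESTINATION"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing .env variables: ${missing.join(", ")}`);
  }

  // Assumption: fall back to a single connection when CONNECTIONS is unset.
  const connections = env.CONNECTIONS === undefined ? 1 : Number(env.CONNECTIONS);
  if (!Number.isInteger(connections) || connections < 1) {
    throw new Error("CONNECTIONS must be a positive integer");
  }

  return {
    feedUrl: env.REDDIT_SAVED_RSS_FEED,
    imgurClientSecret: env.IMGUR_CLIENT_SECRET,
    destination: env.DESTINATION,
    connections,
    logSummary: env.ENABLE_LOG_SUMMARY === "true",
    startAfter: env.START_AFTER || null,
    singleBatch: env.SINGLE_BATCH !== undefined, // "if defined"
  };
}
```

Failing early on a missing or malformed value avoids discovering the problem halfway through a batch.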

All images are stored using this path convention

DESTINATION/SUBREDDIT_SOURCE/POST_TITLE                 # base path
                                       /page_0[1-9].png # prefixed with 0 when the page number is a single digit
                                       /page_[10+].png
                                       /POST_TITLE.pdf  # all pages stitched together
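
The path convention above could be built along these lines. This is only a sketch: pagePath and pdfPath are hypothetical names, path.posix is used so the example always produces forward slashes, and sanitizing post titles that contain path separators is left out.

```javascript
const path = require("path");

// Hypothetical helper mirroring the layout above. The page number is
// supplied by the caller and zero-padded to two digits.
function pagePath(destination, subreddit, postTitle, pageNumber) {
  const padded = String(pageNumber).padStart(2, "0"); // 3 -> "03", 12 -> "12"
  return path.posix.join(destination, subreddit, postTitle, `page_${padded}.png`);
}

// Hypothetical helper for the stitched PDF that sits next to the pages.
function pdfPath(destination, subreddit, postTitle) {
  return path.posix.join(destination, subreddit, postTitle, `${postTitle}.pdf`);
}

console.log(pagePath("backup", "pics", "Trip", 3)); // backup/pics/Trip/page_03.png
console.log(pdfPath("backup", "pics", "Trip"));     // backup/pics/Trip/Trip.pdf
```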

If you are rate limited, use START_AFTER to skip feed elements that were already processed. A reddit id is posted in the logs every batch. Use CONNECTIONS to set a limit on the number of connections used to fetch data.
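
To illustrate the idea behind START_AFTER, here is a minimal in-memory sketch. It is not necessarily how the project implements it (the id may instead be handed to Reddit when requesting the feed); skipProcessed is a hypothetical name and feed items are assumed to be objects with an id field.

```javascript
// Hypothetical helper: drop every feed item up to and including the one
// whose id equals startAfter, keeping only the items that follow it.
function skipProcessed(items, startAfter) {
  if (!startAfter) {
    return items; // nothing to skip
  }
  const index = items.findIndex((item) => item.id === startAfter);
  // Assumption: if the id is not in this batch, process the whole batch.
  return index === -1 ? items : items.slice(index + 1);
}
```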

After every RSS batch, there is a 3 second delay before the next batch starts. This gives I/O processes time to keep up and acts as a rudimentary guard against hitting the imgur servers too frequently.
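
A delay between batches like the one described above can be sketched as follows. The names sleep and processBatches are hypothetical, and handleBatch stands in for whatever downloads one batch; only the 3 second default comes from the description above.

```javascript
// Resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical driver: handle batches one at a time, pausing between them.
async function processBatches(batches, handleBatch, delayMs = 3000) {
  for (let i = 0; i < batches.length; i += 1) {
    await handleBatch(batches[i]);
    if (i < batches.length - 1) {
      await sleep(delayMs); // no pause needed after the final batch
    }
  }
}
```

Awaiting each batch before sleeping keeps batches strictly sequential, so the delay is measured from the end of one batch to the start of the next.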

The stitched PDF is always regenerated. I currently do not have a way to detect whether a previously missing image has since been fetched (i.e. missing because of rate limiting or partial batch processing).

How to use

git clone https://github.com/lamdaV/RedditFeedFetcher.git
yarn install or npm install
create a .env file and fill out relevant information (see above)
yarn start or npm run start
- On a Linux or macOS environment, run yarn start | tee path/to/output.log to log to both stdout and a file.
