Parallelize feed parser, make it run forever #547
Merged
Issue This PR Addresses
Fixes #503
Type of Change
Description
This PR improves our feed parser. Specifically, it makes it run forever, continually downloading the feeds and parsing them into Redis. It also allows for parallelization, so we can run multiple instances of our feed worker in different processes.
In order to make this work, I've also done a few other things:
- Moved the feed-processing code into `src/backend/feed/processor.js`, so that we can run it in a separate process or processes.
- Added a new `env` variable: `FEED_QUEUE_PARALLEL_WORKERS`. This is the number of parallel worker instances to run in separate processes, and can be 1, 2, 3... up to `*`, which means "one per CPU." I cap this at the CPU count so it doesn't bring the server down (see the first sketch at the end of this description).
- Updated how we use `request` via the feedparser `parse` function. I wanted to shorten the timeout, so dead blogs don't block a worker for so long. I've also enabled gzip, which makes downloads faster (third sketch below).
- Used the `drained` event on the feed queue to trigger another round of updates. Whenever we finish processing one set of feeds, and `drained` occurs, we'll start over. This is our infinite loop (second sketch below).
- Changed `job.id` to be the feed URL instead of an auto-incrementing integer. This allows some extra logic so that multiple jobs for the same feed don't get added to the queue (Bull will reject attempts to add the same job/URL more than once).

This is only the beginning of this work. We can do a bunch more to improve things, especially blacklisting blogs that fail over and over, adding cache headers so we don't re-process feeds that haven't changed since the last update, etc. But for now, this gets us a server that does what we expect: auto-update in an infinite loop.
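For anyone who wants to see how these pieces fit together, here are a few sketches. First, the worker count: a minimal sketch of resolving `FEED_QUEUE_PARALLEL_WORKERS` and starting Bull processors. The queue name and variable names here are illustrative, not necessarily what's in the diff; passing a file path to `queue.process()` is what makes Bull run the processor in separate child processes.

```js
const os = require('os');
const path = require('path');
const Bull = require('bull');

// Illustrative queue name; Bull connects to Redis on localhost by default.
const feedQueue = new Bull('feed-queue');

const cpus = os.cpus().length;
const requested = process.env.FEED_QUEUE_PARALLEL_WORKERS || '1';
// '*' means one worker per CPU; any number is capped at the CPU count
// so a bad value can't bring the server down.
const workers = requested === '*' ? cpus : Math.min(Number(requested), cpus);

// Because the processor is given as a file path (not a function), Bull
// runs it in sandboxed child processes, `workers` of them in parallel.
feedQueue.process(workers, path.resolve(__dirname, 'processor.js'));
```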
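Second, the infinite loop and the duplicate-job guard, continuing the sketch above. `getFeedUrls()` is a hypothetical helper standing in for however the feed list is actually loaded:

```js
// Re-enqueue every feed, using the feed URL as the job id. Bull refuses
// to add a job whose id already exists, so a feed can't be queued twice.
async function enqueueAllFeeds() {
  const urls = await getFeedUrls(); // hypothetical helper
  await Promise.all(urls.map((url) => feedQueue.add({ url }, { jobId: url })));
}

// 'drained' fires when all waiting jobs have been processed; starting
// another round there is what makes the parser run forever.
feedQueue.on('drained', () => {
  enqueueAllFeeds().catch((err) => console.error('re-enqueue failed', err));
});

// Kick off the first round.
enqueueAllFeeds();
```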
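Third, the download side. This is one way the `request`/`feedparser` wiring can look with a shortened timeout and gzip enabled; the 20-second value is illustrative, and the actual `parse` function in the diff may be shaped differently.

```js
const request = require('request');
const FeedParser = require('feedparser');

function parseFeed(url) {
  return new Promise((resolve, reject) => {
    const items = [];
    const feedparser = new FeedParser();

    // A short timeout keeps a dead blog from tying up a worker for long,
    // and gzip cuts download time for feeds that support it.
    request({ url, timeout: 20 * 1000, gzip: true })
      .on('error', reject)
      .pipe(feedparser);

    feedparser
      .on('error', reject)
      .on('readable', function () {
        let item;
        while ((item = this.read())) {
          items.push(item);
        }
      })
      .on('end', () => resolve(items));
  });
}
```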
Checklist