Parallelize feed parser, make it run forever #547
Merged
Issue This PR Addresses
Fixes #503
Type of Change
Description
This PR improves our feed parser. Specifically, it makes it run forever, continually downloading the feeds and parsing them into Redis. It also allows for parallelization, so we can run multiple instances of our feed worker in different processes.
In order to make this work, I've also done a few other things:
- Moved the feed-processing code into `src/backend/feed/processor.js`, so that we can run it in a separate process or processes.
- Added a new `env` variable: `FEED_QUEUE_PARALLEL_WORKERS`. This is the number of parallel worker instances to run in separate processes, and can be 1, 2, 3... up to `*`, which means "one per CPU." I cap this at the CPU count so it doesn't bring the server down (see the first sketch at the end of this description).
- Updated how we use `request` via the feedparser `parse` function. I wanted to shorten the timeout, so dead blogs don't block a worker for so long. I've also enabled gzip, which makes downloads faster (third sketch below).
- Used the `drained` event on the feed queue to trigger another round of updates. Whenever we finish processing one set of feeds, and `drained` occurs, we'll start over. This is our infinite loop (second sketch below).
- Changed `job.id` to be the feed URL instead of an auto-incrementing integer. This allows some extra logic so that multiple jobs for the same feed don't get added to the queue (Bull will reject attempts to add the same job/URL more than once).

This is only the beginning of this work. We can do a bunch more to improve things, especially blacklisting blogs that fail over and over, adding cache headers so we don't re-process feeds that haven't changed since the last update, etc. But for now, this gets us a server that does what we expect: auto-update in an infinite loop.
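For anyone who wants to see how these pieces fit together, here are a few sketches. First, the worker count: a minimal sketch of resolving `FEED_QUEUE_PARALLEL_WORKERS` and starting Bull processors. The queue name and variable names here are illustrative, not necessarily what's in the diff; passing a file path to `queue.process()` is what makes Bull run the processor in separate child processes.

```js
const os = require('os');
const path = require('path');
const Bull = require('bull');

// Illustrative queue name; Bull connects to Redis on localhost by default.
const feedQueue = new Bull('feed-queue');

const cpus = os.cpus().length;
const requested = process.env.FEED_QUEUE_PARALLEL_WORKERS || '1';
// '*' means one worker per CPU; any number is capped at the CPU count
// so a bad value can't bring the server down.
const workers = requested === '*' ? cpus : Math.min(Number(requested), cpus);

// Because the processor is given as a file path (not a function), Bull
// runs it in sandboxed child processes, `workers` of them in parallel.
feedQueue.process(workers, path.resolve(__dirname, 'processor.js'));
```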
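Second, the infinite loop and the duplicate-job guard, continuing the sketch above. `getFeedUrls()` is a hypothetical helper standing in for however the feed list is actually loaded:

```js
// Re-enqueue every feed, using the feed URL as the job id. Bull refuses
// to add a job whose id already exists, so a feed can't be queued twice.
async function enqueueAllFeeds() {
  const urls = await getFeedUrls(); // hypothetical helper
  await Promise.all(urls.map((url) => feedQueue.add({ url }, { jobId: url })));
}

// 'drained' fires when all waiting jobs have been processed; starting
// another round there is what makes the parser run forever.
feedQueue.on('drained', () => {
  enqueueAllFeeds().catch((err) => console.error('re-enqueue failed', err));
});

// Kick off the first round.
enqueueAllFeeds();
```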
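Third, the download side. This is one way the `request`/`feedparser` wiring can look with a shortened timeout and gzip enabled; the 20-second value is illustrative, and the actual `parse` function in the diff may be shaped differently.

```js
const request = require('request');
const FeedParser = require('feedparser');

function parseFeed(url) {
  return new Promise((resolve, reject) => {
    const items = [];
    const feedparser = new FeedParser();

    // A short timeout keeps a dead blog from tying up a worker for long,
    // and gzip cuts download time for feeds that support it.
    request({ url, timeout: 20 * 1000, gzip: true })
      .on('error', reject)
      .pipe(feedparser);

    feedparser
      .on('error', reject)
      .on('readable', function () {
        let item;
        while ((item = this.read())) {
          items.push(item);
        }
      })
      .on('end', () => resolve(items));
  });
}
```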
Checklist