Parallelize feed parser, make it run forever #547

Merged (2 commits, Jan 20, 2020)

Conversation

@humphd (Contributor) commented Jan 18, 2020

Issue This PR Addresses

Fixes #503

Type of Change

  • Bugfix: Change which fixes an issue
  • New Feature: Change which adds functionality
  • Documentation Update: Change which improves documentation
  • UI: Change which improves UI

Description

This PR improves our feed parser. Specifically, it makes it run forever, continually downloading the feeds and parsing them into Redis. It also allows for parallelization, so we can run multiple instances of our feed worker in different processes.

In order to make this work, I've also done a few other things:

  • Moved the feed parsing to its own file, src/backend/feed/processor.js, so that we can run it in a separate process or processes
  • Altered our retry attempts and delay timing when feeds fail. What I have still isn't quite right, but what we had before was way too extreme, and the queue would back up with failed jobs
  • Added a new env variable: FEED_QUEUE_PARALLEL_WORKERS. This is the number of parallel worker instances to run in separate processes, and can be 1, 2, 3... up to *, which means "one per CPU." I cap this at the CPU count so it doesn't bring the server down (sketched after this list)
  • I've started passing HTTP options to request via the feedparser parse function. I wanted to shorten the timeout so dead blogs don't tie up a worker for as long. I've also enabled gzip, which speeds up downloads (also sketched below)
  • I've used the drained event on the feed queue to trigger another round of updates. Whenever we finish processing one set of feeds and drained fires, we start over. This is our infinite loop
  • I've changed the job.id to be the feed URL instead of an auto-incrementing integer. Because Bull rejects attempts to add a job whose id already exists, multiple jobs for the same feed can't pile up in the queue (both ideas are sketched below)

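To make the parallel-workers idea concrete, here is a rough sketch of how the env variable can be interpreted with Bull. The queue name, the REDIS_URL fallback, and the getWorkerCount helper are illustrative rather than the exact code in this PR:

```js
const os = require('os');
const Bull = require('bull');

// Illustrative queue setup for the sketch.
const feedQueue = new Bull('feed-queue', process.env.REDIS_URL || 'redis://127.0.0.1:6379');

// '*' means one worker per CPU; numeric values are capped at the CPU count
// so a misconfigured .env can't oversubscribe the server.
function getWorkerCount() {
  const cpus = os.cpus().length;
  const requested = process.env.FEED_QUEUE_PARALLEL_WORKERS || '1';
  return requested === '*' ? cpus : Math.min(Number(requested) || 1, cpus);
}

// Handing Bull a file path (instead of a function) makes it run the processor
// in sandboxed child processes, one per unit of concurrency.
feedQueue.process(getWorkerCount(), require.resolve('./processor'));
```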
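The HTTP options change is along these lines, assuming a wrapper such as feedparser-promised that forwards its argument to request; the 20-second timeout is an illustrative value, not the final number:

```js
const feedparser = require('feedparser-promised');

// Shorter timeout so a dead blog fails fast instead of tying up a worker,
// plus gzip so responses download faster.
function parseFeed(url) {
  return feedparser.parse({
    uri: url,
    timeout: 20 * 1000, // ms; illustrative value
    gzip: true,
  });
}
```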
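Finally, a rough sketch of the drained-driven loop together with the URL-as-job-id deduplication. The retry and backoff numbers are placeholders (as noted above, they still need tuning), and getFeedUrls is a hypothetical stand-in for wherever the feed list comes from:

```js
const Bull = require('bull');

const feedQueue = new Bull('feed-queue', process.env.REDIS_URL || 'redis://127.0.0.1:6379');

// Hypothetical source of feed URLs, stubbed so the sketch is self-contained.
const getFeedUrls = async () => ['https://example.com/feed.xml'];

// Enqueue every feed, using the feed URL as the job id. Bull refuses to add a
// job whose id already exists, so a feed can't be queued twice at once.
async function enqueueFeeds() {
  const urls = await getFeedUrls();
  await Promise.all(
    urls.map((url) =>
      feedQueue.add(
        { url },
        {
          jobId: url,
          attempts: 5,                                        // placeholder retry count
          backoff: { type: 'exponential', delay: 60 * 1000 }, // placeholder delay
          removeOnComplete: true,                             // let finished jobs expire so the
          removeOnFail: true,                                 // same URL can be queued next round
        }
      )
    )
  );
}

// The infinite loop: when the queue drains, start the next round of updates.
feedQueue.on('drained', () => enqueueFeeds());

// Kick off the first round.
enqueueFeeds();
```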
This is only the beginning of this work. We can do a bunch more to improve things, especially blacklisting blogs that fail over and over, using cache headers so we don't re-process feeds that haven't changed since the last update, etc. But for now, this gets us a server that does what we expect: auto-update in an infinite loop.

Checklist

  • Quality: This PR builds and passes our npm test and works locally
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Screenshots: This PR includes screenshots or GIFs of the changes made or an explanation of why it does not (if applicable)
  • Documentation: This PR includes updated/added documentation for user-exposed functionality or added/changed configuration variables, or an explanation of why it does not (if applicable)

@humphd added the type: enhancement and area: back-end labels on Jan 18, 2020
@humphd self-assigned this on Jan 18, 2020
@Grommers00 self-requested a review on Jan 19, 2020
@Grommers00 (Contributor) previously approved these changes Jan 19, 2020 and left a comment

I still think this moves us forward a lot! More just questions/comments than concerns!

Review threads: env.example (resolved), tools/add-feed.js (outdated, resolved)
@humphd (Contributor, Author) commented Jan 19, 2020

While I was working on this PR, I noticed we had duplicate code for adding feeds. I've included another commit (7642ea1) to refactor this out into its own function. This way, when we play with the retry/backoff numbers, we won't have to remember to change them in multiple places. A sketch of the idea is below.
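The shape of that shared helper is roughly this; the names and option values are illustrative, not what's actually in 7642ea1:

```js
const Bull = require('bull');

const feedQueue = new Bull('feed-queue', process.env.REDIS_URL || 'redis://127.0.0.1:6379');

// One shared place for the job options, so the backend and tools/add-feed.js
// can't drift apart when we tune the retry/backoff numbers.
function addFeedJob(name, url) {
  return feedQueue.add(
    { name, url },
    {
      jobId: url, // same dedup trick: at most one pending job per feed URL
      attempts: 5,
      backoff: { type: 'exponential', delay: 60 * 1000 },
    }
  );
}

module.exports = { addFeedJob };
```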

@cindyorangis (Contributor) left a comment

Obviously approving this perfect PR :P

Successfully merging this pull request may close these issues: Telescope needs to continually update.

4 participants