Analyze queue options for long running SYNC GET #2198

Open
githubbob42 opened this issue May 4, 2019 · 1 comment
@githubbob42 (Owner)

Mingle Card: 2441

Description

We have a few options: we could pick a specialized priority job queue library (such as Kue or monq), we could build a job queue abstraction over a message queue library/platform (such as Amazon SQS, RabbitMQ, ZeroMQ, etc.), or we could refine our own queue to handle multiple consumers (the queue used for processing sync POSTs).

Acceptance Criteria

The queue mechanism we choose to work with needs:

  • Consumers - to listen on the queue and grab messages as they come in. We'll want the consumers to run in multiple individual processes in order to process many sync messages concurrently.
  • Atomic - we don't want multiple consumers pulling down and processing the same message. Once a consumer has claimed a message, no other consumer should be able to process it.
  • Speed - items need to be inserted into the queue quickly (so as not to hold up the sync). Equally, they need to be pulled off the queue quickly. The actual processing time may vary, but the queue needs to handle large numbers of reads/writes.
  • Error handling - if a consumer dies mid-process, the sync message it was holding should be returned to the queue for another consumer to pick up.

Analysis

Specialized library

Kue:

Kue is a priority job queue backed by Redis (rather than our existing server-side data storage platform, MongoDB).

Pros:

  • Rich API for handling errors, including retries with delayed or exponential back-off support.
  • Various logging options baked in, allowing us to expose information at any point in the job's lifetime.
  • Graceful shutdown - sync workers can be told to stop processing on completing the active job, or, in the case of timeout or error, will mark the active job as failed.
  • Rich web UI to visualize the queue - can be mounted inside our express app (and secured with our existing oauth tokens). See http://www.screenr.com/oyNs for a demo.
  • Allows each dyno to control how many jobs will be processed concurrently on a single worker (for instance, based on memory utilization etc.).
  • Good documentation.
  • Large base of existing users.
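To make the Kue option concrete, here is a minimal sketch of what enqueueing and processing a sync job might look like. This assumes `kue` is installed and Redis is reachable; the job type name `'sync-get'`, the payload shape, and the retry/concurrency numbers are all illustrative, not part of the issue.

```javascript
// Sketch only: assumes `kue` (npm install kue) and a running Redis server.
// 'sync-get', the payload fields, and the numeric settings are illustrative.
function syncJobOptions() {
  // Retry/back-off settings along the lines of the acceptance criteria.
  return { attempts: 3, backoff: { type: 'exponential' }, ttl: 60 * 1000 };
}

function enqueueSyncGet(queue, payload) {
  const opts = syncJobOptions();
  return queue
    .create('sync-get', payload)
    .attempts(opts.attempts)
    .backoff(opts.backoff)
    .ttl(opts.ttl)
    .save();
}

function startWorker() {
  const kue = require('kue'); // required lazily so the sketch reads standalone
  const queue = kue.createQueue();
  // Process up to 2 sync jobs concurrently on this dyno.
  queue.process('sync-get', 2, (job, done) => {
    // ... perform the long-running sync GET, then:
    done();
  });
  // Graceful shutdown: finish the active job, or mark it failed after 5s.
  process.once('SIGTERM', () => queue.shutdown(5000, () => process.exit(0)));
  return queue;
}
```

A producer process would call `enqueueSyncGet`, while each worker dyno calls `startWorker`; Kue coordinates both through Redis.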

Cons:

monq:

Monq is a MongoDB-backed job queue.

Pros:

  • Backed by MongoDB (our existing server-side data storage).
  • Allows for MongoDB TTL (time-to-live) indexes - *should* allow us to expire queued jobs periodically (and potentially expire previous jobs for a given auth token such that a user has only one sync per session).
  • Basic support for error handling with retries and delayed retries.

Cons:

  • No UI.
  • Lacking documentation (except for test coverage, which is pretty good).
  • No explicit logging hooks (although it shouldn't be too hard to add our own logging).
  • No out-of-the-box support for handling a process terminating during the processing of a job.
  • Small user base.
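For comparison, a minimal monq sketch might look like the following. This assumes `monq` is installed and MongoDB is reachable; the queue name `'sync'`, the job name `'syncGet'`, and the retry settings are illustrative.

```javascript
// Sketch only: assumes `monq` (npm install monq) and a reachable MongoDB.
// Queue/job names ('sync', 'syncGet') and retry numbers are illustrative.
function retryOptions() {
  // monq-style delayed retries: up to 3 attempts, 5 seconds apart.
  return { attempts: { count: 3, delay: 5000 } };
}

function startProducerAndWorker(mongoUrl) {
  const client = require('monq')(mongoUrl); // lazy so the sketch reads standalone

  // Producer side: push a sync GET job onto the 'sync' queue.
  const queue = client.queue('sync');
  queue.enqueue('syncGet', { token: 'abc123' }, retryOptions(), (err) => {
    if (err) console.error('enqueue failed', err);
  });

  // Consumer side: one of potentially many worker processes.
  const worker = client.worker(['sync']);
  worker.register({
    syncGet: (params, callback) => {
      // ... perform the long-running sync GET, then:
      callback(null, { done: true });
    }
  });
  worker.start();
  return worker;
}
```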

Refine our own queue implementation

To cover the acceptance criteria, we'd need to make a few changes to our queue to correctly handle multiple consumers.

  • Atomic operations:

We can handle the atomic requirement by using MongoDB's findAndModify command in conjunction with adding a new "inProg" boolean to the message envelope. Since findAndModify obtains a write lock on the affected database (blocking other operations until it has completed), we can safely track in-progress jobs and prevent multiple consumers from picking up the same job.

To support findAndModify, we'd have to remove our current usage of tailable cursors and replace it with polling, as these features cannot be used together at the moment (see https://jira.mongodb.org/browse/SERVER-11753).
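The claim step above can be sketched with the Node MongoDB driver's findOneAndUpdate (the driver-level form of findAndModify). The "inProg" flag is from this issue; the collection name `syncQueue`, the `claimedAt` field, and the v4/v5-style result shape (document wrapped in `value`) are assumptions.

```javascript
// Sketch only: assumes the official `mongodb` Node driver and a `syncQueue`
// collection. Field names other than inProg (from the issue) are illustrative.
function claimFilterAndUpdate(now) {
  return {
    filter: { inProg: false },                          // only unclaimed messages
    update: { $set: { inProg: true, claimedAt: now } }  // mark claimed, stamp time
  };
}

async function claimNextMessage(db) {
  const { filter, update } = claimFilterAndUpdate(new Date());
  // findOneAndUpdate is atomic: two consumers polling concurrently can
  // never both receive the same document.
  const result = await db.collection('syncQueue').findOneAndUpdate(
    filter,
    update,
    { sort: { _id: 1 }, returnDocument: 'after' } // oldest message first
  );
  return result.value; // null when the queue is empty (v4/v5 result shape)
}
```

A polling consumer would call `claimNextMessage` in a loop, sleeping briefly when it returns null.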

  • Error Handling

If/when the Heroku process is unexpectedly terminated while a message is being processed, we can attempt to clean up (listening for SIGTERM, then setting inProg to false - or marking the packet with error: shutdown), however this isn't foolproof. In the case of abrupt termination, a job may be left in a stale state (inProg = true) with no corresponding worker.
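A best-effort SIGTERM cleanup along those lines might look like this. The "inProg" flag and the error: shutdown alternative are from this issue; the `currentJobId` variable and `syncQueue` collection name are illustrative.

```javascript
// Sketch only: best-effort SIGTERM cleanup for a Heroku worker dyno.
// `currentJobId` and the `syncQueue` collection name are illustrative.
let currentJobId = null; // set by the consumer while a message is in flight

function installShutdownHandler(db) {
  process.once('SIGTERM', async () => {
    if (currentJobId !== null) {
      // Return the in-flight message to the queue (or, alternatively,
      // mark it with { $set: { error: 'shutdown' } }).
      await db.collection('syncQueue').updateOne(
        { _id: currentJobId },
        { $set: { inProg: false } }
      );
    }
    process.exit(0); // Heroku follows SIGTERM with SIGKILL shortly after
  });
}
```

This is exactly the non-foolproof part: a hard kill (SIGKILL, OOM, hardware failure) never reaches this handler, which is why the timestamp-based sweep below is still needed.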

As part of the findAndModify call, we can set a timestamp so that we know when the message was pulled down. Based on how long we expect consumers to take per message, we can query for documents which are marked as in progress but whose timestamp is older than that threshold.

Using this technique, we can update those documents to reset the "inProg" flag, effectively returning them to the queue for another consumer to work on, or we can mark the document with an error.
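That stale-job sweep could be sketched as a periodic task. The "inProg" flag is from this issue; the 5-minute threshold, `claimedAt` field, and `syncQueue` collection name are assumptions.

```javascript
// Sketch only: requeue messages whose worker died mid-process. The 5-minute
// threshold and field names other than inProg are illustrative.
function staleFilter(now, maxMs) {
  return { inProg: true, claimedAt: { $lt: new Date(now.getTime() - maxMs) } };
}

async function requeueStaleMessages(db, maxMs = 5 * 60 * 1000) {
  // Any message claimed longer than maxMs ago is assumed orphaned:
  // reset inProg so another consumer can pick it up.
  const res = await db.collection('syncQueue').updateMany(
    staleFilter(new Date(), maxMs),
    { $set: { inProg: false }, $unset: { claimedAt: '' } }
  );
  return res.modifiedCount;
}
```

Running this on an interval (or as a scheduled Heroku job) covers the crash cases the SIGTERM handler cannot.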

@githubbob42 (Owner, Author)

Story: #2196 User initiates a sync request on the client (Mingle)
