Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieve timed_out jobs in O(1) #21

Closed
joe-at-startupmedia opened this issue Oct 26, 2021 · 3 comments
Closed

Retrieve timed_out jobs in O(1) #21

joe-at-startupmedia opened this issue Oct 26, 2021 · 3 comments

Comments

@joe-at-startupmedia
Copy link

Hello Dave, great tool here. Here is my first issue. Jobs which have timed out, and no longer retried, are moved to the ended list. There is no remaining index (redis key) to retrieve jobs which have a) timed out and b) have zero retries left without querying each job with a timed_out status. Ideally it would be nice to achieve this in constant time not order of (n). The solution here would be to add another redis list called timed_out which would allow you to retrieve the jobs which have timed_out. There is the failed list but they do not end up there after the retries have been exhausted. Thoughts?

@davechallis
Copy link
Owner

Yup, I can see that would be useful. I'll have more of a think about this, as I'd like to improve on monitoring/tracking of jobs (including being able to e.g. find all timed out jobs), which would help here.

I've got notes somewhere on a few solutions I was considering (separate queues based on final status as you suggested was one of them, maintaining an index somewhere to allow O(1) retrieval was another), I'll see if I can dig those out sometime.

@joe-at-startupmedia
Copy link
Author

This is how I solved it. Note that my branch is not merge-worthy due to my prefacing the key names with the namespace.
joe-at-startupmedia@ed9ca25

@joe-at-startupmedia
Copy link
Author

I've noticed however that in addition to simply adding the job id to the timedout list, several other scenarios need accounted for:

  1. Should we allow jobs that are timed_out to be expired from the system? The point here is that they've basically failed and I'd personally like to keep a persistent record of what failed without automatically purging based on the value specified for expiry_check_internal. This way I can later manually requeue the job if need be (as explained in Best way to requeue a timed_out job. #23)
  2. The delete_in_pipe function also needs to clean the job id from the timedout list. Called by both check_job_expiry and when deleting a job.
    https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/job.rs#L556
  3. The requeue function needs to clean the job id from the timedout list.
    https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/job.rs#L174
  4. The cancel function needs to clean the job id from the timedout list. Called by setting the status to cancelled. https://ocypod.readthedocs.io/en/latest/api/#patch-jobjob_id
    https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/job.rs#L199
  5. Adding the TIMEDOUT_KEY to the server_info function? Called by the GET/ info request. The problem here is that the timed_out job id would present in both the new timedout list as well as the ended so it's unnecessary.
    https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/manager.rs#L82
  6. Adding the TIMEDOUT_KEY to the job_ids function? Called by the GET /queue/{queue_name}/job_ids request. The problem here is that the timed_out job id would present in both the new timedout list as well as the ended list resulting in duplicates.
    https://github.com/joe-at-startupmedia/ocypod/blob/bf9775b7e3c862e8c9446b7fa1c0036fc8660cb6/src/application/queue.rs#L192

Other concerns:

  1. The actual name of the key should take more thought as there is already a designated use for timed_out which is a job status. Further TIMEDOUT is two words and thus deserves an underscore separation.
  2. Should the job id be present in both ended and timedout. The problem with this is that it will result in duplicates being present upon calling job_ids or info as explained in items 5 and 6 above.

Example of duplicates:

curl localhost:8023/queue/[queue_name]/job_ids
{"running":[],"completed":[],"cancelled":[],"timed_out":[3,3],"queued":[],"failed":[]}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants