
Create better logic for lock handling #308

Closed
manast opened this issue Jun 16, 2016 · 19 comments

@manast
Member

manast commented Jun 16, 2016

Currently we have a hazard regarding job locks. A worker may, for whatever reason, fail to renew the lock on the job it is working on in time. When that happens, the stalled-jobs mechanism takes over and starts processing the same job on another worker. If the first worker has not really stalled but is just slow, the queue may end up completing the same job twice.
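
For illustration, a minimal sketch of one way the hazard can be triggered, assuming bull ~1.x (queue name, timings and the `job.jobId` property are assumptions from that era): a processor that blocks the event loop longer than the lock expiration cannot renew its lock, so a second worker running the same code can pick the job up via the stalled-jobs check.

```js
// Hypothetical reproduction sketch, not a test from the repository.
var Queue = require('bull');

var queue = Queue('double-completion-demo', 6379, '127.0.0.1');

queue.process(function (job, done) {
  var end = Date.now() + 60000;
  while (Date.now() < end) {
    // Busy-wait: the event loop is blocked, so the lock renewal timer
    // never fires and the lock eventually expires.
  }
  done();
});

queue.on('completed', function (job) {
  // With a second worker process running this same file, this handler
  // can fire twice for the same job id.
  console.log('completed job', job.jobId);
});

queue.add({ some: 'data' });
```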

@xdc0
Contributor

xdc0 commented Jun 16, 2016

Is this issue to track the fix for that reported hazard, or is it for tracking a bigger improvement to job locking overall? If the latter, what are your ideas for improving it?

@manast
Member Author

manast commented Jun 16, 2016

The latter, because I don't yet understand what is causing the reported issue.
One possible fix would be to check in handleComplete whether the current worker still holds the lock on the job. I also think the current token mechanism is not good enough: we have one token per queue, but we would need one per started job, so that we can distinguish which worker is actually holding the lock on a given job.
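
To make the idea concrete, here is a rough sketch of per-job lock tokens plus a guarded completion, written against plain node_redis rather than Bull's internals; the key names (`bull:test:...`) and helper names are illustrative only, not Bull's actual layout or API.

```js
var redis = require('redis');
var uuid = require('node-uuid');

var client = redis.createClient(6379, '127.0.0.1');

// Take the lock for one job with a token unique to this worker + job.
function takeLock(jobId, token, ttlMs, cb) {
  client.set('bull:test:' + jobId + ':lock', token, 'PX', ttlMs, 'NX', cb);
}

// Move the job to completed only if this worker still owns the lock.
var completeIfOwner = [
  'if redis.call("get", KEYS[1]) == ARGV[1] then',
  '  redis.call("sadd", KEYS[2], ARGV[2])',
  '  redis.call("del", KEYS[1])',
  '  return 1',
  'end',
  'return 0'
].join('\n');

function handleComplete(jobId, token, cb) {
  client.eval(
    completeIfOwner, 2,
    'bull:test:' + jobId + ':lock', // KEYS[1]: per-job lock key
    'bull:test:completed',          // KEYS[2]: completed set
    token,                          // ARGV[1]: this worker's token
    jobId,                          // ARGV[2]: job id
    cb
  );
}

// Usage: a worker that lost its lock gets 0 back and must not report the
// job as completed a second time.
var token = uuid.v4();
takeLock('42', token, 30000, function () {
  handleComplete('42', token, function (err, owned) {
    console.log(owned === 1 ? 'completed' : 'lost the lock, not completing');
  });
});
```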

@manast
Member Author

manast commented Jun 16, 2016

A problem with the approach above: a worker that never manages to renew the token within the specified time could always have lost the lock by the time it reaches handleComplete, so the job would be considered stalled forever and reprocessed forever... not good.

@xdc0
Contributor

xdc0 commented Jun 16, 2016

How about revamping the stalled-job feature? Bull should give clients the interface needed to detect a potentially stuck job and let the client decide what to do; having Bull be opinionated about what to do with these jobs is risky and a source of potential issues.

My thought is to allow a ttl option for jobs: if a job has taken longer than ttl to finish, emit a stalled event. We could also release the lock at that point and let processStalledJobs pick the job up to try again, but that should be an opt-in feature as well (not all clients will want to retry a job, especially if it had side effects before getting stuck).

If a job has no ttl, it will keep processing for as long as the queue runs; it is the client's responsibility to monitor for these (I think we already have a timestamp property on jobs).

A Bull monitoring tool that could give stats about queue health would be pretty nice; Matador is OK, but it doesn't work quite right for high-volume queues.
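
To make the ttl proposal above concrete, here is a sketch of how the opt-in behaviour could look from the caller's side; neither the `ttl` nor the `retryOnStall` option exists in Bull, this is only the shape of the suggested API.

```js
var Queue = require('bull');
var queue = Queue('uploads', 6379, '127.0.0.1');

// Hypothetical options: if the job runs longer than `ttl` ms, Bull would
// emit 'stalled' and, only when the client opts in, release the lock so
// processStalledJobs can retry it.
queue.add({ file: 'report.pdf' }, { ttl: 60000, retryOnStall: false });

queue.on('stalled', function (job) {
  // Bull only reports; the client decides what to do with the job.
  console.log('job', job.jobId, 'exceeded its ttl');
});
```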

@manast
Member Author

manast commented Jun 16, 2016

OK, I understand your points. The thing with processing stalled jobs is that it has grown a little out of hand. In the beginning I needed this functionality so that jobs that were not finalized before the queue was closed would start automatically the next time the queue was started (or be started by another worker in another process). In fact, that is the only real use case I actually care about. If a job stalls because of a badly implemented process function, I do not think the queue should try to solve it for you; a ttl that emits the stalled event is a nice feature that can help the user detect stalled jobs and correct the underlying cause.

@manast
Member Author

manast commented Jun 16, 2016

Btw, take a look at this code (regarding atomically getting stalled jobs): 5429778
I don't think we need to do a releaseLock in the case of failure, since time will unlock it anyway... it would make the code simpler. What do you think?

@xdc0
Contributor

xdc0 commented Jun 16, 2016

This line:
+ 'if redis.call("sismember", KEYS[1], ARGV[1]) then',
should be if not, right? Otherwise it would be processing completed jobs. It should also check whether there is a lock first.
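
For reference, a rough sketch of what the corrected check could look like, written as the script fragment Bull embeds in JavaScript; the key layout (KEYS[1] = completed set, KEYS[2] = per-job lock key) and the argument order are assumptions, not necessarily what commit 5429778 uses.

```js
// Illustrative only: take the lock for a stalled job atomically, but only
// if nobody holds the lock and the job is not already completed.
var script = [
  'if redis.call("exists", KEYS[2]) == 0 and',               // no lock held
  '   redis.call("sismember", KEYS[1], ARGV[1]) == 0 then',  // not completed
  '  redis.call("set", KEYS[2], ARGV[2], "PX", ARGV[3])',    // take the lock
  '  return 1',
  'end',
  'return 0'
].join('\n');
```

With a check like this, the failure path never takes the lock in the first place, which is also why no explicit releaseLock would be needed there.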

Other than that, the lock will certainly be released naturally; the problem is that the job will remain locked, and a periodic Queue#clean won't be able to remove it until the lock expires. I've noticed that the Queue#clean operation takes a long time when trying to clean 50k+ jobs in a particular set. A bit of a tangent, but I'm working on making Queue#clean faster; I've managed to cut the time in half so far, and I'll see if I can go a bit faster before pushing a PR.

In any case, with this change I don't think we'll leave jobs locked in case of failure: processJobs handles locks correctly, and the script won't lock the job on failure. If you look at the current implementation, that release is done in a catch of sismemberAsync, because it takes the lock before checking whether the job is in completed.

@manast
Member Author

manast commented Jun 17, 2016

Thanks for the catch, it was late yesterday and the unit tests were passing :)
Regarding releasing the lock when processJobs throws: it should not affect performance much, since it only happens on errors, which should be very rare, so I think it is worth keeping the code clean in this case.

@viveksatasiya

Hello,

I have a similar issue in my current project. I am using this library to upload PDF files to an AWS S3 bucket; each PDF is around 5 MB. I am using the npm s3 package for the uploads, together with bull.

I am using concurrency 1. When I create a job to upload a PDF, the upload starts and I report its progress, but after some time the stalled event fires and the job starts again while my original job is still in progress.

I can post code that replicates this if you want.

Please let me know if I am doing anything wrong.

Thanks.
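
For context, an illustrative sketch of the reported setup (not the attached test.js): bull with its default concurrency of 1, uploading a file with the `s3` package and reporting progress; the bucket, key, paths and credentials are placeholders.

```js
var Queue = require('bull');
var s3 = require('s3');

var queue = Queue('pdf-upload', 6379, '127.0.0.1');
var client = s3.createClient({
  s3Options: { accessKeyId: 'KEY', secretAccessKey: 'SECRET' }
});

queue.process(function (job, done) {
  var uploader = client.uploadFile({
    localFile: job.data.path,
    s3Params: { Bucket: 'my-bucket', Key: job.data.key }
  });

  uploader.on('progress', function () {
    if (uploader.progressTotal) {
      job.progress(Math.round(100 * uploader.progressAmount / uploader.progressTotal));
    }
  });
  uploader.on('error', done);
  uploader.on('end', function () {
    // For files over ~2 MB the 'stalled' event reportedly fires before
    // the upload reaches this point, and the job is picked up again.
    done();
  });
});

queue.add({ path: './sample.pdf', key: 'uploads/sample.pdf' });
```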

@xdc0
Contributor

xdc0 commented Jun 17, 2016

@viveksatasiya I think what you are running into is this one: #299 - that issue tracks the work to address the scenario you are experiencing.

@manast
Member Author

manast commented Jun 17, 2016

@viveksatasiya if you can post code that reproduces this issue, it would be really helpful.

@viveksatasiya

@chuym yes, you are correct, but there is no solution there yet, so I have posted here.

@manast here is my code.

test.js.zip

The job goes into stalled mode when I try to upload a file larger than 2 MB, and the queue adds that job as a new job. You can reproduce it using this file.

Tell me if you need more information.

Thanks.

@manast
Member Author

manast commented Jun 20, 2016

@viveksatasiya thanks for the code, I can reproduce the issue now. I will start debugging and find the root cause of this once and for all.

@manast manast self-assigned this Jun 20, 2016
@viveksatasiya

Ok thanks. 👍

@manast
Member Author

manast commented Jun 20, 2016

@chuym interestingly, I have found some quite severe errors in the locking mechanism... it is strange that this is not causing much more trouble in other scenarios.

@manast
Member Author

manast commented Jun 20, 2016

I have released 1.0.0-rc4; please check with this new version.

@xdc0
Contributor

xdc0 commented Jun 20, 2016

Nice, is there a unit test you can add that checks for it? This is the commit that fixes the issue, right? 8a9e455

@viveksatasiya

OK, thanks. Let me check and get back to you.

@viveksatasiya

The problem is gone!! Thanks a lot, now it is working fine. I checked the new version with multiple files ranging from 2 MB to 5 MB, and every job ran successfully.

Again Thanks a lot.

Appreciated!! 👍
