
Create better logic for lock handling #308

Closed
manast opened this issue Jun 16, 2016 · 19 comments

@manast
Member

manast commented Jun 16, 2016

Currently we have a hazard regarding job locks. A worker may, for whatever reason, fail to renew the lock on the job it is working on in time. When that happens, the stalled-jobs mechanism takes over and starts processing the same job on another worker. If the first worker has not really stalled but is just slow, the queue may end up completing the same job twice.
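
For illustration, a minimal sketch of one way the hazard can be triggered, assuming bull ~1.x (queue name, timings and the `job.jobId` property are assumptions from that era): a processor that blocks the event loop longer than the lock expiration cannot renew its lock, so a second worker running the same code can pick the job up via the stalled-jobs check.

```js
// Hypothetical reproduction sketch, not a test from the repository.
var Queue = require('bull');

var queue = Queue('double-completion-demo', 6379, '127.0.0.1');

queue.process(function (job, done) {
  var end = Date.now() + 60000;
  while (Date.now() < end) {
    // Busy-wait: the event loop is blocked, so the lock renewal timer
    // never fires and the lock eventually expires.
  }
  done();
});

queue.on('completed', function (job) {
  // With a second worker process running this same file, this handler
  // can fire twice for the same job id.
  console.log('completed job', job.jobId);
});

queue.add({ some: 'data' });
```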

@xdc0
Contributor

xdc0 commented Jun 16, 2016

Is this issue to track the fix for that reported hazard, or is it for tracking a bigger improvement to job locking overall? If the latter, what are your ideas for improving it?

@manast
Member Author

manast commented Jun 16, 2016

The latter, because I don't yet understand what is causing the reported issue.
One possible fix would be to check in handleComplete whether the current worker still holds the lock on the job. I also think the current token mechanism is not good enough: we have one token per queue, but we would need one per started job, so that we can distinguish which worker is actually holding the lock on a given job.
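
To make the idea concrete, here is a rough sketch of per-job lock tokens plus a guarded completion, written against plain node_redis rather than Bull's internals; the key names (`bull:test:...`) and helper names are illustrative only, not Bull's actual layout or API.

```js
var redis = require('redis');
var uuid = require('node-uuid');

var client = redis.createClient(6379, '127.0.0.1');

// Take the lock for one job with a token unique to this worker + job.
function takeLock(jobId, token, ttlMs, cb) {
  client.set('bull:test:' + jobId + ':lock', token, 'PX', ttlMs, 'NX', cb);
}

// Move the job to completed only if this worker still owns the lock.
var completeIfOwner = [
  'if redis.call("get", KEYS[1]) == ARGV[1] then',
  '  redis.call("sadd", KEYS[2], ARGV[2])',
  '  redis.call("del", KEYS[1])',
  '  return 1',
  'end',
  'return 0'
].join('\n');

function handleComplete(jobId, token, cb) {
  client.eval(
    completeIfOwner, 2,
    'bull:test:' + jobId + ':lock', // KEYS[1]: per-job lock key
    'bull:test:completed',          // KEYS[2]: completed set
    token,                          // ARGV[1]: this worker's token
    jobId,                          // ARGV[2]: job id
    cb
  );
}

// Usage: a worker that lost its lock gets 0 back and must not report the
// job as completed a second time.
var token = uuid.v4();
takeLock('42', token, 30000, function () {
  handleComplete('42', token, function (err, owned) {
    console.log(owned === 1 ? 'completed' : 'lost the lock, not completing');
  });
});
```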

@manast
Member Author

manast commented Jun 16, 2016

A problem with the approach above: a worker that never manages to renew the token within the specified time could always have lost the lock by the time it reaches handleComplete, so the job would be considered stalled forever and reprocessed forever... not good.

@xdc0
Contributor

xdc0 commented Jun 16, 2016

How about revamping the stalled-job feature? Bull should give clients the interface needed to detect a potentially stuck job and let the client decide what to do; having Bull be opinionated about what to do with these jobs is risky and a source of potential issues.

My thought is to allow a ttl option for jobs: if a job has taken longer than ttl to finish, emit a stalled event. We could also release the lock at that point and let processStalledJobs pick the job up to try again, but that should be an opt-in feature as well (not all clients will want to retry a job, especially if it had side effects before getting stuck).

If a job has no ttl, it will keep processing for as long as the queue runs; it is the client's responsibility to monitor for these (I think we already have a timestamp property on jobs).

A Bull monitoring tool that could give stats about queue health would be pretty nice; Matador is OK, but it doesn't work quite right for high-volume queues.
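
To make the ttl proposal above concrete, here is a sketch of how the opt-in behaviour could look from the caller's side; neither the `ttl` nor the `retryOnStall` option exists in Bull, this is only the shape of the suggested API.

```js
var Queue = require('bull');
var queue = Queue('uploads', 6379, '127.0.0.1');

// Hypothetical options: if the job runs longer than `ttl` ms, Bull would
// emit 'stalled' and, only when the client opts in, release the lock so
// processStalledJobs can retry it.
queue.add({ file: 'report.pdf' }, { ttl: 60000, retryOnStall: false });

queue.on('stalled', function (job) {
  // Bull only reports; the client decides what to do with the job.
  console.log('job', job.jobId, 'exceeded its ttl');
});
```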

@manast
Member Author

manast commented Jun 16, 2016

OK, I understand your points. The thing with processing stalled jobs is that it has grown a little out of hand. In the beginning I needed this functionality so that jobs that were not finalized before the queue was closed would start automatically the next time the queue was started (or be started by another worker in another process). In fact, that is the only real use case I actually care about. If a job stalls because of a badly implemented process function, I do not think the queue should try to solve it for you; a ttl that emits the stalled event is a nice feature that can help the user detect stalled jobs and correct the underlying cause.

@manast
Member Author

manast commented Jun 16, 2016

Btw, take a look at this code (regarding atomically getting stalled jobs): 5429778
I don't think we need to do a releaseLock in the case of failure, since time will unlock it anyway... it would make the code simpler. What do you think?

@xdc0
Contributor

xdc0 commented Jun 16, 2016

This line:
+ 'if redis.call("sismember", KEYS[1], ARGV[1]) then',
should be if not, right? Otherwise it would be processing completed jobs. It should also check whether there is a lock first.
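
For reference, a rough sketch of what the corrected check could look like, written as the script fragment Bull embeds in JavaScript; the key layout (KEYS[1] = completed set, KEYS[2] = per-job lock key) and the argument order are assumptions, not necessarily what commit 5429778 uses.

```js
// Illustrative only: take the lock for a stalled job atomically, but only
// if nobody holds the lock and the job is not already completed.
var script = [
  'if redis.call("exists", KEYS[2]) == 0 and',               // no lock held
  '   redis.call("sismember", KEYS[1], ARGV[1]) == 0 then',  // not completed
  '  redis.call("set", KEYS[2], ARGV[2], "PX", ARGV[3])',    // take the lock
  '  return 1',
  'end',
  'return 0'
].join('\n');
```

With a check like this, the failure path never takes the lock in the first place, which is also why no explicit releaseLock would be needed there.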

Other than that, the lock will certainly be released naturally; the problem is that the job will remain locked, and a periodic Queue#clean won't be able to remove it until the lock expires. I've noticed that the Queue#clean operation takes a long time when trying to clean 50k+ jobs in a particular set. A bit of a tangent, but I'm working on making Queue#clean faster; I've managed to cut the time in half so far, and I'll see if I can go a bit faster before pushing a PR.

In any case, with this change I don't think we'll leave jobs locked in case of failure: processJobs handles locks correctly, and the script won't lock the job on failure. If you look at the current implementation, that release is done in a catch of sismemberAsync, because it takes the lock before checking whether the job is in completed.

@manast
Member Author

manast commented Jun 17, 2016

Thanks for the catch, it was late yesterday and the unit tests were passing :)
Regarding releasing the lock when processJobs throws: it should not affect performance much, since it only happens on errors, which should be very rare, so I think it is worth keeping the code clean in this case.

@viveksatasiya

Hello,

I have a similar issue in my current project. I am using this library to upload PDF files to an AWS S3 bucket; each PDF is around 5 MB. I am using the npm s3 package for the uploads, together with bull.

I am using concurrency 1. When I create a job to upload a PDF, the upload starts and I report its progress, but after some time the stalled event fires and the job starts again while my original job is still in progress.

I can post code that replicates this if you want.

Please let me know if I am doing anything wrong.

Thanks.
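
For context, an illustrative sketch of the reported setup (not the attached test.js): bull with its default concurrency of 1, uploading a file with the `s3` package and reporting progress; the bucket, key, paths and credentials are placeholders.

```js
var Queue = require('bull');
var s3 = require('s3');

var queue = Queue('pdf-upload', 6379, '127.0.0.1');
var client = s3.createClient({
  s3Options: { accessKeyId: 'KEY', secretAccessKey: 'SECRET' }
});

queue.process(function (job, done) {
  var uploader = client.uploadFile({
    localFile: job.data.path,
    s3Params: { Bucket: 'my-bucket', Key: job.data.key }
  });

  uploader.on('progress', function () {
    if (uploader.progressTotal) {
      job.progress(Math.round(100 * uploader.progressAmount / uploader.progressTotal));
    }
  });
  uploader.on('error', done);
  uploader.on('end', function () {
    // For files over ~2 MB the 'stalled' event reportedly fires before
    // the upload reaches this point, and the job is picked up again.
    done();
  });
});

queue.add({ path: './sample.pdf', key: 'uploads/sample.pdf' });
```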

@xdc0
Contributor

xdc0 commented Jun 17, 2016

@viveksatasiya I think what you are running into is this one: #299 - that issue tracks the work to address the scenario you are experiencing.

@manast
Member Author

manast commented Jun 17, 2016

@viveksatasiya if you can post code that reproduces this issue, it would be really helpful.

@viveksatasiya

@chuym yes, you are correct, but there is no solution there yet, so I have posted here.

@manast here is my code.

test.js.zip

The job goes into stalled mode when I try to upload a file larger than 2 MB, and the queue adds that job as a new job. You can reproduce it using this file.

Tell me if you need more information.

Thanks.

@manast
Member Author

manast commented Jun 20, 2016

@viveksatasiya thanks for the code, I can reproduce the issue now. I will start debugging and find the root cause of this once and for all.

@manast manast self-assigned this Jun 20, 2016
@viveksatasiya

Ok thanks. 👍

@manast
Member Author

manast commented Jun 20, 2016

@chuym interestingly, I have found some quite severe errors in the locking mechanism... it is strange that this is not causing much more trouble in other scenarios.

@manast
Member Author

manast commented Jun 20, 2016

I have released 1.0.0-rc4; please check with this new version.

@xdc0
Contributor

xdc0 commented Jun 20, 2016

Nice, is there a unit test you can add that checks for it? This is the commit that fixes the issue, right? 8a9e455

@viveksatasiya

OK, thanks. Let me check and get back to you.

@viveksatasiya

The problem is gone!! Thanks a lot, now it is working fine. I checked the new version with multiple files ranging from 2 MB to 5 MB, and every job ran successfully.

Again Thanks a lot.

Appreciated!! 👍
