Create better logic for lock handling #308
Is this issue to track the fix for that reported hazard, or is this for tracking a bigger improvement to job locking overall? If the latter, what are your ideas on improving it?
The latter, because I don't yet understand what is causing the reported issue.
A problem with the approach above is that a job that never manages to renew the token in the specified time could potentially always lose the lock when arriving at the …
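For context, here is a minimal sketch of the kind of token-based lock renewal being discussed, assuming ioredis and a hypothetical `lock:<jobId>` key scheme; this is an illustration, not Bull's actual implementation:

```js
const Redis = require('ioredis');

const redis = new Redis();
const LOCK_TTL = 30000; // ms a lock lives before the stalled check may reclaim it

// Renew the lock only if this worker still owns it (token matches).
// A plain PEXPIRE would be unsafe: it could extend a lock that another
// worker has already taken over.
const RENEW_LUA = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("pexpire", KEYS[1], ARGV[2])
end
return 0
`;

async function takeLock(jobId, token) {
  // NX: acquire only if nobody holds it; PX: auto-expire so a dead
  // worker cannot keep the lock forever.
  const ok = await redis.set(`lock:${jobId}`, token, 'PX', LOCK_TTL, 'NX');
  return ok === 'OK';
}

async function renewLock(jobId, token) {
  const res = await redis.eval(RENEW_LUA, 1, `lock:${jobId}`, token, LOCK_TTL);
  return res === 1; // 0 means another worker owns the job now
}
```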
How about revamping the stalled job feature? Bull should be able to give clients the interface needed to detect a potentially stuck job and let the client decide what to do; having Bull be opinionated about what to do with these jobs is risky and a source of potential issues. My thinking is about allowing a ttl per job. If a job has no ttl, it will keep processing for as long as the queue runs. It is the client's responsibility to monitor for these (I think we already have a timestamp property for jobs). A Bull monitoring tool would be pretty nice if it could give stats about queue health; Matador is ok, but it doesn't work quite right for high-volume queues.
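A hypothetical sketch of what that per-job ttl could look like; the `ttl` option and this exact event flow are assumptions made for illustration, not Bull's current API:

```js
const Queue = require('bull');
const queue = new Queue('uploads');

// The client declares how long a job may run; Bull only reports the
// overrun via an event instead of deciding what to do with the job.
queue.add({ file: 'report.pdf' }, { ttl: 60000 });

// The policy stays with the client: retry, discard, alert, etc.
queue.on('stalled', (job) => {
  console.log(`job ${job.id} exceeded its ttl; what happens next is up to us`);
});
```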
Ok, I understand your points. The thing with processing stalled jobs is that it has grown a little out of hand. In the beginning I needed this functionality so that jobs that were not finalised before the queue was closed for some reason would start automatically the next time the queue was started (or was started by another worker in another process). In fact it is the only real use case I actually care about. If a job stalls because of a bad implementation of the process function, I do not think the queue should try to solve it for you; having a ttl that emits the stalled event is a nice feature that can help the user detect stalled jobs and correct the underlying reason.
Btw, take a look at this code (regarding atomically getting stalled jobs): 5429778
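For illustration, a rough sketch of what atomically collecting stalled jobs can look like, assuming hypothetical key names (an `active` list and `lock:<id>` string keys); the commit referenced above is the authoritative version:

```js
// Lua: collect ids from the active list whose lock key has expired.
const GET_STALLED_LUA = `
local stalled = {}
local active = redis.call("lrange", KEYS[1], 0, -1)
for _, id in ipairs(active) do
  -- a missing lock key means the owner died or failed to renew in time
  if redis.call("exists", "lock:" .. id) == 0 then
    table.insert(stalled, id)
  end
end
return stalled
`;

async function getStalledJobs(redis) {
  // One EVAL keeps the whole check-and-collect step atomic: no worker
  // can re-lock a job halfway through the scan.
  return redis.eval(GET_STALLED_LUA, 1, 'active');
}
```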
This line: … Other than that, the lock will certainly be released naturally; the problem with that is that it will remain locked and a periodic … In any case, with this change, I don't think we'll leave locked jobs in case of failure.
Thanks for the catch, it was late yesterday and the unit tests were passing :)
Hello, I have a similar issue in my current project. I am using this library for uploading pdf files to an aws s3 bucket, and the file size of every pdf is around 5 MB. I am using npm s3 for uploading files with bull, with concurrency 1. When I create a job to upload one pdf file, it starts uploading the file and I start reporting the progress of the upload, but suddenly after some time the stalled event fires and the job starts again while my original job is still in process. I can post code here that replicates this if you want. Please tell me if I am doing anything wrong. Thanks.
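A minimal sketch of the scenario described above; names and timings are illustrative, and the handler is written against Bull's promise-style `process` API for brevity:

```js
const Queue = require('bull');
const queue = new Queue('pdf-uploads');

// One worker, one slow job: the simulated upload runs long enough for
// the lock to expire even though the handler is still making progress.
queue.process(1, async (job) => {
  for (let pct = 0; pct <= 100; pct += 10) {
    await new Promise((resolve) => setTimeout(resolve, 3000));
    job.progress(pct);
  }
});

queue.on('stalled', (job) => {
  // Fires even though the original worker is still mid-upload.
  console.log(`job ${job.id} reported as stalled while still processing`);
});

queue.add({ file: 'invoice.pdf' });
```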
@viveksatasiya I think what you are running into is this one: #299 - this task tracks a fix for the scenario you are experiencing.
@viveksatasiya if you can post the code that reproduces this issue, it would be really helpful.
@viveksatasiya thanks for the code, I can reproduce the issue now. I will start debugging and find the root cause of this once and for all.
Ok thanks. 👍
@chuym interestingly I have found quite severe errors regarding the locking mechanism... strange that this is not causing much more trouble in other scenarios.
I released 1.0.0-rc4; please check with this new version.
Nice, is there a unit test that you can add that checks for it? This is the commit that fixes the issue, right? 8a9e455
Ok thanks. Let me check and get back to you.
The problem is gone!! Thanks a lot. Now it is working fine. I have checked the new version with multiple files, sizes varying between 2 MB and 5 MB, and every job ran successfully. Again, thanks a lot. Appreciated!! 👍
Currently we have a hazard regarding job locks. It can happen that a worker working on a job is, for some reason, not able to renew the lock on that job in time. When this happens, the stalled-jobs mechanism takes over and starts processing the same job on another worker. If the first worker has not really stalled, but is just slow at processing, the queue may end up completing the same job twice.
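One way to close that double-completion window is to make the final "complete" step conditional on still owning the lock. A sketch of the idea, with key names and flow as assumptions rather than Bull's actual code:

```js
// Lua: finish the job only while still holding the lock; one script so
// the ownership check and the state change are atomic.
const COMPLETE_IF_OWNER_LUA = `
if redis.call("get", KEYS[1]) ~= ARGV[1] then
  return 0
end
redis.call("lrem", KEYS[2], 1, ARGV[2])
redis.call("sadd", KEYS[3], ARGV[2])
redis.call("del", KEYS[1])
return 1
`;

async function completeJob(redis, jobId, token) {
  const ok = await redis.eval(
    COMPLETE_IF_OWNER_LUA, 3,
    `lock:${jobId}`, 'active', 'completed', // KEYS
    token, jobId                            // ARGV
  );
  if (ok === 0) {
    // A slow worker arriving late drops its result instead of
    // completing the job a second time.
    console.warn(`job ${jobId}: lock lost, skipping completion`);
  }
  return ok === 1;
}
```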