Jobs get stuck after Redis reconnect #1873
I tested the code above (I just changed the timeout to 1 second for faster testing), and it works perfectly on macOS. I tried stopping and restarting the redis server many times, and it always continues processing as soon as the redis server is back online. Is there anything else you can share in order to reproduce the issue? |
@manast It's strange, I could reproduce it only on one instance; in other places I could not. I will try to investigate this more. However, there is a similar issue we found which is reproducible everywhere. The scenario is below:
I tried analyzing the code and found that inside the https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L148 |
okok, so you mean if redis is not running when bull starts, I will look into it. |
@royalpinto @manast yep, I have the same issue. |
+1 |
I marked this as an enhancement because I consider it a new requirement, i.e. being able to work when starting without an available redis connection if that connection becomes available at a later time. May require some non trivial refactoring. |
Actually, I can reproduce this scenario
I think this is a bug. Do you have any workaround for that @manast ? |
I have met the same issue. When the redis server crashed for a long time, longer than the redis reconnect window, the job was just added into the queue and kept waiting, never starting. |
@koiszzz I cannot find how many times ioredis reconnects before it gives up, maybe it tries to reconnect forever or maybe not, this should be investigated, however I tested extensively reconnects in latest version of Bull and could not reproduce any issue. If anybody does please post some instructions on how to reproduce it, ideally a piece of code I can run so I can take a deeper look into it. |
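On the give-up question: the ioredis README documents the default `retryStrategy`, under which reconnecting never gives up; it only backs off. A minimal sketch of that documented default (the function name here is illustrative):

```js
// By default ioredis keeps retrying forever: the delay before attempt
// N+1 is min(N * 50, 2000) milliseconds. Returning a number from
// retryStrategy means "retry after this many ms"; returning null would
// stop reconnecting, but the default never does.
const defaultRetryStrategy = (times) => Math.min(times * 50, 2000)

console.log(defaultRetryStrategy(1))   // 50 ms before the 2nd attempt
console.log(defaultRetryStrategy(40))  // 2000 ms, the cap
console.log(defaultRetryStrategy(999)) // still 2000 ms, indefinitely
```

So unless an application supplies its own `retryStrategy`, "tries to reconnect forever" is the documented behavior. |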
I posted a bull test project, bull-test. |
@koiszzz I tested your code. I guess you are testing using a redis instance inside docker? I experience a bug with ioredis reported here: redis/ioredis#1285 where reconnect does not work if redis is running inside a docker container in MacOS at least. Your repo works well for me if I use a local redis instance but it does not work if I use a redis docker instance... So if this is your case I suggest you ping Luin to see if he can reproduce it, he has failed to reproduce it so far but if more people than me can reproduce it, then maybe it will get more attention... alternatively that I debug ioredis myself but not sure when I will find time for that... |
Thanks for replying, @manast. The redis server is installed in docker. Maybe I should install a redis server without docker for testing. |
Lol.... I had the same issue. I was reading your comments and doing some tests, and I realised that when running redis via docker, the redis data has no persistence, and therefore Bull jobs are deleted when the container is stopped and restarted... The "solution" is to start redis via docker-compose using a volume :) |
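To make that concrete, a minimal docker-compose sketch of the volume fix (service and volume names are illustrative; the key parts are mounting a volume at `/data` and enabling AOF persistence so jobs survive a container restart):

```yaml
services:
  redis:
    image: redis:6
    # append-only file persistence, written to /data
    command: redis-server --appendonly yes
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data

volumes:
  redis-data:
```

Without the volume, every container restart wipes the queue keys, which looks exactly like "jobs disappeared after reconnect". |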
I encountered this error and have been investigating; I'll share our findings here, maybe it helps. The problem (as I understand it) is that when the client reconnects, the queue's processing loop is not started again. Our patch fix for this (at least locally, not tested in production yet) is basically to call the queue's run method again when ioredis fires its `ready` event:

```js
const Queue = require('bull')
const Redis = require('ioredis')

const client = new Redis({
  host: 'localhost',
  port: 6379
})

const queue = new Queue('test', client)

let reconnectingEvent = false

client.on('ready', () => {
  console.log('client ready')
  if (reconnectingEvent) { // this is actually the fix, we just set a flag to not kick run unless we are restarting
    console.log('rerun queue run')
    reconnectingEvent = false
    queue.run(queue.concurrency)
  }
})

client.on('connect', () => {
  console.log('client connect')
})

client.on('close', () => {
  console.log('client close')
})

client.on('end', error => {
  console.log('client end', error)
})

client.on('reconnecting', error => {
  console.log('client reconnecting', error)
  reconnectingEvent = true
})

client.on('error', error => {
  console.log('client error', error.message)
})

queue
  .on('completed', (job, results) => {
    console.log('queue completed a job')
  })
  .on('error', async (error) => {
    console.error('queue error', error.message)
  })
  .on('failed', async (job, error) => {
    console.error('queue failed a job', error.message)
  })

queue.process(async (job) => {
  console.log('running task')
})

setInterval(function () {
  queue.add({}).then(() => console.log('pushed task')).catch((e) => console.log('failed to push task'))
}, 10000)
```

I'm not sure if this is a "good" solution, but it works for now. Maybe the maintainers have a better idea of how to fix this cleanly and properly, or can describe what I should PR a fix for. @tarunbatra do you have anything to add, or do you feel this sufficiently describes our findings? |
This sounds like a good solution actually; we just need to make sure that the old "loop" is not running when the "ready" event is fired by ioredis, so basically we should proactively "cancel" the previous loop when starting the new one. |
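A hypothetical sketch of what such a cancellable loop could look like. None of these names (`makeToken`, `runLoop`, `getNextJob`) are Bull's actual internals; `getNextJob` stands in for the blocking brpoplpush wait and `handler` for the registered processor:

```js
// A token the "ready" handler can flip to stop the old loop.
function makeToken() {
  return { cancelled: false, cancel() { this.cancelled = true } }
}

// The loop checks the token before and after each blocking wait, so a
// stale loop exits instead of competing with the freshly started one.
async function runLoop(token, getNextJob, handler) {
  while (!token.cancelled) {
    const job = await getNextJob()
    if (job == null) break     // nothing left, in this sketch
    if (token.cancelled) break // re-check after the blocking wait
    await handler(job)
  }
}

// On ioredis "ready" after a reconnect, the old loop would be cancelled
// and a fresh one started with a new token:
//   oldToken.cancel()
//   runLoop(makeToken(), getNextJob, handler)
```

The point of the token is exactly the "cancel the previous loop proactively" step: a loop that is parked on a dead blocking connection never wakes up, but once its token is cancelled it can do no harm even if it does. |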
@manast How would you approach this? Is there something we can check today (without monkey patching the lib) for an existing running loop? |
I have the same issue using Heroku; when it restarts, all queues get stuck in waiting. |
@miwnwski yeah it is a bit tricky, you need to somehow refactor the "run" method so that it can be cancellable https://github.com/OptimalBits/bull/blob/develop/lib/queue.js#L864 |
I have the same issue using GCP. Is there any quick solution for that? |
Faced the same issue using AWS Elasticache. Looks like there was a connection interruption or some kind of a switch. No error logs or failed jobs, discovered the issue by missing data in the database. |
+1 |
We are also encountering this issue. |
Hi, I'm working in a kubernetes cluster and having the same issue. My workaround is to let the queue processor container die whenever it cannot connect to the redis container; Kubernetes will take care of restarting my processor pod.

```js
const redisOptions = {
  ...config.get('redis.connectParams'),
  retryStrategy: (times) => {
    console.log('could not connect to redis!');
    process.exit(1);
  },
};
```
|
Currently in dev and have the same problem. It appears after waking my system from hibernation. I am using a remote Redis server, so the connection gets dropped after a while when the system sleeps :-) ioredis will reconnect automatically, but Bull won't pull new jobs. |
Hi there. I do have some logs on our open source application https://github.com/1024pix/pix-editor
Hope this helps? I'm available to debug anything |
My MongoDB driver fires the disconnect event when I wake my system from sleep, and it reconnects directly. Hm, I changed a while ago from Node Redis to ioredis; I think Node Redis didn't have this problem. Anyway, this doesn't help:
|
I am running into this issue as well when facing any kind of redis reconnection. This is very painful since the service appears healthy but in fact never processes any jobs. Any workaround available? Thanks. |
@manast same happens when I am closing the queue and reopening it. New jobs get added, but they don't get processed - no wonder without a redis connection.
|
Hello @xairoo,
@anhnhatuit my test code above forces a blocking brpoplpush |
@xairoo I do not understand what you mean by "same happens when I am closing the queue and reopening it." What do you mean by closing and reopening? And can you or can't you reproduce the issue with the code I provided? If so, would you mind explaining step by step how to reproduce it? |
Hello @manast,
Thanks |
@anhnhatuit as far as I know, yes. |
Many thanks @manast. Thanks,
@manast yes I can reproduce it with your code and a remote redis server. I only changed this:
Anyway, this shouldn't matter. What I mean by closing and reconnecting:
I played around a bit and came to this: the listeners are dead (disconnected), I think that's the problem.
^^ not working, because the two redis connections required for this are missing. |
I'm going to attempt to sum up the conversation above. Bull gets stuck after the Redis server disconnects and then reconnects, and it affects many Bull users, since those disconnects happen from time to time in a cloud environment.
But this is no longer the case as of version 4 of Bull. Workaround for now: add these options explicitly (tested and confirmed by @manast, who is a maintainer of this repo). |
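The options themselves did not survive this export. Assuming they are the ioredis settings that Bull v4 later applied by default (an assumption on my part, not a quote from the original comment), the explicit form would look like this:

```js
// Reconstructed workaround options (assumed, not quoted from the
// original comment): tell ioredis to retry blocking commands forever
// and to skip the post-reconnect ready check.
const redisOptions = {
  maxRetriesPerRequest: null, // never fail an in-flight command while reconnecting
  enableReadyCheck: false     // don't gate commands on the "ready" check
}

// Passed when constructing the queue, e.g.:
//   const queue = new Queue('test', { redis: redisOptions })
console.log(redisOptions)
```

With `maxRetriesPerRequest` left at its default, the blocking brpoplpush that Bull parks on can error out during a reconnect and never be re-issued, which matches the "stuck in waiting" symptom described above. |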
🎉 This issue has been resolved in version 4.0.0 🎉 The release is available on: Your semantic-release bot 📦🚀 |
Doesn't work. If the connection gets dropped by redis TCP keepalive, the listeners are dead. I think the listeners should check the connection on an interval and reconnect if the connection was dropped. That would solve the problem. |
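One way to sketch that interval check. `startWatchdog`, `checkOnce`, and `restartProcessing` are hypothetical names, not a real Bull API; `client.ping()` is a real ioredis command:

```js
// Hypothetical watchdog sketch: ping the connection on an interval and
// trigger a restart of the processing loop when the ping fails.
// `restartProcessing` is a placeholder for whatever re-arms the
// blocking listeners.
async function checkOnce(client, restartProcessing) {
  try {
    await client.ping() // PING exists on ioredis clients
    return true
  } catch (err) {
    console.warn('redis ping failed:', err.message)
    restartProcessing()
    return false
  }
}

function startWatchdog(client, restartProcessing, intervalMs = 30000) {
  const timer = setInterval(() => checkOnce(client, restartProcessing), intervalMs)
  return () => clearInterval(timer) // call the returned function to stop
}
```

A dropped-keepalive connection is exactly the case a passive event listener never notices, whereas an active ping fails fast and gives you a hook to restart processing. |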
This solution doesn't work for me. The issue is still there |
According to ioredis documentation, with the settings provided this should not happen, and I am unable to reproduce it. |
@xairoo and @EPecherkin since it seems like you can reproduce the issue easily, can you open a new issue for your specific use case and steps on how to reproduce it so that I can look into it? |
Hello @manast, Thanks, |
Give this a try:
|
@manast I can confirm it works for me for the environment given here redis/ioredis#1285 (comment) when the ioredis gets reconnected. |
After a dozen tests, I found the problem was caused by the way our redis server was deployed. LOL. Our redis server was running on k8s with redis's stock config, so obviously there was no persistent data when the redis server restarted. I did not have time to dig into why losing redis's data causes the problem, but no problem happens at all once I moved the redis data dir onto a persistent volume. |
I'm still facing the same issue: the bull process stops consuming new messages upon a connection-reset error from redis. Setting these flags explicitly didn't help:
I'm using Bull version: It seems like this issue happens only when redis throws the connection reset error. Has anyone found a workaround for this, or do I need to update to another version of Bull? Kindly let me know as I'm really stuck, thanks. |
Description
When Redis gets disconnected and connected again, the queue doesn't pick up any jobs. (Easy to reproduce.)
Minimal, Working Test code to reproduce the issue.
Bull version
Bull 3.18.0
Node 12 and 14
Redis 3.2.8
Steps to reproduce.
Things we tried: adjusting the ioredis reconnect settings (https://github.com/luin/ioredis#auto-reconnect and https://github.com/luin/ioredis#reconnect-on-error), but the issue doesn't seem to be coming from ioredis.