
Jobs stuck in delayed state indefinitely #1656

Closed
paweltatarczuk opened this issue Feb 5, 2023 · 20 comments

@paweltatarczuk

I have a queue with mostly delayed jobs which get stuck in the delayed state indefinitely. Restarting the worker solves the issue temporarily, but it recurs on a regular basis.

I was trying to replicate the issue, but the cause is unclear to me and I'm not sure how to proceed with investigating what is happening here.

From what I understand, everything stalls when the worker is waiting for a job and calls BRPOPLPUSH:

const jobId = await client.brpoplpush(
  this.keys.wait,
  this.keys.active,
  blockTimeout,
);

The client list shows there's a connection for this command:

id=387154 addr=[<redacted>]:44672 laddr=[<redacted>]:25060 fd=15 name= age=19775 idle=19775 flags=b db=0 sub=0 psub=0 ssub=0 multi=-1 qbuf=85 qbuf-free=0 argv-mem=105 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=2025 events=r cmd=brpoplpush user=default redir=-1 resp=2

There are no errors being emitted by the worker either. It looks like BRPOPLPUSH is not timing out properly, and unless a new job is added to the active/wait state or the worker is restarted, it never returns.
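
For reference, here is a minimal sketch of how the client list can be inspected from code to spot connections blocked on BRPOPLPUSH; the connection URL and the filtering logic below are just an illustration, not part of BullMQ:

import IORedis from "ioredis";

// Connect to the same Redis instance the worker uses (the URL is an example).
const redis = new IORedis(process.env.REDIS_CONNECTION ?? "redis://127.0.0.1:6379");

async function listBlockedBrpoplpushClients() {
  // CLIENT LIST returns one line per connection.
  const clientList = (await redis.call("CLIENT", "LIST")) as string;

  // Keep only connections currently blocked on BRPOPLPUSH and report how long
  // they have been idle; a very large idle value suggests a hung blocking call.
  return clientList
    .split("\n")
    .filter((line) => line.includes("cmd=brpoplpush"))
    .map((line) => ({
      name: /name=(\S*)/.exec(line)?.[1],
      idleSeconds: Number(/idle=(\d+)/.exec(line)?.[1] ?? 0),
    }));
}

listBlockedBrpoplpushClients()
  .then(console.log)
  .finally(() => redis.disconnect());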

Is there anything I can do to prevent this from happening or to find the root cause of this issue? I'm happy to work on a PR once it's clear what is causing this behavior.


Redis version: 7.0.5 (DigitalOcean Managed Redis)
BullMQ version: 3.5.11

@manast
Contributor

manast commented Feb 5, 2023

I have tracked down a possible cause of this issue to this other issue: redis/ioredis#1718
A fix to work around this issue (by default, at least): #1654
Something failed in the release script, so it has not been released yet...

@manast
Contributor

manast commented Feb 5, 2023

Ok, I checked now and it seems that a new release has been performed. Version 3.6.2 should solve this issue, but only if you do not use a custom connection retry strategy that retries very quickly, within 50-100 ms or so.

@paweltatarczuk
Author

I'll upgrade to version 3.6.2 and post an update here if it fixes the issue, which is most probably the case! Thank you!

@simas-b

simas-b commented Feb 6, 2023

> Ok, I checked now and it seems that a new release has been performed. Version 3.6.2 should solve this issue, but only if you do not use a custom connection retry strategy that retries very quickly, within 50-100 ms or so.

I am experiencing an issue that seems related. In my case, jobs don't get processed in time after a temporary disconnection from Redis. I have made a demo repo to demonstrate the bug. I have updated bullmq to 3.6.2, but the problem is still there: https://github.com/butavicius/bullmq-test

@manast
Contributor

manast commented Feb 6, 2023

@butavicius In your repo it seems like jobs are processed after reconnection. If this is the case, then it is not the same issue.

@manast
Contributor

manast commented Feb 6, 2023

@butavicius I have tried your repo and I cannot reproduce the issue. I have not used a 5-minute repetition rate, as that takes too long, but 5 seconds works as it should. Is it only reproducible with 5-minute repetitions? Btw, can you create a separate issue for this?

@paweltatarczuk
Author

@manast Unfortunately 3.6.2 did not solve the issue, so this must be something else. I'll try to reproduce it locally and share a repo.

@manast
Contributor

manast commented Feb 6, 2023

@paweltatarczuk Strange, because what you describe in the issue matches the symptoms of that issue 100%, i.e. BRPOPLPUSH hanging forever even though the worker has correctly reconnected.

@paweltatarczuk
Author

I'll investigate this in more depth in the upcoming days and get back with, I hope, new information. The worst part is that it only happens in production, randomly, after a couple of hours, so reproducing it properly is difficult.

@simas-b

simas-b commented Feb 6, 2023

> @butavicius In your repo it seems like jobs are processed after reconnection. If this is the case, then it is not the same issue.

You are right, my issue seems to be separate. Leaving a link to it here in case someone wants to follow it: #1658

@paweltatarczuk
Author

paweltatarczuk commented Feb 8, 2023

This happened again, but this time I double-checked whether there is an active connection for the queue, and there is not. I can see multiple brpoplpush entries when I check the client list, but there is none for the queue that is stalled. I'm looking at the clients' names here, assuming the name should be bull:<base64 encoded name>. I can see clients for other queues but not for the stalled one.

Does it mean that the worker is not running? I have a health check in place that checks worker.isRunning(), but it's not failing. If that's the case, could it be the reconnection issue?
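
In code, that check would look roughly like the sketch below (the bull:<base64 encoded name> naming and the helper function are illustrative assumptions on my part, not a documented BullMQ API):

import IORedis from "ioredis";

// Hypothetical helper: report whether any connection is blocked on BRPOPLPUSH
// for the given queue, assuming clients are named "bull:<base64(queueName)>".
async function hasBlockedClientForQueue(redis: IORedis, queueName: string): Promise<boolean> {
  const expectedName = `bull:${Buffer.from(queueName).toString("base64")}`;
  const clientList = (await redis.call("CLIENT", "LIST")) as string;

  return clientList
    .split("\n")
    .some((line) => line.includes(`name=${expectedName}`) && line.includes("cmd=brpoplpush"));
}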

@manast
Contributor

manast commented Feb 8, 2023

@paweltatarczuk what settings are you using for your Worker connections?

@manast
Contributor

manast commented Feb 8, 2023

There is a new chapter in the documentation that has some information regarding reconnections which could be useful to know in these cases: https://docs.bullmq.io/guide/going-to-production#automatic-reconnections

@paweltatarczuk
Author

I'm using IORedis directly to set up a "shared" connection with maxRetriesPerRequest set to null, as stated in the docs. Here's some sample code based on my application:

import IORedis from "ioredis";
import { Queue, Worker } from "bullmq";

// Job
const job = {
  name: "Some job name",
  async run() {
    // ... 
  },
};

// Connection
const connection = new IORedis(
  process.env.REDIS_CONNECTION ?? "redis://127.0.0.1:6379",
  {
    maxRetriesPerRequest: null,
  }
);

// Queue
const queue = new Queue(job.name, {
  connection,
  defaultJobOptions: {
    attempts: Number.MAX_SAFE_INTEGER,
    backoff: {
      type: "exponential",
      delay: 1000,
    },
  },
});

// Worker
const worker = new Worker(
  job.name,
  async () => {
    return await job.run();
  },
  {
    connection,
    autorun: false,
  }
);

worker.on("error", () => {
  // Capture errors with Sentry
});

worker.on("failed", () => {
  // Capture failures with Sentry
});

// Bootstrap
void worker.run();
await worker.waitUntilReady();

// Health-check
setInterval(() => {
  if (!worker.isRunning()) {
    // Terminate application
  }
}, 1000);

@paweltatarczuk
Author

paweltatarczuk commented Feb 10, 2023

Here's a repository that reproduces the issue:
https://github.com/paweltatarczuk/bullmq-1656

It freezes every time I run it:

Running job: Test 1
connect ECONNREFUSED 127.0.0.1:56379
connect ECONNREFUSED 127.0.0.1:56379
connect ECONNREFUSED 127.0.0.1:56379
(the same ECONNREFUSED line repeated 24 times in total)

In my case the worker freezes right after the first job and never picks up the second one, which it should after 10 seconds. Now that I'm able to replicate this issue, I'll try to investigate it more deeply over the weekend.

@manast
Contributor

manast commented Feb 10, 2023

Thanks for taking the time to provide easy-to-reproduce test code. What I could see is that if you switch to this slower retry strategy, it works properly:

const connection = new IORedis(
  process.env.REDIS_CONNECTION ?? "redis://127.0.0.1:6379",
  {
    maxRetriesPerRequest: null,
    retryStrategy: (times) => {
      return 2000;
    }
  }
);

So it seems like the default retryStrategy provided by ioredis already suffers from this issue, which is not good at all: redis/ioredis#1718
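
For context, ioredis's default retryStrategy backs off roughly like the sketch below (paraphrased from the ioredis README), which means the first reconnection attempts happen within tens of milliseconds, i.e. exactly the fast-retry range mentioned earlier:

// Approximation of ioredis's default retryStrategy (see the ioredis README):
// the delay grows by 50 ms per attempt and is capped at 2000 ms.
const defaultRetryStrategy = (times: number): number => Math.min(times * 50, 2000);

// The first few reconnection delays: 50 ms, 100 ms, 150 ms, 200 ms.
console.log([1, 2, 3, 4].map(defaultRetryStrategy)); // [ 50, 100, 150, 200 ]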

@manast
Contributor

manast commented Feb 10, 2023

I think I can create a workaround for this kind of issue, such as making sure the retry time is always larger than 2 seconds. It may not work at all times, though, since we do not yet know the underlying cause of the issue: it may be a hazard that manifests itself easily with fast retries, but it could also manifest itself in other situations even with a larger retry time.
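
A rough sketch of what such a workaround could look like, wrapping whatever retryStrategy the user supplies; the helper name and the 2-second floor are illustrative, not an actual BullMQ API:

type RetryStrategy = (times: number) => number | void | null;

const MIN_RETRY_DELAY_MS = 2000; // illustrative floor, per the comment above

// Hypothetical wrapper: enforce a minimum delay between reconnection attempts,
// regardless of what the user-provided strategy returns.
function withMinimumRetryDelay(strategy: RetryStrategy): RetryStrategy {
  return (times: number) => {
    const delay = strategy(times);
    // If the strategy returns something other than a number, ioredis stops
    // reconnecting; pass that through unchanged.
    if (typeof delay !== "number") {
      return delay;
    }
    return Math.max(delay, MIN_RETRY_DELAY_MS);
  };
}

// Usage sketch: wrap a (too) fast custom strategy before handing it to IORedis.
// new IORedis(url, { maxRetriesPerRequest: null, retryStrategy: withMinimumRetryDelay((t) => t * 50) });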

@paweltatarczuk
Author

I can confirm that adjusting the retry strategy "resolves" the issue. I'll keep an eye on redis/ioredis#1718 then.

@paweltatarczuk
Author

Turns out the retryStrategy workaround wasn't enough; I experienced the same issue over the weekend. Yesterday I updated ioredis to 5.3.1, and there have been no issues since then. This looks promising!

@manast
Contributor

manast commented Feb 19, 2023

Closing this now since the issue has been resolved in ioredis version 5.3.1.

@manast manast closed this as completed Feb 19, 2023