Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

while_executing has problems at low concurrency #384

Closed
danpolyuha opened this issue Apr 16, 2019 · 18 comments
Closed

while_executing has problems at low concurrency #384

danpolyuha opened this issue Apr 16, 2019 · 18 comments
Assignees
Labels

Comments

@danpolyuha
Copy link

danpolyuha commented Apr 16, 2019

Describe the bug
When lock type while_executing is used with lock_timeout: nil, sometimes the system fails to lock the jobs properly and jobs are being executed in parallel.

Expected behavior
The jobs never get executed in parallel.

Worker class

class Job
  include Sidekiq::Worker

  sidekiq_options lock: :while_executing,
                  lock_timeout: nil,
                  on_conflict: :reschedule

  def perform
    puts 'hello'
    sleep 1
    puts 'bye'
  end
end

Additional context
Hi Mikael,
Thank you very much for your amazing work on this gem.

I believe I might have run into an issue here and would like to verify it with you.
I'm running the given worker using something like this:

25.times { Job.perform_async }

Now, since the job must be uniq while executing, "hello" and "bye" must always follow each other. But here is a part of what I get in sidekiq output:

2019-04-16T15:19:43.196Z 50070 TID-ovffchvmm Job JID-af6bca983c0fbcf5b22a585f INFO: start
2019-04-16T15:19:43.196Z 50070 TID-ovffchuy2 Job JID-f0af723412d0dc5c9371d0d1 INFO: start
2019-04-16T15:19:43.196Z 50070 TID-ovffchtn6 Job JID-442ba288ae6a074c84a85f50 INFO: start
hello
bye
2019-04-16T15:19:44.206Z 50070 TID-ovffchtwa Job JID-502c09235b292f57b7ecc712 INFO: done: 1.015 sec
2019-04-16T15:19:44.207Z 50070 TID-ovffchtwa Job JID-e255e583fa30d86b7e225d3d INFO: start
hello
hello
bye
bye
2019-04-16T15:19:45.214Z 50070 TID-ovffchuhu Job JID-6969ccaaac1b67c91c8a53da SidekiqUniqueJobs::Server::Middleware DIG-uniquejobs:4e839ff6799d5d8ba6a9c5d6129c4fe2 WARN: might need to be unlocked manually
2019-04-16T15:19:45.214Z 50070 TID-ovffchuhu Job JID-6969ccaaac1b67c91c8a53da INFO: done: 2.018 sec
2019-04-16T15:19:45.215Z 50070 TID-ovffchuhu Job JID-091344c663f8b08d84179cb1 INFO: start
2019-04-16T15:19:45.215Z 50070 TID-ovffchv06 Job JID-e17fd36a4e76b825bd3094fa INFO: done: 2.022 sec
2019-04-16T15:19:45.216Z 50070 TID-ovffchv06 Job JID-5b2776d26e44520af510c545 INFO: start
hello
hello
bye
bye
2019-04-16T15:19:46.221Z 50070 TID-ovffchuq6 Job JID-1380eb73bc55cd2b7003d7cb INFO: done: 3.028 sec
2019-04-16T15:19:46.222Z 50070 TID-ovffchvuu Job JID-be5ed8dcf06dbca9f21ccc5c INFO: done: 3.028 sec

As you can see, jobs are being executed in parallel from time to time.
I'm wondering, is this a bug or have I misconfigured/misunderstood something?

If it is a bug, I believe it must be the scenario when thread 1 checks whether Redis contains the lock and at this moment (before thread 1 writes the digest into Redis) thread 2 starts execution and checks the same.

Please note that everything works properly with lock_tomeout: 0.

@mhenrixon
Copy link
Owner

@danpolyuha thank you so much for the detailed report! I will try and create some failing test cases for this scenario.

@mhenrixon
Copy link
Owner

Using the latest released version I can't replicate the problem @danpolyuha.

Running 2 sidekiq worker processes with 25 workers each it seems to produce the expected results. Can you check your gem version, possibly upgrade if not on latest and either close the issue or provide some more details on your setup?

This is what my log looks like

8:39:31 PM worker.1 |  2019-04-21T18:39:31.443Z 85626 TID-oxx1j8312 Issue384Worker JID-fb16bb069a1ac614f349f579 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.444Z 85626 TID-oxx1j844i Issue384Worker JID-bb7f852535a5de2ad4f7323c INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.453Z 85626 TID-oxx1j8206 Issue384Worker JID-39cf048a1570f2dd5b69fb9f INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.453Z 85626 TID-oxx1j838m Issue384Worker JID-d92b24a20608104bbb7265c6 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.455Z 85626 TID-oxx1n36g6 Issue384Worker JID-77e5c0a0d73a89a2ca72a924 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.454Z 85626 TID-oxx1j81oq Issue384Worker JID-30e9b2186308d33eff62fbf8 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.455Z 85626 TID-oxx1j81fm Issue384Worker JID-48b7c86ebe6bdb5002de66e2 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.456Z 85626 TID-oxx1j82b6 Issue384Worker JID-78d5d0c7c67f30f4cda36d63 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.456Z 85626 TID-oxx1k7rpi Issue384Worker JID-d9008915ce39762a72a00c3f INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.459Z 85626 TID-oxx1j82w2 Issue384Worker JID-496232cba0d2ace4f9f96962 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.457Z 85626 TID-oxx1j83l6 Issue384Worker JID-11bccfd91ee775f1a26f9d3b INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.460Z 85626 TID-oxx1j849y Issue384Worker JID-278ed207e917bf1727c724b5 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.458Z 85626 TID-oxx1j82ji Issue384Worker JID-74b080c2c3848ed2bd0933c8 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.464Z 85626 TID-oxx1j84hu Issue384Worker JID-0da6474a6cb7a0cd1b094854 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.465Z 85626 TID-oxx1j83xq Issue384Worker JID-053a5309995d66fa784ee62a INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.465Z 85626 TID-oxx1j81ny Issue384Worker JID-bfa805439e5bae7e99ea0803 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.459Z 85626 TID-oxx1j82nq Issue384Worker JID-1a34bbf1324e6f02c4918a26 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.464Z 85626 TID-oxx1khfj6 Issue384Worker JID-364784f2cdb8af3f3fc12f7c INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.464Z 85626 TID-oxx1hwm1e Issue384Worker JID-230008abbb966714264a4434 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.464Z 85626 TID-oxx1j83da Issue384Worker JID-ae331a5db380f0624fe767c7 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.464Z 85626 TID-oxx2cbuje Issue384Worker JID-75cad948e82c45a2a61ebe6f INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.456Z 85626 TID-oxx1ph4jq Issue384Worker JID-8efb64d7c6e90a52f7dd4da7 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.460Z 85626 TID-oxx2cbtyi Issue384Worker JID-3f2b00f963dae2f4e2eb8cd0 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.465Z 85626 TID-oxx1j82o6 Issue384Worker JID-e7569d31cd8fbef7bf177679 INFO: start
8:39:31 PM worker.1 |  2019-04-21T18:39:31.460Z 85626 TID-oxx1j81yy Issue384Worker JID-c425197d66be155263993ae1 INFO: start
8:39:31 PM worker.1 |  hello
8:39:32 PM worker.1 |  bye
8:39:32 PM worker.1 |  2019-04-21T18:39:32.727Z 85626 TID-oxx1j849y Issue384Worker JID-278ed207e917bf1727c724b5 INFO: done: 1.267 sec
8:39:32 PM worker.1 |  hello
8:39:33 PM worker.1 |  bye
8:39:33 PM worker.1 |  2019-04-21T18:39:33.729Z 85626 TID-oxx1j81oq Issue384Worker JID-30e9b2186308d33eff62fbf8 INFO: done: 2.276 sec
8:39:33 PM worker.1 |  hello
8:39:34 PM worker.1 |  bye
8:39:34 PM worker.1 |  2019-04-21T18:39:34.734Z 85626 TID-oxx1j83xq Issue384Worker JID-053a5309995d66fa784ee62a INFO: done: 3.269 sec
8:39:34 PM worker.1 |  hello
8:39:35 PM worker.1 |  bye
8:39:35 PM worker.1 |  2019-04-21T18:39:35.741Z 85626 TID-oxx2cbtyi Issue384Worker JID-3f2b00f963dae2f4e2eb8cd0 INFO: done: 4.282 sec
8:39:35 PM worker.1 |  hello
8:39:36 PM worker.1 |  bye
8:39:36 PM worker.1 |  2019-04-21T18:39:36.746Z 85626 TID-oxx1j82b6 Issue384Worker JID-78d5d0c7c67f30f4cda36d63 INFO: done: 5.291 sec
8:39:36 PM worker.1 |  hello
8:39:37 PM worker.1 |  bye
8:39:37 PM worker.1 |  2019-04-21T18:39:37.754Z 85626 TID-oxx1j81yy Issue384Worker JID-c425197d66be155263993ae1 INFO: done: 6.294 sec
8:39:37 PM worker.1 |  hello
8:39:38 PM worker.1 |  bye
8:39:38 PM worker.1 |  2019-04-21T18:39:38.758Z 85626 TID-oxx1j8206 Issue384Worker JID-39cf048a1570f2dd5b69fb9f INFO: done: 7.305 sec
8:39:38 PM worker.1 |  hello
8:39:39 PM worker.1 |  bye
8:39:39 PM worker.1 |  2019-04-21T18:39:39.765Z 85626 TID-oxx1hwm1e Issue384Worker JID-230008abbb966714264a4434 INFO: done: 8.3 sec
8:39:39 PM worker.1 |  hello
8:39:40 PM worker.1 |  bye
8:39:40 PM worker.1 |  2019-04-21T18:39:40.772Z 85626 TID-oxx1j81fm Issue384Worker JID-48b7c86ebe6bdb5002de66e2 INFO: done: 9.317 sec
8:39:40 PM worker.1 |  hello
8:39:41 PM worker.1 |  bye
8:39:41 PM worker.1 |  2019-04-21T18:39:41.777Z 85626 TID-oxx1n36g6 Issue384Worker JID-77e5c0a0d73a89a2ca72a924 INFO: done: 10.322 sec
8:39:41 PM worker.1 |  hello
8:39:42 PM worker.1 |  bye
8:39:42 PM worker.1 |  2019-04-21T18:39:42.784Z 85626 TID-oxx1j838m Issue384Worker JID-d92b24a20608104bbb7265c6 INFO: done: 11.33 sec
8:39:42 PM worker.1 |  hello
8:39:43 PM worker.1 |  bye
8:39:43 PM worker.1 |  2019-04-21T18:39:43.792Z 85626 TID-oxx1j82o6 Issue384Worker JID-e7569d31cd8fbef7bf177679 INFO: done: 12.327 sec
8:39:43 PM worker.1 |  hello
8:39:44 PM worker.1 |  bye
8:39:44 PM worker.1 |  2019-04-21T18:39:44.798Z 85626 TID-oxx1k7rpi Issue384Worker JID-d9008915ce39762a72a00c3f INFO: done: 13.342 sec
8:39:44 PM worker.1 |  hello
8:39:45 PM worker.1 |  bye
8:39:45 PM worker.1 |  2019-04-21T18:39:45.801Z 85626 TID-oxx1j83l6 Issue384Worker JID-11bccfd91ee775f1a26f9d3b INFO: done: 14.344 sec
8:39:45 PM worker.1 |  hello
8:39:46 PM worker.1 |  bye
8:39:46 PM worker.1 |  2019-04-21T18:39:46.808Z 85626 TID-oxx1ph4jq Issue384Worker JID-8efb64d7c6e90a52f7dd4da7 INFO: done: 15.352 sec
8:39:46 PM worker.1 |  hello
8:39:47 PM worker.1 |  bye
8:39:47 PM worker.1 |  2019-04-21T18:39:47.816Z 85626 TID-oxx1j83da Issue384Worker JID-ae331a5db380f0624fe767c7 INFO: done: 16.352 sec
8:39:47 PM worker.1 |  hello
8:39:48 PM worker.1 |  bye
8:39:48 PM worker.1 |  2019-04-21T18:39:48.819Z 85626 TID-oxx1j82nq Issue384Worker JID-1a34bbf1324e6f02c4918a26 INFO: done: 17.36 sec
8:39:48 PM worker.1 |  hello
8:39:49 PM worker.1 |  bye
8:39:49 PM worker.1 |  2019-04-21T18:39:49.824Z 85626 TID-oxx1j8312 Issue384Worker JID-fb16bb069a1ac614f349f579 INFO: done: 18.381 sec
8:39:49 PM worker.1 |  hello
8:39:50 PM worker.1 |  bye
8:39:50 PM worker.1 |  2019-04-21T18:39:50.829Z 85626 TID-oxx2cbuje Issue384Worker JID-75cad948e82c45a2a61ebe6f INFO: done: 19.364 sec
8:39:50 PM worker.1 |  hello
8:39:51 PM worker.1 |  bye
8:39:51 PM worker.1 |  2019-04-21T18:39:51.832Z 85626 TID-oxx1j82w2 Issue384Worker JID-496232cba0d2ace4f9f96962 INFO: done: 20.373 sec
8:39:51 PM worker.1 |  hello
8:39:52 PM worker.1 |  bye
8:39:52 PM worker.1 |  2019-04-21T18:39:52.835Z 85626 TID-oxx1j81ny Issue384Worker JID-bfa805439e5bae7e99ea0803 INFO: done: 21.37 sec
8:39:52 PM worker.1 |  hello
8:39:53 PM worker.1 |  bye
8:39:53 PM worker.1 |  2019-04-21T18:39:53.842Z 85626 TID-oxx1khfj6 Issue384Worker JID-364784f2cdb8af3f3fc12f7c INFO: done: 22.378 sec
8:39:53 PM worker.1 |  hello
8:39:54 PM worker.1 |  bye
8:39:54 PM worker.1 |  2019-04-21T18:39:54.847Z 85626 TID-oxx1j84hu Issue384Worker JID-0da6474a6cb7a0cd1b094854 INFO: done: 23.384 sec
8:39:54 PM worker.1 |  hello
8:39:55 PM worker.1 |  bye
8:39:55 PM worker.1 |  2019-04-21T18:39:55.854Z 85626 TID-oxx1j82ji Issue384Worker JID-74b080c2c3848ed2bd0933c8 INFO: done: 24.396 sec
8:39:55 PM worker.1 |  hello
8:39:56 PM worker.1 |  bye
8:39:56 PM worker.1 |  2019-04-21T18:39:56.860Z 85626 TID-oxx1j844i Issue384Worker JID-bb7f852535a5de2ad4f7323c INFO: done: 25.416 sec

@danpolyuha
Copy link
Author

danpolyuha commented Apr 22, 2019

@mhenrixon ,
I'm using the latest version of the gem (6.0.13), Sidekiq (5.2.6), and Redis (5.0.4).
The issue is still reproducible.
For the cleanness of the experiment, can you try to run a single Sidekiq process like

sidekiq -r ./job.rb

and check if everything still works fine on your side?
If it does, I can provide you with any details about my setup you need, just tell me what can be of help.

If we still can't reproduce it on your side, can you please describe very briefly the logic behind locks in while_executing? I tried to dive into your gem's code, but Lua scripts are rather complicated and I'd got lost. Maybe I can find the reason locally if I know where to look.

@mhenrixon
Copy link
Owner

mhenrixon commented Apr 22, 2019

@danpolyhua if you coul check the configuration I tested with and let me know what is different that would be awesome:

@mhenrixon
Copy link
Owner

What operating system and redis server version are you on? It could be that your redis server version has trouble with some Lua script feature maybe?

@mhenrixon
Copy link
Owner

I've tried with a various number of sleeps but they all yield the expected results

@danpolyuha
Copy link
Author

@mhenrixon ,
I'm using the latest version of the gem (6.0.13), Sidekiq (5.2.6), and Redis (5.0.4) on MacOS 10.14.4.

I see that you're testing everything in Rails. Can you please try to run it in vanilla environment with nothing but Sidekiq and sidekiq-unique-jobs?

I have a folder with only two files:
job.rb

require 'sidekiq'
require 'sidekiq-unique-jobs'

class Job
  include Sidekiq::Worker

  sidekiq_options lock: :while_executing,
                  lock_timeout: nil,
                  on_conflict: :reschedule

  def perform
    puts 'hello'
    sleep 1
    puts 'bye'
  end
end

go.rb

require_relative 'job'

25.times { Job.perform_async }

Then I'm running Sidekiq:
sidekiq -r ./job.rb
and Ruby file:
ruby ./go.rb

Maybe your sidekiq.yml or sidekiq.rb are somehow messing with it, it's hard to say.

@mhenrixon
Copy link
Owner

It seems that by default sidekiq has concurrency 10 and I have it set to 25 in my sidekiq.yml if I lower it to 10 I can definitely replicate the problem.

Thank you for the investigation.

@mhenrixon mhenrixon changed the title Lock "while_executing" might have problems with "lock_timeout: nil" while_executing has problems at low concurrency Apr 23, 2019
@mhenrixon mhenrixon self-assigned this Apr 23, 2019
@mhenrixon mhenrixon added the bug label Apr 23, 2019
@danpolyuha
Copy link
Author

Great news.
Please let me know if I can be of any help.

@mhenrixon
Copy link
Owner

Do you have any ideas on how to replicate this in a test?

@danpolyuha
Copy link
Author

Not really, unfortunately.

@danpolyuha
Copy link
Author

I believe it can be somehow replicated on the level of unit tests close to Lua scripts, but I don't have enough understanding about how your system works to do it.

@mhenrixon
Copy link
Owner

Fascinating... I actually started writing tests against the Lua scripts yesterday

@meenie
Copy link

meenie commented May 7, 2019

I'm experiencing the same thing. I believe it's directly tied to the Sidekiq concurrency option set and how many jobs you add to the queue all at once. I set the concurrency to 10 and added 10 jobs, it was fine. If I add 11 or more, it causes the issue described.

@mhenrixon
Copy link
Owner

Which is the opposite of what I would expect, c'est la vie. Anyway, I'm working on fixing this problem once and for all.

@danpolyuha
Copy link
Author

Hi @mhenrixon,
I hate bothering you, but are you still working on this?
Thanks!

@mhenrixon
Copy link
Owner

I sure am @danpolyuha. I believe it has been fixed already but the fix will be in V7 which is close to ready but not completely done yet. If you are feeling adventurous you could give the master branch a try and see if that fixes your problem.

@mhenrixon
Copy link
Owner

@danpolyuha there has been numerous releases of v7.0.0.beta(n) I suggest you give it a go and open new issues for any problems you encounter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants