This repository has been archived by the owner on May 13, 2019. It is now read-only.

Race condition in partition rebalance. #62

Closed
nemosupremo opened this issue Jul 5, 2015 · 0 comments · Fixed by #68

Comments

@nemosupremo
Contributor

(Moved from #61)

Actually, I was looking into this because I was having an issue where two of my nodes would stop accepting requests. I think this might be related: when my 9th node comes up, one node gives up all of its partitions, and another node tries to claim those partitions and fails.

It looks like this might be a data race.
Node A tries to grab partitions 16, 17, 18, and 19, and fails:

[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] Triggering rebalance due to consumer list change
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/14 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/15 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/12 :: Stopping partition consumer at offset 44
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user/13 :: Stopping partition consumer at offset 43
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/13
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/14
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/15
[Sarama] 2015/07/05 02:10:35 consumer/broker/40770 closed dead subscription to geard-user/12
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user :: Stopped topic consumer
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] Currently registered consumers: 9
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user :: Started topic consumer
[Sarama] 2015/07/05 02:10:35 [geard/bacc9b9f50bb] geard-user :: Claiming 4 of 32 partitions
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/16 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/17 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/18 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user/19 :: FAILED to claim the partition: Cannot claim partition: it is already claimed by another instance
[Sarama] 2015/07/05 02:10:36 [geard/bacc9b9f50bb] geard-user :: Stopped topic consumer
[Sarama] 2015/07/05 02:18:46 client/metadata fetching metadata for all topics from broker 10.129.196.48:9092

Node B lets go of 16, 17, 18, and 19, possibly after Node A tries to acquire them:

[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] Triggering rebalance due to consumer list change
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/16 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/17 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/18 :: Stopping partition consumer at offset -1
[Sarama] 2015/07/05 02:10:35 [geard/31c73a8faa4c] geard-user/19 :: Stopping partition consumer at offset 44
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/18
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/19
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/16
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 closed dead subscription to geard-user/17
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user :: Stopped topic consumer
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] Currently registered consumers: 9
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user :: Started topic consumer
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user :: Claiming 4 of 32 partitions
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/20 :: Partition consumer starting at offset 37.
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/21 :: Partition consumer starting at offset 50.
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 added subscription to geard-user/20
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/22 :: Partition consumer starting at offset 57.
[Sarama] 2015/07/05 02:10:36 [geard/31c73a8faa4c] geard-user/23 :: Partition consumer starting at offset 38.
[Sarama] 2015/07/05 02:10:36 consumer/broker/40770 added subscription to geard-user/21
[Sarama] 2015/07/05 02:10:37 consumer/broker/40770 added subscription to geard-user/23
[Sarama] 2015/07/05 02:10:37 consumer/broker/40770 added subscription to geard-user/22
[Sarama] 2015/07/05 02:18:42 client/metadata fetching metadata for all topics from broker 10.129.196.48:9092

The naive thing to do would be to sleep for a second in topicListConsumer(); however, using something other than Sleep to solve this race condition might be better. Unfortunately, I don't yet have a great understanding of how consumergroups work.

Or, retry claiming a set number of times?

wvanbergen added a commit that referenced this issue Aug 17, 2015
Retry claiming partitions if the partition is already claimed. See #62
caihua-yin pushed a commit to caihua-yin/kafka that referenced this issue Apr 19, 2016
This is a complementary fix for
wvanbergen#68
(issue: wvanbergen#62), to be used until the
re-implementation (wvanbergen#72) is ready.

In my use case, the message-consuming logic is sometimes slow;
even with the three retries added by the fix in pull #68, it's still easy
to hit issue #62. Further inspection of the current logic in
consumer_group.go:partitionConsumer() shows it may take
as long as cg.config.Offsets.ProcessingTimeout to ReleasePartition
so that the partition can be claimed by the new consumer during a rebalance.
So simply set the maximum retry time to
cg.config.Offsets.ProcessingTimeout, which is 60s by default.

Verified the system including this fix with frequent rebalance
operations; the issue does not occur again.
wvanbergen added a commit that referenced this issue Jun 15, 2016
Complementary fix of partition rebalance issue (#62)