Port of 1.4 AAE replication fixes to 2.0. #501

cmeiklejohn · 2014-01-02T22:04:06Z

This prevents a crash when attempting to sync a particular partition and we can't figure out who to connect to because of natting.

Summary of bugs fixed:

- If the TCP connection errors before full-sync completes, and max fssinks is not configured and so defaults to one, the aae_sink process terminates normally but does not propagate the error to the fssink process. This leaves an abandoned fssink process, which locks out any future reservations for fullsync and stalls replication indefinitely on the sink cluster.
- If the max fssink setting is configured to a value other than one, TCP errors can deplete the available reservation slots over time.

This problem was discovered with repl_aae_fullsync in riak_test and should be forward ported to the 2.0 release, along with the test.

- Gracefully handle the not_built message from the hashtree.
- Ensure that when aae_source dies, we handle the DOWN message and terminate the fssource process instead of ignoring the message.
- Do not throw the error, or send the complete message to the sink, when the process dies; neither is necessary.
- Ensure abnormal exits when the aae_source fails.
- Remove monitors; handle not_responsible as an independent message instead of a failure state.
- Provide a bound on the number of times the fullsync coordinator process will retry a partition in the event of failure. If we exceed the number of retries for all queued partitions, end the fullsync.
- If ownership transfer is occurring, ensure we bail and notify the fscoordinator with an abnormal exit, which will reschedule the partition.
- No longer attempt to gain locks on hashtrees during the initialization function of a sink process. On failure, this caused the connection manager to use an exponential backoff and also invalidated the endpoint, preventing other partitions on the same node from being synchronized.
- Ensure that the AAE source process exits if the socket closes.
- Ensure we reserve and unreserve partitions when we detect source processes exiting.
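For readers unfamiliar with the coordinator, the retry bound described above can be modeled roughly as follows. This is a minimal Python sketch, not the riak_repl Erlang code: `run_fullsync`, `sync_partition`, and `retry_limit` are hypothetical names chosen for illustration. The only details taken from this PR are that a failed partition is rescheduled, that retries are bounded, and that the default limit is infinity.

```python
import math
from collections import deque


def run_fullsync(partitions, sync_partition, retry_limit=math.inf):
    """Model of a coordinator loop with a per-partition retry bound.

    sync_partition(p) is a hypothetical callback that returns True when
    partition p synced successfully and False when it failed.
    """
    queue = deque(partitions)
    retries = {p: 0 for p in partitions}
    completed, abandoned = [], []

    while queue:
        p = queue.popleft()
        if sync_partition(p):
            completed.append(p)
        elif retries[p] < retry_limit:
            # Failure with retry budget left: reschedule at the back.
            retries[p] += 1
            queue.append(p)
        else:
            # Retry budget exhausted: stop rescheduling this partition.
            abandoned.append(p)

    # The queue is empty once every partition has either synced or
    # exhausted its retries, which is where the fullsync ends.
    return completed, abandoned
```

With the default `retry_limit` of infinity, a partition that never succeeds is rescheduled forever in this model, so a finite limit is needed for the loop to terminate when a partition is permanently failing.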


lordnull · 2014-01-06T20:34:24Z

Riak test shows this passes. Code is basically unchanged from 1.4, so life is 👍

Port of 1.4 AAE replication fixes to 2.0.

cmeiklejohn and others added 9 commits January 2, 2014 16:54

Return error when we can't find IP locally.

9de0157

This prevents a crash when attempting to sync a particular partition and we can't figure out who to connect to because of natting.

Fix unreserve functionality.

4c92c85

AAE source should not try to handle already_locked partitions, let the coordinator retry them

41daf38

Set the default fscoordinator retry limit to infinity

ce63aa4

Add comment.

9a29a2e

Remove comments.

c6cc006

Fix match.

156a8e3

Add comment about handle_call.

a18464f

cmeiklejohn added a commit that referenced this pull request Jan 6, 2014

Merge pull request #501 from basho/feature/csm/aae-repl-develop

03d6508

Port of 1.4 AAE replication fixes to 2.0.

cmeiklejohn merged commit 03d6508 into develop Jan 6, 2014

seancribbs deleted the feature/csm/aae-repl-develop branch April 1, 2015 23:47
