Port of 1.4 AAE replication fixes to 2.0. #501

Merged
cmeiklejohn merged 9 commits into develop from feature/csm/aae-repl-develop
Jan 6, 2014

cmeiklejohn

cmeiklejohn and others added 9 commits January 2, 2014 16:54
This prevents a crash when attempting to sync a particular partition
when we can't figure out who to connect to because of NAT.
Provided below is a summary of bugs fixed.

If the TCP connection errors before fullsync completes, and the maximum
number of fssinks is left unconfigured and defaults to one, the aae_sink
process terminates normally without propagating the error to the fssink
process.  The abandoned fssink process then locks out all future
reservations for fullsync, stalling replication indefinitely on the
sink cluster.

If the max fssink setting is configured to a value other than one, TCP
errors can deplete the available reservation slots over time.
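
The slot-depletion failure described above can be illustrated with a small sketch. This is not the riak_repl code (which is Erlang); `ReservationPool`, `run_sink`, and `failing_sync` are hypothetical names, and the sketch only assumes that a sink takes one reservation slot per sync attempt. The point is that the slot is released on every exit path, including a connection error.

```python
class ReservationPool:
    """Hypothetical stand-in for the sink-side fullsync reservation slots."""

    def __init__(self, max_sinks=1):
        self.max_sinks = max_sinks
        self.in_use = 0

    def reserve(self):
        # Refuse the reservation when every slot is already taken.
        if self.in_use >= self.max_sinks:
            return False
        self.in_use += 1
        return True

    def release(self):
        self.in_use = max(0, self.in_use - 1)


def run_sink(pool, sync):
    """Run one sync attempt and always give the slot back afterwards."""
    if not pool.reserve():
        return "busy"
    try:
        sync()
        return "ok"
    except ConnectionError:
        # The error is reported to the caller instead of being swallowed.
        return "error"
    finally:
        # Released on success and on failure, so slots cannot leak.
        pool.release()


def failing_sync():
    raise ConnectionError("tcp closed before fullsync completed")
```

Without the `finally` clause, each failed attempt would leave `in_use` incremented, and a pool with `max_sinks=1` would refuse every later reservation after a single error.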

This problem was discovered with repl_aae_fullsync in riak_test and
should be forward ported to the 2.0 release in addition to the test.

Gracefully handle the not_built message from the hashtree.

Ensure that when aae_source dies, we handle the DOWN message and
terminate the fssource process instead of ignoring the message.

Do not throw the error, or send the complete message to the sink, when
the process dies; neither is necessary.

Ensure abnormal exits when the aae_source fails.

Remove monitors, handle not_responsible as an independent message instead
of a failure state.

Provide a bound on the number of times a partition will be retried in
the event of failure through the fullsync coordinator process.  If we
exceed the number of retries for all queued partitions, end the
fullsync.
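
A minimal sketch of the bounded-retry idea, assuming a simple queue of partitions and a per-partition retry counter. The names `run_fullsync`, `sync_partition`, and `max_retries` are hypothetical and do not come from the riak_repl coordinator, which is written in Erlang and is structured differently.

```python
from collections import deque


def run_fullsync(partitions, sync_partition, max_retries=3):
    """Sync each partition, requeueing failures at most max_retries times.

    sync_partition is a hypothetical callback returning True on success.
    Returns (completed, abandoned) lists of partition ids.
    """
    queue = deque((p, 0) for p in partitions)
    completed, abandoned = [], []
    while queue:
        partition, retries = queue.popleft()
        if sync_partition(partition):
            completed.append(partition)
        elif retries < max_retries:
            # Put the partition at the back of the queue for another attempt.
            queue.append((partition, retries + 1))
        else:
            # Retry budget exhausted: give up on this partition.
            abandoned.append(partition)
    # The queue is empty once every partition has succeeded or run out
    # of retries, at which point the fullsync ends.
    return completed, abandoned
```

Because every failure either consumes one retry or removes the partition from the queue, the loop always terminates, even if a partition never succeeds.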

If ownership transfer is occurring, ensure we bail and notify the
fscoordinator with an abnormal exit, which will reschedule the
partition.

No longer attempt to gain locks on hashtrees during the initialization
function of a sink process.  A failure there would cause the connection
manager to use an exponential backoff and would invalidate the
endpoint, preventing other partitions on the same node from being
synchronized.

Ensure that the AAE source process exits if the socket closes.

Ensure we reserve and unreserve partitions when we detect source
processes exiting.

lordnull commented Jan 6, 2014

Riak test shows this passes. Code is basically unchanged from 1.4, so life is 👍

cmeiklejohn added a commit that referenced this pull request Jan 6, 2014
Port of 1.4 AAE replication fixes to 2.0.
@cmeiklejohn cmeiklejohn merged commit 03d6508 into develop Jan 6, 2014
@seancribbs seancribbs deleted the feature/csm/aae-repl-develop branch April 1, 2015 23:47
3 participants