Port of 1.4 AAE replication fixes to 2.0. #501
Merged
This prevents a crash when attempting to sync a particular partition when we can't figure out which node to connect to because of NAT.
Provided below is a summary of bugs fixed.

If the TCP connection errors out before full-sync completes, and max fssinks is not configured and therefore defaults to one, the aae_sink process terminates normally but does not propagate the error to the fssink process. This leaves an abandoned fssink process, which locks out any future fullsync reservations and stalls replication indefinitely on the sink cluster. If max fssinks is configured to a value other than one, TCP errors can deplete the available reservation slots over time. This problem was discovered with repl_aae_fullsync in riak_test and should be forward ported to the 2.0 release, along with the test.

- Gracefully handle the not_built message from the hashtree.
- Ensure that when aae_source dies, we handle the DOWN message and terminate the fssource process instead of ignoring the message.
- Do not throw the error, or send the complete message to the sink, when the process dies; neither is necessary.
- Ensure abnormal exits when the aae_source fails.
- Remove monitors; handle not_responsible as an independent message instead of a failure state.
- Bound the number of times the fullsync coordinator process will retry a partition after a failure. If the retry limit is exceeded for all queued partitions, end the fullsync.
- If an ownership transfer is occurring, ensure we bail and notify the fscoordinator with an abnormal exit, which will reschedule the partition.
- No longer attempt to acquire hashtree locks in the initialization function of a sink process. A failure there caused the connection manager to apply exponential backoff and also invalidated the endpoint, preventing other partitions on the same node from being synchronized.
- Ensure that the AAE source process exits if the socket closes.
- Ensure we reserve and unreserve partitions when we detect source processes exiting.
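For readers unfamiliar with the bounded-retry behavior mentioned in the summary above, here is a minimal sketch of the idea. It is not the code in this pull request: the names `run_fullsync`, `sync_partition`, and `max_retries` are hypothetical, and Python is used purely for illustration. A failed partition is requeued until it has used up its retries, and the run ends once the queue is empty.

```python
from collections import deque


def run_fullsync(partitions, sync_partition, max_retries=2):
    """Sync each partition, requeueing failures up to max_retries times.

    sync_partition is a hypothetical callable returning True on success
    and False on failure. Returns (completed, abandoned) partition lists.
    """
    # Each queue entry tracks how many retries the partition has used.
    queue = deque((p, 0) for p in partitions)
    completed, abandoned = [], []

    while queue:
        partition, retries = queue.popleft()
        if sync_partition(partition):
            completed.append(partition)
        elif retries < max_retries:
            # Reschedule at the back of the queue so other partitions
            # get a turn before this one is retried.
            queue.append((partition, retries + 1))
        else:
            # Retry budget exhausted: give up on this partition.
            abandoned.append(partition)

    # The queue is empty, so the fullsync ends even if some partitions
    # never succeeded, instead of retrying them forever.
    return completed, abandoned
```

With `max_retries=2`, a partition that always fails is attempted three times in total (one initial attempt plus two retries) before it is abandoned.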
…e coordinator retry them
Riak test shows this passes. Code is basically unchanged from 1.4, so life is 👍
cmeiklejohn added a commit that referenced this pull request Jan 6, 2014
Port of 1.4 AAE replication fixes to 2.0.
@lordnull @Vagabond