Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve on 5 min AAE+ownership change stall #971

Merged
merged 1 commit into from
Jun 4, 2014

Conversation

engelsanchez
Copy link
Contributor

This is a proposed improvement to address #846

The hard coded default has been changed from 5 minutes to 1 minute.
Also, usage of the FSM timeout mechanism that could lead to the FSM
ending up stuck forever has been changed to gen_fsm:send_event_after

The hard coded default has been changed from 5 minutes to 1 minute.
Also, usage of the FSM timeout mechanism that could lead to the FSM
ending up stuck forever has been changed to gen_fsm:send_event_after
@cmeiklejohn
Copy link
Contributor

I'll take review of this one.

@cmeiklejohn cmeiklejohn self-assigned this Jun 4, 2014
@jtuple jtuple assigned jtuple and unassigned cmeiklejohn Jun 4, 2014
@jtuple
Copy link
Contributor

jtuple commented Jun 4, 2014

+1 95b076e

I think we should consider making this timeout configurable (advanced.config only). However, a 1 minute hardcoded timeout seems entirely safe, and is a significant improvement over the 5 minute timeout.

I agree with the changes to using explicit timers rather state transition timeouts. I'm certain the way AAE is written/used today there is no issue, but inactivity timeouts are extremely brittle and easy for us to unknowingly break in future unrelated changes. So, +1 on that change.

We don't have a test for this specific condition, and right now isn't the time to invest in it. However, we should consider extending existing AAE tests to cover the timeout case in the future.

In any case, I ran the existing set of AAE riak_tests against this change, as well as manually tested the timeout by suspending several riak_kv_index_hashtree processes using erlang:suspend_process. The timeout works as expected, and shows up after 1 minute. So, all good.

I think this is good enough for now, at least for this current pull-request. Although, adding an explicit test, making the timeout configurable, and properly fixing the underlying issue behind #846 are all reasonable future work items.

borshop added a commit that referenced this pull request Jun 4, 2014
Improve on 5 min AAE+ownership change stall

Reviewed-by: jtuple
@engelsanchez
Copy link
Contributor Author

Thanks Joe. Notice that it is already configurable and I'm just changing the hard coded default: riak_kv,anti_entropy_timeout.

@engelsanchez
Copy link
Contributor Author

@borshop merge

@borshop borshop merged commit 95b076e into develop Jun 4, 2014
@seancribbs seancribbs deleted the bugfix/aae-ownership-5-min-stall branch April 1, 2015 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants