Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logic to clear cancelled connection requests #598

Merged
merged 4 commits into from
Jun 5, 2014

Conversation

bsparrow435
Copy link
Contributor

This branch adds logic to clear cancelled connections from the
riak_core_connection_manager Pending state. I have also added a few
helper functions to give a better view into request
states(get_request_states) and connection
errors(get_connection_errrors).

Connections are cancelled only when a user runs riak-repl disconnect NAME/[IP,PORT]. I’ve added logic to this path to call
remove_cancelled_connections after waiting cm_cancellation_interval.
This interval must either be an integer or the atom never.

This logic only makes one attempt to remove the cancelled connection.
It might be nice to have a helper function that we can call to manually
clear the connections but currently this can be done by re-trying
riak-repl disconnect.

The riak_test showing this issue is replication2_connections.erl and
the issue can be demonstrated with this code by setting the
cm_cancellation_interval to never.

This branch adds logic to clear cancelled connections from the
riak_core_connection_manager Pending state. I have also added a few
helper functions to give a better view into request
states(get_request_states) and connection
errors(get_connection_errrors).

Connections are cancelled only when a user runs `riak-repl disconnect
NAME/[IP,PORT]`. I’ve added logic to this path to  call
remove_cancelled_connections after waiting cm_cancellation_interval.
This interval must either be an integer or the atom never.

This logic only makes one attempt to remove the cancelled connection.
It might be nice to have a helper function that we can call to manually
clear the connections but currently this can be done by re-trying
`riak-repl disconnect`.

The riak_test showing this issue is `replication2_connections.erl` and
the issue can be demonstrated with this code by setting the
`cm_cancellation_interval` to `never`.
@cmeiklejohn
Copy link
Contributor

Dialyzer is returning the following problems:

riak_core_connection_mgr.erl:541: The pattern 'ok' can never match the type {'error',atom()}

@cmeiklejohn cmeiklejohn added this to the 2.0-RC milestone Jun 2, 2014
@cmeiklejohn cmeiklejohn self-assigned this Jun 2, 2014
@bsparrow435
Copy link
Contributor Author

Oh god, the connection_helper, i hate this stupid function. It's supposed to start and then when finished hit the 'EXIT' handle_info in the riak_core_connection_mgr to iterate stats and do some other logic. However, i've seen it totally skip this when riak_core_connection:sync_connect returns an {error, Reason} which is one reason the stats module is useless. This helper function needs to be totally reworked.

With that said, the call at that line is riak_core_connection:sync_connect who's spec is -spec sync_connect(ip_addr(), clientspec()) -> ok | {error, atom()}.

I don't see the issue here.

@cmeiklejohn
Copy link
Contributor

Test for this behavior: basho/riak_test#626

@cmeiklejohn
Copy link
Contributor

@bsparrow435 How are we looking here?

%% The helper process will discover the cancellation when it asks if it
%% should connect to an endpoint.
State#state{pending = lists:keystore(Req#req.ref, #req.ref, Pending,
Req#req{state = cancelled})}
end.

%% remove pending connection from state for supplied Ref.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing @doc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the rest of the module, most functions are missing @doc and type specifications. I'll add these to my new functions but would it be ok to go ahead and add the rest in this PR?

Changes the atom for canceling cleared connections from `never` to
`undefined`. Cleaned up logic and added `@doc` to one new function.
@bsparrow435
Copy link
Contributor Author

Just pushed commit addressing comments. Working on associated riak_test now.

%% The helper process will discover the cancellation when it asks if it
%% should connect to an endpoint.
State#state{pending = lists:keystore(Req#req.ref, #req.ref, Pending,
Req#req{state = cancelled})}
end.

%% @doc remove pending connection from state for supplied Ref.
remove_cancelled_connection(Ref, State = #state{pending = Pending}) ->
case lists:keytake(Ref, #req.ref, Pending) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see how this will ever match, given that the second parameter is supposed to be an integer. Does this grab the default value from the record?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because a few lines above we did the same thing in lists:keystore. My understanding is that #req.ref will return an integer representing it's place in the record. I'll try testing that now to confirm.

Looking at the rest of the module and other modules in riak_repl this is fairly common in all lists:key* operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just confirmed this behavior with the following immediately below line 431 of this file:

lager:notice("BRIAN RETURN: ~p",[#req.ref]),

Result:

2014-06-04 18:50:57.294 [notice] <0.2467.0>@riak_core_connection_mgr:disconnect_from_target:431 BRIAN RETURN: 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, very interesting. Thanks for taking the time to explain why it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal now, but that is definitely something I've never seen before. It's clever, but I think too much so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went to the docs, from the bottom of the Records page: http://www.erlang.org/doc/reference_manual/records.html

In addition, #Record.Name returns the index in the tuple representation of Name of the record Record. Name must be an atom.

So this is an explicit, by design behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the reason this works is because people were tired of defining indexes for elements in a record.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know it's by design. I was arguing it was obtuse, given two Erlang programmers who have been doing it for 2+ years didn't know what it was doing, so we were making an argument to just put the integer value in (or macro).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, and I know this is code you inherited, @bsparrow435, so not asking for it to be changed now. Might as well be consistent within the module until we fix the whole mess.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just leave it as is and we'll address all of it later.

@cmeiklejohn
Copy link
Contributor

I'm still concerned given this code introduced a new Dialyzer error which is not on master which should be fixed prior to merge. Otherwise, once the Dialyzer warning is fixed, and the test refactored, we're good to go here.

@bsparrow435
Copy link
Contributor Author

I just found an issue where, if connected/disconnected in the correct order within 5 minutes, the endpoint(ex. "127.0.0.1":"10046") will be blacklisted forever. I need to add some logic to reset the retry_backoff. Will work on it tonight.

@bsparrow435
Copy link
Contributor Author

Also will be adding another riak_test to show issue

@bsparrow435
Copy link
Contributor Author

False alarm, i wasn't waiting long enough for the endpoint backoff to finish. Still, going to add one more step to the riak_test to prove reconnection is possible as well as adding logic to reset the backoff timer when the cancelled connection is removed.

@bsparrow435
Copy link
Contributor Author

Dialyzer issue is the last open item:

riak_core_connection_mgr.erl:546: The pattern 'ok' can never match the type {'error',atom()}

Myself and @cmeiklejohn are looking into it now.

@cmeiklejohn
Copy link
Contributor

@reiddraper found it; warning was is in the ignore file, but line number changed. Just push a commit to update the line number.

@bsparrow435
Copy link
Contributor Author

Nice find, thanks.

@cmeiklejohn
Copy link
Contributor

Pushed d61ff7e with updated line numbers.

@cmeiklejohn
Copy link
Contributor

👍 d61ff7e

borshop added a commit that referenced this pull request Jun 5, 2014
…ions

Add logic to clear cancelled connection requests

Reviewed-by: cmeiklejohn
@cmeiklejohn
Copy link
Contributor

@borshop merge

@borshop borshop merged commit d61ff7e into develop Jun 5, 2014
@cmeiklejohn cmeiklejohn deleted the bugfix/bjs/clear-cancelled-connections branch June 5, 2014 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants