Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HAproxy health-check causes node shutdown #54

Merged
merged 5 commits into from
Jan 31, 2014

Conversation

seancribbs
Copy link
Contributor

HAproxy uses a naïve health-check mechanism that simply opens a TCP socket and then closes it immediately after it is accepted by the server. This causes the call to peername/1 in riak_api_pb_server to return {error, enotconn} instead of the expected {ok, Peername}. The resulting badmatch propagates to the listener process (it was waiting on a reply via gen_fsm:sync_send_event), which then crashes the application supervisor after too many restarts.

Using this example config:

listen riak_pbc_demo_10 0.0.0.0:9109
    balance leastconn
    option tcplog
    mode tcp
    server 127.0.0.1 127.0.0.1:8087 check

@seancribbs
Copy link
Contributor Author

It's worth noting that this was not an issue on 1.4.x because we didn't have the security features that require checking the peername, so the process would simply shutdown normally at a time after the set_socket call.

This should address #54, where the socket is invalid even before the
server receives control of it.
@seancribbs
Copy link
Contributor Author

^^ @cipy

@roooms
Copy link

roooms commented Jan 31, 2014

I tested two haproxy configs and I only see this issue in pre11 when option nolinger [0] is used.

Saying that, having tested the configuration @cipy is using (including nolinger) the pre11 crashes as reported, and this branch stopped the crashing behaviour. So +1

[0] http://cbonte.github.io/haproxy-dconv/configuration-1.4.html#option%20nolinger

@seancribbs
Copy link
Contributor Author

Thanks @roooms. Can I get an Eng +1?

%% The call to set_socket should reply ok, but shutdown the
%% server, not crash and propagate back to the listener process.
?assertEqual(ok, set_socket(Server, ServerSocket)),
?assertNot(erlang:is_process_alive(Server)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks racy. How about setting a monitor on the server, and expect a 'DOWN' message within a given timeout instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

@engelsanchez
Copy link
Contributor

Quick note: if you were to run the eunit test on the old code, it is skipped because of way the crash happens, not marked as failed.


%% Pretend that we're a listener, listen on any port

{ok, Server} = start_link(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So probably a good idea to start without linking to the test process here to really verify what happens to it.

@engelsanchez
Copy link
Contributor

The test looks much better, and now fails with the old code, passes with the fixes. I did not test the HAProxy configuration myself, as Dan has already and know nothing about HAProxy. I hope between my 👍 and his it's enough.

💃 ⛵

@seancribbs
Copy link
Contributor Author

Thanks @engelsanchez @roooms @cipy

seancribbs added a commit that referenced this pull request Jan 31, 2014
HAproxy health-check causes node shutdown
@seancribbs seancribbs merged commit e2fbf84 into develop Jan 31, 2014
@seancribbs seancribbs deleted the bugfix/sdc/haproxy-enotconn branch January 31, 2014 22:18
@rzezeski rzezeski modified the milestones: 2.0-beta, 2.0 Mar 25, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants