HAproxy health-check causes node shutdown #54

seancribbs · 2014-01-27T21:25:03Z

HAproxy uses a naïve health-check mechanism that simply opens a TCP socket and then closes it immediately after it is accepted by the server. This causes the call to peername/1 in riak_api_pb_server to return {error, enotconn} instead of the expected {ok, Peername}. The resulting badmatch propagates to the listener process (it was waiting on a reply via gen_fsm:sync_send_event), which then crashes the application supervisor after too many restarts.

The problem can be reproduced with this example config:

listen riak_pbc_demo_10 0.0.0.0:9109
    balance leastconn
    option tcplog
    mode tcp
    server 127.0.0.1 127.0.0.1:8087 check
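
A minimal sketch of the failure mode described above, and one tolerant way to handle it. This is not the riak_api code: the module name `peername_sketch` and the helper `check_peer/1` are hypothetical, and it assumes the peer lookup ultimately goes through `inet:peername/1`. Instead of asserting `{ok, Peername}` and dying with a badmatch, the helper returns any error as a value the caller can use to stop quietly.

```erlang
-module(peername_sketch).
-export([check_peer/1, demo/0]).

%% Hypothetical helper: look up the peer address and turn any error
%% (enotconn, einval, ...) into a {stop, Reason} value, rather than
%% pattern-matching on {ok, Peername} and crashing with a badmatch.
check_peer(Socket) ->
    case inet:peername(Socket) of
        {ok, Peername} -> {ok, Peername};
        {error, Reason} -> {stop, Reason}
    end.

%% Loopback demonstration. Closing our own end of the accepted socket
%% stands in for the health check here, because whether a peer reset
%% shows up as enotconn depends on timing and platform.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false}]),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port,
                                   [binary, {active, false}]),
    {ok, Server} = gen_tcp:accept(Listen, 1000),
    %% A live connection still has a peer.
    {ok, _Peer} = check_peer(Server),
    ok = gen_tcp:close(Server),
    %% An unusable socket is reported as a value, not as a crash.
    {stop, _Reason} = check_peer(Server),
    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Listen),
    ok.
```

Calling `peername_sketch:demo()` in a shell exercises both branches. The exact error reason is deliberately left unmatched, since it can differ between a reset connection and a locally closed socket.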

seancribbs · 2014-01-27T17:16:42Z

It's worth noting that this was not an issue on 1.4.x because we didn't have the security features that require checking the peername, so the process would simply shut down normally some time after the set_socket call.

This should address #54, where the socket is invalid even before the server receives control of it.

seancribbs · 2014-01-27T21:25:33Z

^^ @cipy

roooms · 2014-01-31T13:33:31Z

I tested two haproxy configs and I only see this issue in pre11 when option nolinger [0] is used.

That said, with the configuration @cipy is using (including nolinger), pre11 crashes as reported, and this branch stops the crashing behaviour. So +1.

[0] http://cbonte.github.io/haproxy-dconv/configuration-1.4.html#option%20nolinger

seancribbs · 2014-01-31T15:06:21Z

Thanks @roooms. Can I get an Eng +1?

engelsanchez · 2014-01-31T15:15:02Z

src/riak_api_pb_server.erl

+    %% The call to set_socket should reply ok, but shutdown the
+    %% server, not crash and propagate back to the listener process.
+    ?assertEqual(ok, set_socket(Server, ServerSocket)),
+    ?assertNot(erlang:is_process_alive(Server)),


This looks racy. How about setting a monitor on the server, and expect a 'DOWN' message within a given timeout instead?
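
A minimal sketch of the monitor-based check suggested here, under stated assumptions: the module name `monitor_sketch` and the helper `await_down/2` are hypothetical and not part of the test in this pull request. It waits for the 'DOWN' message with a timeout instead of polling `erlang:is_process_alive/1` once.

```erlang
-module(monitor_sketch).
-export([await_down/2, demo/0]).

%% Hypothetical helper: wait until Pid exits, or give up after Timeout
%% milliseconds. If Pid is already dead when the monitor is created, the
%% runtime delivers a 'DOWN' message with reason noproc straight away, so
%% the check does not depend on when the process happened to exit.
await_down(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {down, Reason}
    after Timeout ->
        %% Drop the monitor and any late 'DOWN' message.
        erlang:demonitor(Ref, [flush]),
        timeout
    end.

%% Demonstration with a stand-in process. spawn/1 is used rather than
%% spawn_link/1 so that the stand-in's exit cannot affect the caller.
demo() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% Still running: the wait times out.
    timeout = await_down(Pid, 50),
    Pid ! stop,
    %% Now it exits; the reason is normal, or noproc if it was already gone.
    {down, _Reason} = await_down(Pid, 1000),
    ok.
```

In a test, an assertion such as `?assertMatch({down, _}, await_down(Server, 1000))` could take the place of the `is_process_alive/1` check; the one-second timeout is an arbitrary choice.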

engelsanchez · 2014-01-31T15:39:12Z

Quick note: if you were to run the eunit test on the old code, it is skipped because of the way the crash happens, not marked as failed.

engelsanchez · 2014-01-31T15:45:12Z

src/riak_api_pb_server.erl

+
+    %% Pretend that we're a listener, listen on any port
+
+    {ok, Server} = start_link(),


So it's probably a good idea to start the server without linking it to the test process here, to really verify what happens to it.

engelsanchez · 2014-01-31T21:27:26Z

The test looks much better, and now fails with the old code and passes with the fixes. I did not test the HAProxy configuration myself, as Dan already has and I know nothing about HAProxy. I hope that between my 👍 and his, it's enough.

💃 ⛵

seancribbs · 2014-01-31T22:18:53Z

Thanks @engelsanchez @roooms @cipy

Address premature socket-close issue in server wait_for_socket state.

2bdd2a5

seancribbs added 3 commits January 27, 2014 15:35

Whitespace cleanup.

2138d1d

Fix a few dialyzer bugs in riak_api_pb_server.

1d094d1

Narrow that type.

f806d5e

engelsanchez reviewed Jan 31, 2014
Make the closed-socket test more robust.

31003ec

seancribbs added a commit that referenced this pull request Jan 31, 2014

Merge pull request #54 from basho/bugfix/sdc/haproxy-enotconn

e2fbf84

seancribbs merged commit e2fbf84 into develop Jan 31, 2014

seancribbs deleted the bugfix/sdc/haproxy-enotconn branch January 31, 2014 22:18

rzezeski modified the milestones: 2.0-beta, 2.0 Mar 25, 2014
