Bug in combination with HAproxy on FreeBSD #460
Comments
I added a few more details on Versions and the content of the haproxy file. |
Thanks for the updates! Bandit 1.5.8 never existed; do you mean to say 1.5.7? Are you saying that HAProxy 2.8 & Bandit 1.5.(7?) works as expected, and all other combinations do not? It would be really useful to know the smallest possible update to either / both libraries that causes the problem to occur; if possible, could you try to bisect for a smaller version change that reproduces the issue? A fair bit of keepalive logic changed in Bandit around then, which is why I'm keen to try and narrow it down further. |
Hi and thanks @mtrudel for the reply. Sorry, that was a typo. Correction:
Yes, I will try to narrow it down to a smaller version change, but I probably won't have the time to do so before the weekend. I will give an update here! |
Okay, I had some time and tried to get back to a state in which Bandit was actually working in this setup. The surprising outcome was:
I downgraded through every version of Bandit down to 1.0.0. In no combination was I able to get rid of the behavior, which is still: the Phoenix application works totally fine after a deployment or a restart of the Elixir release via
I am at my wits' end as to why this is happening, and I am starting to doubt whether it ever actually worked. My git history says I switched from Cowboy to Bandit on October 2nd and then had no issues for months; the app was working totally fine. It is probably some super weird combination of network stack, OS, Erlang and Elixir that leaves no traces... I guess I will stop investigating for now and periodically update everything... maybe this hiccup will magically resolve itself in the future...? Or, @mtrudel, are there any configuration options for Bandit I could play around with? |
At least I have now found a possible hint: I assume that is because HAProxy, sitting in front, terminates connections.
It also only appears when navigating to a LiveView with a socket connection and then back to a plain GET page request
|
While trying to get more logging for my debugging task, I noticed the following error:
It's still in the documentation, but was it maybe removed? UPDATE... well, actually it's still in the keys... |
UPDATE: More logging
With more verbose error logging, the above error looks like this:
Also, the Process.info output for the Bandit process looks normal to me:
[
current_function: {:gen_server, :loop, 7},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 0,
links: [#PID<0.1931.0>, #PID<0.1932.0>, #PID<0.2234.0>, #PID<0.1925.0>],
dictionary: [
"$initial_call": {:supervisor, ThousandIsland.Server, 1},
"$ancestors": [PlaygroundWeb.Endpoint, Playground.Supervisor,
#PID<0.1895.0>]
],
trap_exit: true,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.1894.0>,
total_heap_size: 6771,
heap_size: 2586,
stack_size: 11,
reductions: 9485,
garbage_collection: [
max_heap_size: %{
error_logger: true,
include_shared_binaries: false,
kill: true,
size: 0
},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 2
],
suspending: []
]
Also interesting: when I manually kill the Bandit process in the running application (in the state where Bandit has stopped working), the supervisor restarts Bandit as expected and Bandit immediately serves again... |
Hi @mtrudel, sorry to bother you so much in this issue, but I finally found the cause! I can reproduce the behavior (Phoenix app with Bandit being unresponsive after about 1h of uptime). So it's not specifically a Bandit problem; it's an HAProxy thing. At least, the cause is a configuration setting in HAProxy.
The Setup
The described behavior can be reproduced with the following haproxy config (simplified version):
Two things are worth mentioning here:
So what happens...
To see what's going on between HAProxy and 192.168.0.10:8080, I checked with tcpdump on the interface, and this is what we get (an excerpt of a few seconds):
19:44:16.929709 IP 192.168.0.0.37370 > 192.168.0.10.http-alt: Flags [S], seq 2975014247, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 1107760135 ecr 0], length 0
19:44:16.929726 IP 192.168.0.10.http-alt > 192.168.0.0.37370: Flags [S.], seq 3149284940, ack 2975014248, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3581736569 ecr 1107760135], length 0
19:44:16.929736 IP 192.168.0.0.37370 > 192.168.0.10.http-alt: Flags [.], ack 1, win 1277, options [nop,nop,TS val 1107760135 ecr 3581736569], length 0
19:44:16.929764 IP 192.168.0.0.37370 > 192.168.0.10.http-alt: Flags [R.], seq 1, ack 1, win 0, options [nop,nop,TS val 1107760135 ecr 3581736569], length 0
19:44:18.944042 IP 192.168.0.0.25977 > 192.168.0.10.http-alt: Flags [S], seq 3733863170, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 534740319 ecr 0], length 0
19:44:18.944127 IP 192.168.0.10.http-alt > 192.168.0.0.25977: Flags [S.], seq 3111848782, ack 3733863171, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 2072667181 ecr 534740319], length 0
19:44:18.944154 IP 192.168.0.0.25977 > 192.168.0.10.http-alt: Flags [.], ack 1, win 1277, options [nop,nop,TS val 534740319 ecr 2072667181], length 0
19:44:18.944190 IP 192.168.0.0.25977 > 192.168.0.10.http-alt: Flags [R.], seq 1, ack 1, win 0, options [nop,nop,TS val 534740319 ecr 2072667181], length 0
19:44:20.965536 IP 192.168.0.0.52806 > 192.168.0.10.http-alt: Flags [S], seq 1840030346, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3195020104 ecr 0], length 0
19:44:20.965554 IP 192.168.0.10.http-alt > 192.168.0.0.52806: Flags [S.], seq 2846094430, ack 1840030347, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3078994868 ecr 3195020104], length 0
19:44:20.965566 IP 192.168.0.0.52806 > 192.168.0.10.http-alt: Flags [.], ack 1, win 1277, options [nop,nop,TS val 3195020104 ecr 3078994868], length 0
19:44:20.965605 IP 192.168.0.0.52806 > 192.168.0.10.http-alt: Flags [R.], seq 1, ack 1, win 0, options [nop,nop,TS val 3195020104 ecr 3078994868], length 0
So what we see is:
And this happens because HAProxy checks whether the server is alive over and over, every couple of seconds...
My interpretation
I am not totally sure who introduces the misunderstanding, HAProxy or Bandit...
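The connect-then-reset pattern in the capture can be imitated without HAProxy. The following is a minimal Python sketch, not HAProxy's actual implementation. It assumes that closing a socket with SO_LINGER enabled and a zero timeout makes the kernel send an RST instead of a FIN, which is the usual behavior on Linux and the BSDs. The function name `tcp_check` and its default timeout are made up for illustration.

```python
import socket
import struct


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Imitate a bare TCP health check: connect, then abort with an RST.

    Returns True if the three-way handshake completed, False otherwise.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except OSError:
        # Refused, unreachable, or timed out: the check failed.
        sock.close()
        return False
    # l_onoff=1, l_linger=0: close() drops the connection immediately and
    # sends RST rather than going through the normal FIN handshake.
    # (Assumption: POSIX-style struct linger layout of two ints.)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()
    return True
```

Calling `tcp_check("192.168.0.10", 8080)` in a loop against the backend from the capture should produce the same SYN, SYN-ACK, ACK, RST sequence in tcpdump, which may help when trying to reproduce the problem without the proxy in between.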
How to fix it
Besides choosing a more reasonable 50000 for global
So it's like this now:
Now everything works fine: the server has been up for over 24 hours and is still working as desired... As I assume, that the TCP
Probably this also fixes the issue, and connections are properly closed again as it's now
UPDATE on changing health checks to http => Yes, this also works properly. No connection overflow anymore when activating
What I do not get.
... phew, I am glad that I found the cause in the end. Cheers, |
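By contrast with a bare TCP check, an HTTP-level check completes a full request/response exchange and then closes the connection in the normal way. A rough Python sketch of such a check follows. It only illustrates the idea and is not what HAProxy does internally. The function name `http_check`, the `GET /` request, and the rule that any 2xx or 3xx status counts as healthy are assumptions made for the example.

```python
import http.client


def http_check(host: str, port: int, path: str = "/", timeout: float = 2.0) -> bool:
    """Send one HTTP request and report whether the server answered 2xx/3xx."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body so the exchange finishes cleanly
        return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return False
    finally:
        conn.close()  # ordinary close, no forced reset
```

Because the server sees a complete request here, it gets to handle and close the connection through its regular code path, instead of being handed a socket that has already been reset.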
This is some great detective work @marschro! I've got a repro case set up locally based on your info; I don't see any obvious solution (mostly because all of this happens before Bandit/Thousand Island is even involved), but I'll keep digging. |
Got it. The issue originated in how Thousand Island acceptors handled invalid sockets (which, for whatever reason, is how the runtime surfaces HAProxy TCP checks). This actually may explain a few other longstanding heisenbugs as well. Fix coming to Thousand Island in a few minutes. |
Bump your Thousand Island dep to >= 1.3.12 and you should be golden! |
Thanks again for all the detective work! |
Hi all,
I am facing an issue with Bandit used in a Phoenix app in combination with a FreeBSD jail and HAProxy.
I originally started investigating this issue in the elixir forum - starting Jan. 25.
USED VERSIONS
OS: FreeBSD RELEASE-14.2
HaProxy: 3.0.x (LTS)
Elixir: 1.17.3
Phoenix: 1.7.19
Erlang: Erlang/OTP 26 [erts-14.2.5.5] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit:ns] [dtrace] [sharing-preserving] [amd64-portbld-freebsd14.1]
The Elixir app is an unchanged, fresh
mix phx.new my_app
The HAProxy config is pretty simple:
DESCRIPTION
phx.new
- Bandit is used as the default web server, without further dependencies or changes
RELEASE-14.2
Failed to connect to 192.168.0.10 port 8080 after 0 ms: Could not connect to server
OBSERVATIONS
:observer_cli
tells me that everything is fine. No blocked queues and no significant memory consumption. There are also no changes in the data between before and after the app becomes unresponsive. So the app is healthy.
What's really interesting is that the app is reachable for roughly 60 minutes and then stops serving.
I was able to narrow it down to what is probably an issue between the latest Bandit and HAProxy: if I disconnect the app from HAProxy (that is, deactivate the HAProxy frontend and backend), the app works fine again. I only tested this with curl, though, without the whole socket machinery that might also be an issue.
WORKAROUND AND FIXES
If there are any further details I could provide, I am happy to help.
I also started to investigate the Bandit codebase, but that domain is currently a bit beyond my knowledge...
Kind regards,
Martin