
Bug in combination with HAproxy on FreeBSD #460

Closed · marschro opened this issue Jan 22, 2025 · 15 comments · Fixed by mtrudel/thousand_island#162

marschro commented Jan 22, 2025

Hi all,

I'm facing an issue with Bandit used in a Phoenix app, in combination with a FreeBSD jail and HAProxy.

I originally started investigating this issue in the Elixir forum, starting Jan 25.

USED VERSIONS

  • OS: FreeBSD RELEASE-14.2

  • HaProxy: 3.0.x (LTS)

  • Elixir: 1.17.3

  • Phoenix: 1.7.19

  • Erlang: Erlang/OTP 26 [erts-14.2.5.5] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [jit:ns] [dtrace] [sharing-preserving] [amd64-portbld-freebsd14.1]

  • The Elixir app is a fresh, unchanged mix phx.new my_app

  • The HAProxy config is pretty simple:

 global
   maxconn 1000
   user haproxy_user
   group haproxy_group
   daemon

defaults
   mode http
   timeout connect 5s
   timeout client 50s
   timeout server 50s
   timeout tunnel 50s

frontend SSL-Offloading
   bind :80
   bind :443 ssl crt valid/path/to/cert alpn h2,http/1.1

   # Redirect all non-ssl requests to secure http protocol
   # redirect scheme https if !{ ssl_fc }
   http-request redirect scheme https code 301 unless { ssl_fc }

   # Redirect requests to domain-root to www subdomain
   http-request redirect code 301 location https://www.%[hdr(host)]%[capture.req.uri] if { hdr(host) -i domain_one.com }

   use_backend www.domain_one.com if { hdr(host) -i www.domain_one.com }
 
backend www.domain_one.com
   balance roundrobin
   server domain_one 192.168.0.1:443 check

DESCRIPTION

  • I create a new Phoenix app with phx.new - Bandit is used as the default web server - without further dependencies or changes
  • I deploy the app as an Elixir release in a jail of a FreeBSD OS with version RELEASE-14.2
  • I use HAProxy 3.0.x (LTS) in front of the jail for SSL termination and routing
  • When starting the Elixir app, the app is perfectly reachable and responds to requests correctly
  • After pretty much exactly 60 minutes the app cannot be reached anymore, and HAProxy correctly responds with 503 since the app no longer responds on its port
  • curl-ing the app on its IP and port from the host also results in Failed to connect to 192.168.0.10 port 8080 after 0 ms: Could not connect to server
  • There is also no connection from within the jail on the loopback interface
  • So I guess I can exclude upstream issues; it seems to be the app server
  • While the app is fine within the first 60 minutes, it is perfectly reachable with curl via IP and port from the host, and within the jail via the loopback interface 127.0.0.1 (a quick loopback check from inside the release is sketched below)
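A quick way to run that loopback check from a remote IEx session inside the release itself - a minimal sketch, with the port taken from the description above:

# Remote IEx session inside the release; port 8080 as configured for the app.
case :gen_tcp.connect({127, 0, 0, 1}, 8080, [:binary, active: false], 5_000) do
  {:ok, socket} ->
    :ok = :gen_tcp.close(socket)
    IO.puts("listener accepted the connection")

  {:error, reason} ->
    IO.puts("connect failed: #{inspect(reason)}")
end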

OBSERVATIONS

  • Changing the port of the app does not fix the problem
  • Downgrading to HAProxy 2.8 does not fix the problem
  • None of the logs show any hint of what's going on - neither the app log, the Erlang log, the jail log, nor the network stack and OS logs
  • I can remote-login into the Elixir app and it's up and running without issues, but it is not serving on the port
  • Checking the app's health with :observer_cli tells me that everything is fine: no blocked queues, no significant memory consumption, and no changes in the data from before the app becomes unresponsive to after. So the app is healthy.
  • The OS and jail say that beam.smp still owns the port and there are no issues in the network stack. There is only one socket LISTENING on the correct port; the rest are correctly CLOSED, nothing pending, idle, or unfinished.
  • The issue arose after updating to HAProxy 3.0 and the latest Bandit (1.6.5) - downgrading back to Bandit 1.5.8 does not solve the issue
  • Using Cowboy instead of Bandit fixes the issue; the app is up, running, and responsive.
  • I had no problems with Bandit before upgrading to HAProxy 3.0 (before, I had 2.8 in combination with Bandit 1.5.7)
  • But downgrading HAProxy alone to 2.8 while keeping the latest Bandit does not fix it, so I assume there is a slight issue in Bandit in combination with HAProxy

What's really interesting is that the app is reachable for roughly 60 minutes and then stops serving.
I was able to narrow it down to what is probably an issue between the latest Bandit and HAProxy, because if I disconnect the app from HAProxy - i.e. deactivate the HAProxy frontend and backend - the app works fine again (only tested with curl, though - not including the whole socket magic that might also be an issue).

WORKAROUND AND FIXES

  • Replace Bandit with Cowboy

If there are any further details I can provide, I am happy to help.
I also started investigating the Bandit codebase, but that domain is currently a bit above my knowledge...

Kind regards,
Martin

@marschro
Author

I added a few more details on the versions and the content of the HAProxy config file.

@marschro marschro changed the title from "Maybe bug in combination with HAproxy" to "Bug in combination with HAproxy on FreeBSD" on Feb 25, 2025

mtrudel commented Feb 25, 2025

Thanks for the updates!

Bandit 1.5.8 never existed; do you mean to say 1.5.7?

Are you saying that HAProxy 2.8 & Bandit 1.5.(7?) works as expected, and all other combinations do not? It would be really useful to know the smallest possible update to either / both libraries that causes the problem to occur; if possible, could you try to bisect for a smaller version change that reproduces the issue?

A fair bit of keepalive logic changed in Bandit around then, which is why I'm keen to try and narrow it down further.

@marschro
Author

Hi and thanks @mtrudel for the reply.

Sorry... correction, that was a typo.

  • Yes, I meant Bandit 1.5.7 and have edited the issue above.
  • And yes, with the combination of HAProxy 2.8 and Bandit 1.5.7 I did not have an issue. All worked perfectly fine.

Yes, I will try to narrow it down to a smaller version change, but I probably won't have time to do so before the weekend.

I will give an update here!


marschro commented Mar 11, 2025

Okay, I had some time and tried to get back to a state where Bandit was actually working in this setup.

And the surprising outcome was:

  • I cannot reproduce a case where Bandit is working in this setup.

I downgraded through all versions of Bandit down to 1.0.0.
I also downgraded the OS from 14.2 to 14.1, along with the jail versions.
I also downgraded HAProxy to 2.9 and then to 2.8, and tried those in combination with Bandit versions lower than the current one.

In no combination was I able to get rid of the behavior, which still is: the Phoenix application works totally fine after deployment or a restart of the Elixir release, but stops serving after pretty much exactly 120 minutes (2 hours) without traffic or any further interaction with the release or the app - which is annoying, as I always had to wait at least 2 hours to prove whether the combination of Bandit, HAProxy, and OS version actually changed anything.

I am at my wits' end as to why this is happening and am starting to doubt whether it ever actually worked - my git history says I switched from Cowboy to Bandit on October 2nd, and then I didn't have any issues for months. The app was working totally fine.

I am out of ideas; it's probably some super weird combination of network stack, OS, Erlang, Elixir... that leaves no traces...

I guess I'll stop investigating this for now and periodically update everything... maybe this hiccup will magically solve itself in the future...?

Or @mtrudel, are there any configuration options for Bandit to play around with?
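For reference, this is where such options would go in my setup - a minimal sketch with placeholder names and illustrative values; the exact set of available keys depends on the Bandit version:

# config/runtime.exs - placeholder names, illustrative values only
config :my_app, MyAppWeb.Endpoint,
  adapter: Bandit.PhoenixAdapter,
  http: [
    ip: {0, 0, 0, 0},
    port: 8080,
    # forwarded to the Thousand Island listener
    thousand_island_options: [num_acceptors: 100, read_timeout: 60_000],
    # HTTP-level options; available keys depend on the Bandit version
    http_options: [compress: true]
  ]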


marschro commented Mar 13, 2025

At least I have now found what may be a hint:
I ran the app in production with start_iex,
and I regularly get these messages: [error] ** (Bandit.HTTPError) closed

I assume that is because HAProxy in front terminates connections.
But this should not occur before 50 seconds - yet the error is logged roughly 5 to 8 seconds after a socket is lost - because HAProxy is configured with:

defaults
    mode http
    timeout connect 50s
    timeout client 50s
    timeout server 50s
    timeout tunnel 50s

It also only appears when navigating to a LiveView with a socket connection and then back to a plain GET page request:

[info] CONNECTED TO Phoenix.LiveView.Socket in 124µs
  Transport: :longpoll
  Serializer: Phoenix.Socket.V2.JSONSerializer
  Parameters: %{"_csrf_token" => "[replaced_token]", "_live_referer" => "undefined", "_mount_attempts" => "0", "_mounts" => "0", "_track_static" => %{"0" => "https://playground.devpunx.com/assets/app-13e9adf0c3e4a9ec948e61945f5a793b.css?vsn=d", "1" => "https://playground.devpunx.com/assets/app-979ed861cf9276b7d1c89bba51312f6c.js?vsn=d"}, "vsn" => "2.0.0"}
[debug] MOUNT PlaygroundWeb.UserRegistrationLive
  Parameters: %{}
  Session: %{"_csrf_token" => "[replaced_token]"}
[debug] Replied in 187µs
[info] GET /
[debug] Processing with PlaygroundWeb.PageController.home/2
  Parameters: %{}
  Pipelines: [:browser]
[info] Sent 200 in 2ms
[error] ** (Bandit.HTTPError) closed


marschro commented Mar 14, 2025

While trying to get more logging for my debugging task, I noticed the following error:

 ** (EXIT) an exception was raised:
            ** (RuntimeError) Unsupported key(s) in http_options config: [:log_client_closures]
                (bandit 1.5.7) lib/bandit.ex:357: Bandit.validate_options/3
                (bandit 1.5.7) lib/bandit.ex:241: Bandit.start_link/1

It's still in the documentation - was it maybe removed?

UPDATE... well, actually it's still in the keys...
@http_keys ~w(compress deflate_options log_exceptions_with_status_codes log_protocol_errors log_client_closures)a
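For anyone wanting to reproduce it, a sketch of a config that triggers the error above (placeholder names; the value shown is illustrative):

# config/prod.exs - placeholder app/endpoint names, illustrative value
config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 8080,
    # rejected by Bandit 1.5.7 with "Unsupported key(s) in http_options config"
    http_options: [log_client_closures: :short]
  ]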


marschro commented Mar 14, 2025

UPDATE: More logging

With advanced error logging, the above error looks like this:

14:41:31.227 [error] ** (Bandit.HTTPError) closed
    (bandit 1.5.7) lib/bandit/http1/socket.ex:420: Bandit.HTTPTransport.Bandit.HTTP1.Socket.request_error!/2
    (bandit 1.5.7) lib/bandit/http1/socket.ex:357: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_data/3
    (bandit 1.5.7) lib/bandit/adapter.ex:242: Bandit.Adapter.send_data/3
    (bandit 1.5.7) lib/bandit/adapter.ex:140: Bandit.Adapter.send_resp/4
    (plug 1.16.1) lib/plug/conn.ex:444: Plug.Conn.send_resp/1
    (playground 0.6.20) lib/playground_web/endpoint.ex:1: PlaygroundWeb.Endpoint.do_socket_dispatch/2
    (playground 0.6.20) lib/playground_web/endpoint.ex:1: PlaygroundWeb.Endpoint.plug_builder_call/2
    (playground 0.6.20) lib/playground_web/endpoint.ex:1: PlaygroundWeb.Endpoint.call/2

Also, when getting Process.info of the Bandit process, it looks normal to me.
This is how it looks when Bandit has stopped serving:

[
  current_function: {:gen_server, :loop, 7},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  links: [#PID<0.1931.0>, #PID<0.1932.0>, #PID<0.2234.0>, #PID<0.1925.0>],
  dictionary: [
    "$initial_call": {:supervisor, ThousandIsland.Server, 1},
    "$ancestors": [PlaygroundWeb.Endpoint, Playground.Supervisor,
     #PID<0.1895.0>]
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.1894.0>,
  total_heap_size: 6771,
  heap_size: 2586,
  stack_size: 11,
  reductions: 9485,
  garbage_collection: [
    max_heap_size: %{
      error_logger: true,
      include_shared_binaries: false,
      kill: true,
      size: 0
    },
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 2
  ],
  suspending: []
]
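A minimal sketch of how such info can be pulled from a remote IEx session attached to the release (endpoint name is a placeholder; Supervisor.which_children/1 and Process.info/2 are plain Elixir/OTP calls):

# Remote IEx session attached to the release; endpoint name is a placeholder.
# The ThousandIsland.Server supervisor shows up among the endpoint's children.
for {id, pid, _type, _mods} <- Supervisor.which_children(MyAppWeb.Endpoint), is_pid(pid) do
  {id, Process.info(pid, [:status, :message_queue_len, :links, :reductions])}
end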

Also interesting:

When I manually kill the Bandit process in the running application (in the state where Bandit has stopped working), the supervisor obviously restarts Bandit, and Bandit immediately starts serving again...
So it really looks like something within Bandit and not the application or a higher level...


marschro commented Mar 16, 2025

Hi @mtrudel,

sorry to bother you so much in this issue - but I finally found the cause!

I can reproduce the behavior (the Phoenix app with Bandit becoming unresponsive after about 1 hour of uptime),
and I also have a way to fix it.

So it's not specifically a Bandit problem; it's an HAProxy thing - at least, the cause is a configuration in HAProxy.
But I want to let you know about it, so you can judge whether this is totally fine and expected behavior or whether there should also be a mitigation on the Bandit side.

The Setup

The described behavior can be reproduced with the following HAProxy config (simplified version):

global
    maxconn 4000
    user nobody
    group nobody
    daemon

defaults
    mode http
    timeout connect 50s
    timeout client 50s
    timeout server 50s
    timeout tunnel 50s

frontend SSL-Offloading
    bind :80
    bind :443 ssl crt path/to/my/cert.pem alpn h2,http/1.1

    # Redirect all non-ssl requests to secure http protocol
    # redirect scheme https if !{ ssl_fc }
    http-request redirect scheme https code 301 unless { ssl_fc }

    # Redirect requests to domain-root to www subdomain
    http-request redirect code 301 location https://www.%[hdr(host)]%[capture.req.uri] if { hdr(host) -i my-domain.com }

    # Route to backend based on host header
    use_backend www.my-domain.com if { hdr(host) -i my-domain.com }
    use_backend www.my-other-domain.com if { hdr(host) -i my-other-domain.com }
   
# This backend uses bandit
backend www.my-domain.com
    balance roundrobin
    server my_domain 192.168.0.10:8080 check

# This backend uses Cowboy
backend www.my-other-domain.com
    balance roundrobin
    server my_other_domain 192.168.0.09:8080 check 

Two things to mention here:

  1. The global maxconn is pretty low at 4000.
  2. HAProxy is asked to perform repeated TCP backend health checks to see whether the application/server on 192.168.0.10:8080 is alive. That's done by adding check at the end of the server line (see the HAProxy documentation on health checks for more).

So what happens...

In order to see what's going on between HAProxy and 192.168.0.10:8080, I checked with tcpdump on the interface - and this is what we get (a three-second excerpt):

19:44:16.929709 IP 192.168.0.0.37370 > 192.168.0.10.http-alt: Flags [S], seq 2975014247, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 1107760135 ecr 0], length 0
19:44:16.929726 IP 192.168.0.10.http-alt > 192.168.0.0.37370: Flags [S.], seq 3149284940, ack 2975014248, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3581736569 ecr 1107760135], length 0
19:44:16.929736 IP 192.168.0.0.37370 > 192.168.0.10.http-alt: Flags [.], ack 1, win 1277, options [nop,nop,TS val 1107760135 ecr 3581736569], length 0
19:44:16.929764 IP 192.168.0.0.37370 > 192.168.0.10.http-alt: Flags [R.], seq 1, ack 1, win 0, options [nop,nop,TS val 1107760135 ecr 3581736569], length 0


19:44:18.944042 IP 192.168.0.0.25977 > 192.168.0.10.http-alt: Flags [S], seq 3733863170, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 534740319 ecr 0], length 0
19:44:18.944127 IP 192.168.0.10.http-alt > 192.168.0.0.25977: Flags [S.], seq 3111848782, ack 3733863171, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 2072667181 ecr 534740319], length 0
19:44:18.944154 IP 192.168.0.0.25977 > 192.168.0.10.http-alt: Flags [.], ack 1, win 1277, options [nop,nop,TS val 534740319 ecr 2072667181], length 0
19:44:18.944190 IP 192.168.0.0.25977 > 192.168.0.10.http-alt: Flags [R.], seq 1, ack 1, win 0, options [nop,nop,TS val 534740319 ecr 2072667181], length 0


19:44:20.965536 IP 192.168.0.0.52806 > 192.168.0.10.http-alt: Flags [S], seq 1840030346, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3195020104 ecr 0], length 0
19:44:20.965554 IP 192.168.0.10.http-alt > 192.168.0.0.52806: Flags [S.], seq 2846094430, ack 1840030347, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3078994868 ecr 3195020104], length 0
19:44:20.965566 IP 192.168.0.0.52806 > 192.168.0.10.http-alt: Flags [.], ack 1, win 1277, options [nop,nop,TS val 3195020104 ecr 3078994868], length 0
19:44:20.965605 IP 192.168.0.0.52806 > 192.168.0.10.http-alt: Flags [R.], seq 1, ack 1, win 0, options [nop,nop,TS val 3195020104 ecr 3078994868], length 0

So what we see is:

  • handshake [S] (SYN)
  • handshake reply [S.] (SYN-ACK)
  • acknowledge [.] (ACK)
  • reset [R.] (RST)

And this happens because HAProxy checks whether the server is alive over and over, every second...

My interpretation

I am not totally sure who introduces the misunderstanding, HAProxy or Bandit...
But what I think happens is:

  1. HAProxy checks whether the server is alive.
  2. Bandit takes the connection and keeps it open (maybe ignoring the reset? - or does HAProxy not properly release the connections and keep them open? A bit unclear to me.)
  3. With 4000 maximum open connections, HAProxy reaches the limit in roughly 66 minutes - which is why the server mostly stops working around the same time (roughly one hour, plus or minus, when I also triggered additional requests).
  4. In the end there are 4000 open connections - caused by HAProxy's own check every second - and finally HAProxy stops accepting any further requests for that domain and responds with 503 - no server available. (A way to sanity-check this from the BEAM side is sketched after this list.)
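A sketch of that BEAM-side sanity check - assuming Bandit.PhoenixAdapter.bandit_pid/1 and ThousandIsland.connection_pids/1 are available in the installed Bandit / Thousand Island versions (endpoint name is a placeholder):

# Sketch only - helper availability depends on the installed Bandit / Thousand Island versions.
{:ok, server} = Bandit.PhoenixAdapter.bandit_pid(MyAppWeb.Endpoint)
{:ok, conns} = ThousandIsland.connection_pids(server)
IO.puts("Thousand Island is currently tracking #{length(conns)} connection processes")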

How to fix it

Besides choosing a more reasonable 50000 for the global maxconn,

  • I removed check from the server my_domain on 192.168.0.10:8080

So it looks like this now:

backend www.my-domain.com
    balance roundrobin
    server my_domain 192.168.0.10:8080

Now everything works fine, and the server was up for over 24 hours and is still working as desired...

As I assume that the TCP check request from HAProxy is the problem (it opens a connection and does not properly close it), I changed the configuration to:

backend www.my-domain.com
    balance roundrobin
    option httpchk GET /
    server my_domain 192.168.0.10:8080 check inter 5s

This probably also fixes the issue, and connections are properly closed again, since the check now happens at the HTTP protocol level rather than the TCP level.
I just changed it and am currently observing whether this is also a fix. I will post my findings here later...

UPDATE on changing health checks to HTTP => Yes, this also works properly. No connection overflow anymore when activating httpchk.

What I do not get

  • As I set maxconn 4000 globally, I would assume that the other server, which is served by Cowboy, would also stop, but it is actually still accepting requests... That's probably something managed internally by HAProxy, so that one server cannot consume all the connections and block the others; instead HAProxy is probably reserving connections for the other servers...
  • Why isn't HAProxy saying that it has stopped accepting requests, or showing that fact on the HAProxy stats page? This all went totally silent 🙁 - but as this is all my interpretation, maybe this is some edge case... dunno...
  • What I also don't get is why Bandit keeps every single connection created by the HAProxy check. So I'm missing a lot of knowledge about what happens here at the lower level (even though I watched your videos 🤣 - clearly not enough)

... phew, I am glad that I found the cause in the end.
If this is something that is totally unrelated to Bandit and should be handled by correct HAProxy configuration, then we can close this.
But maybe we could leave a note in the documentation about it, as I think HAProxy is widely used?

Cheers,
Martin


mtrudel commented Mar 21, 2025

This is some great detective work @marschro !

I've got a repro case set up locally based on your info; I don't see any obvious solution (mostly because all of this happens before Bandit/Thousand Island is even involved), but I'll keep digging.


mtrudel commented Mar 21, 2025

Got it.

The issue originated in how Thousand Island acceptors handled invalid sockets (which, for whatever reason, is how the runtime surfaces HAProxy TCP checks). This actually may explain a few other longstanding heisenbugs as well.

Fix coming to Thousand Island in a few minutes.
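To illustrate the failure mode in rough terms - a conceptual sketch only, not Thousand Island's actual code: if an acceptor exits on the error it gets back from an immediately-reset socket instead of looping, each health check permanently costs one acceptor, and once the pool is exhausted the listener stops serving.

defmodule AcceptorSketch do
  # Conceptual sketch of the failure mode - NOT Thousand Island's actual code.
  # handle_connection/1 stands in for handing the socket off to a connection process.
  def accept_loop(listener_socket) do
    case :gen_tcp.accept(listener_socket) do
      {:ok, socket} ->
        handle_connection(socket)
        accept_loop(listener_socket)

      {:error, reason} ->
        # If an immediately-reset health-check connection surfaces here as an
        # error and the acceptor exits instead of looping, each check costs one
        # acceptor; once all acceptors are gone, nothing accepts new connections.
        exit({:accept_error, reason})
    end
  end

  defp handle_connection(socket), do: :gen_tcp.close(socket)
end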


mtrudel commented Mar 21, 2025

Bump your Thousand Island dep to >= 1.3.12 and you should be golden!
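If Thousand Island only comes in transitively via Bandit, listing it explicitly is one way to enforce the floor version - a sketch of the mix.exs deps entry (version requirements shown for illustration):

# mix.exs - ensure the transitive thousand_island dependency is at least 1.3.12
defp deps do
  [
    {:bandit, "~> 1.6"},
    {:thousand_island, ">= 1.3.12"}
  ]
end

Running mix deps.update thousand_island afterwards should pull in the fixed release.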


mtrudel commented Mar 21, 2025

Thanks again for all the detective work!


dch commented Mar 22, 2025

Ironic that I only just ran into this issue last week, and... after tracking it down I find it already fixed. Thanks @mtrudel and @marschro for the sweet detective work. And please always @ me if you see a FreeBSD issue ;-)


dch commented Mar 22, 2025

4000 maxconn in haproxy is a lot - 4000 active concurrent connections.
Bumping this to a higher value just pushes the issue further down the road.

I've added some other useful troubleshooting tips that should help move faster next time.

Basically I am not seeing these connections drain, and the kernel then buffers a percentage of them until it refuses further requests.

  • connections arrive from haproxy load balancer on different system

  • go through 2 firewalls, then get handed to the BEAM web server

  • load test via wrk2 -d 900s -c 800 -t 8 -L -R 800 https://whiskey.skunkwerks.at/

  • after the load test is complete, there should be no outstanding connections

  • but after load test is complete, we see active sessions still! These
    are through the LB and are pending acceptance in the kernel on the box
    where Bandit is running:


  • haproxy reports 63 active connections (the backlogged ones to the kernel)

On the BEAM box:

  • netstat reports 63 in ESTABLISHED state (kernel acceptors)
  • netstat reports 130 in CLOSED state
  • the BEAM has 128 open connections (total acceptors) according to the kernel
  • but recon.tcp() only reports 4 current tcp connections in BEAM
### on the server running BEAM
# netstat -ALanp tcp
Current listen queue sizes (qlen/incqlen/maxqlen)
Tcpcb            Proto Listen                           Local Address
fffff80e8f69e000 tcp4  193/0/128                        *.4000

# netstat -Anp tcp
Tcpcb            Proto Recv-Q Send-Q Local Address      Foreign Address    (state)
fffff8139bc8b000 tcp4      76      0 172.16.1.4.4000    172.16.1.1.54870   ESTABLISHED
... 63x of them
fffff81debf5e540 tcp4       0      0 172.16.1.4.4000    172.16.1.1.65056   CLOSED
... 130x of them

This isn't related to HAProxy, but it is definitely exacerbated by having the
healthchecks use up connections faster!
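For quickly comparing the kernel's view with the BEAM's view, something like this from a remote IEx session works (assumes the :recon dependency is available in the release):

# Remote IEx session - :recon must be present in the release.
beam_tcp = length(:recon.tcp())
IO.puts("TCP ports held by the BEAM: #{beam_tcp}")
# compare against `netstat -Anp tcp` on the host for the kernel's view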


marschro commented Mar 25, 2025

Oh wow 🤩 - That’s great news!
Thank you so much @mtrudel !

Also thank you @dch! I'll keep in mind to mention you for BSD-related stuff in the future!

Cheers 🍻 😁
Martin
