
hackney 1.17.4 connections stuck in a pool #683

Closed
sircinek opened this issue Mar 25, 2021 · 26 comments
@sircinek

Problem

Connections get stuck and are never checked in again, exhausting the pool (making it unusable without a restart).

Version/usage

hackney 1.17.4 as an adapter for tesla 1.3.3, but we have seen similar problems with httpoison running hackney 1.17.4 too.

erlang 23.0.4
elixir 1.10.4-otp-23

Observations

iex(app@localhost)4> Application.loaded_applications() |> Enum.find(fn x -> elem(x,0) == :hackney end)
{:hackney, 'simple HTTP client', '1.17.4'}
iex(app@localhost)5> :hackney_pool.get_stats "auth_client_connection_pool"
[
  name: "auth_client_connection_pool",
  max: 50,
  in_use_count: 50,
  free_count: 0,
  queue_count: 27
]
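
(For context, a pool like this is normally started explicitly at boot via hackney_pool's start_pool/2; a minimal sketch, where the option values are assumptions and only the pool name and max size come from the stats above:)

# Sketch: starting a hackney pool with an explicit size (option values assumed).
:hackney_pool.start_pool(
  "auth_client_connection_pool",
  timeout: 150_000,
  max_connections: 50
)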

It looks like after ~24 hours of uptime we hit an issue that looks similar to the one #681 tries to fix.
I am not sure it is the same problem, as we are unable to reproduce it consistently, nor do we understand all the circumstances around it. Looking at our metrics, I see some requests timing out/failing prior to this happening, but nothing conclusive.

All the processes that have connections checked out or are queued are no longer alive, and it has been like that for over 24 hours now.

@benoitc
Owner

benoitc commented Mar 25, 2021 via email

@sircinek
Author

sircinek commented Mar 25, 2021

As I said, we are using Tesla with the hackney adapter, but it should be more or less like this:

%% RespHeaders/RespBody are fresh variables for the response, so they don't
%% accidentally re-match the already-bound request Headers/Body.
case hackney:request(post, <<"http://some_service:5060/endpoint">>, Headers, Body, [{pool, <<"custom_pool">>}]) of
  {ok, Status, RespHeaders, Ref} when is_reference(Ref) ->
    case hackney:body(Ref) of
      {ok, RespBody} -> {ok, Status, RespHeaders, RespBody};
      Err -> Err
    end;
  {ok, Status, RespHeaders} ->
    {ok, Status, RespHeaders, []};
  {error, _} = Error ->
    Error
end

Adapter code
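
(For readers unfamiliar with the setup, the wiring looks roughly like the sketch below. The middleware list and client shape are assumptions; only the pool name and endpoint come from this issue, and the adapter options are expected to be passed through to hackney.)

# Sketch of a Tesla client backed by the hackney adapter and a named pool
# (middleware list is an assumption; adapter opts are forwarded to hackney).
client =
  Tesla.client(
    [Tesla.Middleware.JSON],
    {Tesla.Adapter.Hackney, pool: "custom_pool"}
  )

Tesla.post(client, "http://some_service:5060/endpoint", %{some_request: "some_data"})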

I traced the :hackney module while calling our lib, and this is the result (the data is fake, just to give you an overview):

12:59:06.308032	 {:trace, #PID<0.670.0>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
12:59:06.813365	 {:trace, #PID<0.670.0>, :return_from, {:hackney, :request, 5}, {:ok, 404, [{"Content-Type", "application/json; charset=utf-8"}, {"Content-Length", "158"}, {"Date", "Thu, 25 Mar 2021 12:59:06 GMT"}, {"Server", "some_server"}], #Reference<0.275978613.3853516803.200933>}}
12:59:06.814174	 {:trace, #PID<0.670.0>, :call, {:hackney, :body, [#Reference<0.275978613.3853516803.200933>]}, {Tesla.Adapter.Hackney, :handle, 1}}
12:59:06.814637	 {:trace, #PID<0.670.0>, :return_from, {:hackney, :body, 1}, {:ok, "{\"error\":\"some_error\"}"}}
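
(For reference, call/return traces like the one above can be captured with OTP's :dbg; a minimal sketch, not necessarily the exact setup used here:)

# Minimal :dbg setup producing call/return_from/exception_from events with
# timestamps, similar to the trace above (the original may have used another tool).
:dbg.tracer()
:dbg.p(:all, [:c, :timestamp])
:dbg.tpl(:hackney, :request, :x)  # :x also reports return values and exceptions
:dbg.tpl(:hackney, :body, :x)
# ...exercise the client...
:dbg.stop_clear()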

I did the same exercise on the node that has the exhausted pool (all connections in use), and I see:

13:16:35.501878	 {:trace, #PID<0.16252.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}],  "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:36.502299	 {:trace, #PID<0.16252.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
13:16:36.524614	 {:trace, #PID<0.16533.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}],  "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:37.525295	 {:trace, #PID<0.16533.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
13:16:37.556506	 {:trace, #PID<0.16487.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}],  "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:38.557300	 {:trace, #PID<0.16487.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}

We are using Tesla.Middleware.Timeout, which runs an async task in a separate process and terminates the request if it takes longer than a configurable timeout (default 1s), alongside Tesla.Middleware.Retry, which retries the request a configurable number of times with a configurable delay. Maybe that contributes to the problem, although the exit(shutdown) seemed almost immediate after the request was made (weird?). Actually, it is after 1s, so it comes from the Timeout middleware terminating the task.
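
(To make that interaction concrete, the timeout-middleware pattern described above looks roughly like the sketch below; it is simplified, not Tesla's actual implementation, and TimeoutSketch is a made-up module.)

defmodule TimeoutSketch do
  # Run the request function in a Task and shut the Task down if it does not
  # finish within `timeout` ms; Task.shutdown/1 exits the task with :shutdown,
  # which matches the {:exit, :shutdown} seen in the trace above.
  def call(request_fun, timeout \\ 1_000) do
    task = Task.async(request_fun)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> result
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end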

EDIT:
I just tried using hackney directly, without Tesla or our library wrapper, and it returns :checkout_timeout as expected:

iex(app@localhost)40> :hackney.request(:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"])
{:error, :checkout_timeout}

Looks like Tesla is messing something up; I will investigate further.

@sircinek
Author

sircinek commented Mar 25, 2021

Given how the Tesla middlewares (especially Tesla.Middleware.Timeout) interact with the hackney request, the snippet should be more like:

Retries = X,
RetryDelay = Y,
Timeout = Z,
Requester = self(),

Request = fun() ->
  Result = case hackney:request(post, <<"http://some_service:5060/endpoint">>, Headers, Body, [{pool, <<"custom_pool">>}]) of
    {ok, Status, RespHeaders, Ref} when is_reference(Ref) ->
      case hackney:body(Ref) of
        {ok, RespBody} -> {ok, Status, RespHeaders, RespBody};
        Err -> Err
      end;
    {ok, Status, RespHeaders} ->
      {ok, Status, RespHeaders, []};
    {error, _} = Error ->
      Error
  end,
  Requester ! {result, Result}
end,

HandleSingleRequest = fun() ->
  try
    Task = spawn_link(Request),
    receive
      {result, _} = Res ->
        throw(Res)
    after Timeout ->
      exit(Task, shutdown),
      {error, timeout}
    end
  catch
    exit:{timeout, _} ->
      {error, timeout};
    exit:shutdown ->
      {error, timeout}
  end
end,

%% A successful response escapes the retry loop via throw({result, ...}),
%% which is caught by this outer try.
try
  hd([begin R = HandleSingleRequest(), timer:sleep(RetryDelay), R end || _ <- lists:seq(0, Retries)])
catch
  throw:{result, Result} ->
    Result
end

forgive me the typos/syntax errors, haven't done Erlang in some time ;)

@sircinek
Author

sircinek commented Mar 26, 2021

If it helps with the investigation: it seems to always happen once requests start to be queued; I haven't seen it occur without any queued elements. So my wild guess would be that things start to slow down for us (number of requests vs. response time), we start to queue requests because no connections are available in the pool, and at some point we hit the unlikely race condition between is_process_alive/1 and erlang:send/2, since this is done in two steps and the Tasks are killed asynchronously by the Timeout middleware (exit(shutdown)). (This is just a wild guess, though; no hard proof of that.)
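
(To illustrate the suspected race, a purely hypothetical sketch; this is not hackney's code, and RaceSketch, hand_off/2 and the message shape are made up:)

defmodule RaceSketch do
  # Hypothetical two-step hand-off illustrating the suspected race.
  def hand_off(waiter, socket) when is_pid(waiter) do
    if Process.alive?(waiter) do
      # the waiter can be killed (exit(:shutdown) from the Timeout middleware)
      # right here, between the liveness check and the send; the connection
      # then stays checked out forever, because the dead waiter never checks
      # it back in
      send(waiter, {:checkout, socket})
      :handed_off
    else
      :waiter_dead
    end
  end
end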

To try to mitigate the problem on our side, I will make hackney's checkout_timeout shorter than Tesla.Middleware.Timeout by a reasonable amount (like 100ms or 200ms); if that solves the problem for us, it will basically confirm my guess. 🤷‍♂️
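
(A sketch of that mitigation, assuming the checkout timeout can be passed as a :checkout_timeout request option; the option name is inferred from the :checkout_timeout error above, and the values are examples:)

# Keep hackney's pool checkout timeout a bit below Tesla.Middleware.Timeout
# (1_000 ms here) so hackney gives up before the Task is killed.
# Assumption: :checkout_timeout is honoured as a request option in 1.17.x.
:hackney.request(
  :post,
  "http://some_service:5060/endpoint",
  [{"content-type", "application/json"}],
  ~s({"some_request":"some_data"}),
  pool: "custom_pool",
  checkout_timeout: 800
)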

@sircinek
Author

sircinek commented Mar 26, 2021

Another observation: on the node where the problem reproduced, if I increase the max_connections limit for the stuck pool by one, nothing changes (queued requests don't get dequeued).
When I make a fresh request using the pool (successful, because I now have capacity), then once my request returns I see one of the queued elements get removed from the queue (but the process that requested it is no longer alive), and the state of the pool changes to this:

iex(app@localhost)20> :hackney_pool.get_stats("auth_client_connection_pool")
[
  name: "auth_client_connection_pool",
  max: 51,
  in_use_count: 50,
  free_count: 0,
  queue_count: 26
]

When I repeat the exercise, I get another one out of the queue, but requests do not continue to get dequeued once connections are returned... 🤔
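
(For reference, the resize described above can be done at runtime; a sketch assuming hackney_pool exposes set_max_connections/2, with the pool name taken from the stats:)

# Bump the stuck pool's limit by one at runtime, then re-check the stats.
# Assumption: :hackney_pool.set_max_connections/2 is the resizing call.
:hackney_pool.set_max_connections("auth_client_connection_pool", 51)
:hackney_pool.get_stats("auth_client_connection_pool")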

@benoitc
Owner

benoitc commented Mar 26, 2021 via email

@sircinek
Author

sircinek commented Mar 29, 2021

> Thanks a lot for the feedback. I think I understand now what's happening.
> I will fix it tonight/this week-end :)

Could you elaborate on where the problem is? FYI, decreasing checkout_timeout didn't prevent the issue from happening :)

@benoitc
Owner

benoitc commented Mar 30, 2021

@sircinek better to talk over the code :)

That should finally land as a PR later today or tomorrow morning. Roughly speaking, spawning a process to check out the connection breaks the original design, which used the manager and expected the connection to be checked out synchronously by the requester.

The new design puts back more synchronicity. While I am here, I am also fixing the SSL handling. As I said, I will post the code ASAP today/tomorrow morning (CET) :)
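
(A conceptual sketch of the difference being described, with invented names; PoolManager and the message shapes are not hackney's actual API:)

requester = self()

# (a) synchronous checkout: the requester itself asks the pool, so the pool
# can tie ownership and monitoring to the real caller
{:ok, _conn} = GenServer.call(PoolManager, {:checkout, requester})

# (b) checkout through a spawned helper: if the requester is killed
# (e.g. exit(:shutdown) from a timeout middleware) while the helper is still
# waiting, the checked-out connection can end up owned by nobody
spawn(fn ->
  {:ok, conn} = GenServer.call(PoolManager, {:checkout, self()})
  send(requester, {:conn, conn})
end)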

@sircinek
Author

Wonderful. Thank you so much for looking into this 🙇

@sircinek
Author

sircinek commented Apr 1, 2021

Hey @benoitc, any updates on this? :)

@benoitc
Owner

benoitc commented Apr 1, 2021 via email

@sircinek
Author

sircinek commented Apr 7, 2021

@benoitc maybe you need a hand with this?

@benoitc
Owner

benoitc commented Apr 7, 2021 via email

@benoitc
Owner

benoitc commented Apr 13, 2021

Short update: a new branch will land tomorrow. I've been sidetracked in between.

@sircinek
Author

@benoitc how's it going? We are kind of stuck working around this issue; I guess we will downgrade to 1.15.2 in the meantime, or use another library.

@benoitc
Owner

benoitc commented Apr 21, 2021 via email

@Aduril

Aduril commented May 5, 2021

Hello @benoitc,
Hello @sircinek,

thanks for pointing out that error and thanks for your work so far. We're sitting in the same boat here, so I was wondering if there is any way to help you out. My team and I could volunteer to test it on our platform, if that helps you :)

Kind regards

@benoitc
Owner

benoitc commented May 5, 2021 via email

@Aduril

Aduril commented May 18, 2021

Hello @benoitc,

I don't want to annoy you, and I appreciate the work you are doing here, but is there any way to move some obstacles out of your way? If you prefer, you can also DM me (for example on Twitter).

Kind regards
Peter

@benoitc benoitc added the pool label May 19, 2021
@benoitc
Owner

benoitc commented May 19, 2021

@Aduril as I quickly told you online, I will describe separately what the roadmap is and a way to help the project, to ensure we get releases more often.

In the meantime, please try the following hot-fix, 592a007, which should fix this issue. Please let me know!

@benoitc
Owner

benoitc commented May 19, 2021

To all: I pushed the hot-fix 592a007, which should fix this issue. Please let me know!
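
(For anyone who wants to try the commit before a Hex release, it can be pinned as a git dependency; a sketch for mix.exs, to be adjusted per project:)

# mix.exs: pin hackney to the hot-fix commit instead of the Hex release.
defp deps do
  [
    {:hackney, github: "benoitc/hackney", ref: "592a007"}
  ]
end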

@benoitc
Owner

benoitc commented May 20, 2021

The patch above seems to fix it according to my tests. Closing the issue, then. Feel free to reopen it if needed.

@benoitc benoitc closed this as completed May 20, 2021
@ettomatic

Thanks a lot for this @benoitc! Are you planning a new release for this?

@benoitc
Owner

benoitc commented May 27, 2021 via email

@josealejandromu

Hello @benoitc! Is there going to be a release soon to solve this issue?

@benoitc
Owner

benoitc commented Sep 22, 2021

Yes, this is planned, along with other announcements, by Friday.
