
hackney 1.17.4 connections stuck in a pool #683

Closed
sircinek opened this issue Mar 25, 2021 · 26 comments
@sircinek

Problem

Connections get stuck and are never checked in again, exhausting the pool (making it unusable without a restart).

Version/usage

hackney 1.17.4 as an adapter for tesla 1.3.3, but we have seen similar problems with httpoison running hackney 1.17.4 too.

erlang 23.0.4
elixir 1.10.4-otp-23

Observations

iex(app@localhost)4> Application.loaded_applications() |> Enum.find(fn x -> elem(x,0) == :hackney end)
{:hackney, 'simple HTTP client', '1.17.4'}
iex(app@localhost)5> :hackney_pool.get_stats "auth_client_connection_pool"
[
  name: "auth_client_connection_pool",
  max: 50,
  in_use_count: 50,
  free_count: 0,
  queue_count: 27
]
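
(For context, a pool like this is normally started explicitly at boot via hackney_pool's start_pool/2; a minimal sketch, where the option values are assumptions and only the pool name and max size come from the stats above:)

# Sketch: starting a hackney pool with an explicit size (option values assumed).
:hackney_pool.start_pool(
  "auth_client_connection_pool",
  timeout: 150_000,
  max_connections: 50
)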

It looks like after ~24 hours of uptime we hit an issue that looks similar to the one #681 tries to fix.
I am not sure it is the same problem, as we are unable to reproduce it consistently, nor do we understand all the circumstances around it. Looking at our metrics, I see some requests timing out/failing prior to this happening, but nothing conclusive.

All the processes that have connections checked out or are queued are no longer alive, and it has been like that for over 24 hours now.

@benoitc
Owner

benoitc commented Mar 25, 2021 via email

@sircinek
Author

sircinek commented Mar 25, 2021

As I said, we are using Tesla with the hackney adapter, but it should be more or less like this:

%% RespHeaders/RespBody are fresh variables for the response, so they don't
%% accidentally re-match the already-bound request Headers/Body.
case hackney:request(post, <<"http://some_service:5060/endpoint">>, Headers, Body, [{pool, <<"custom_pool">>}]) of
  {ok, Status, RespHeaders, Ref} when is_reference(Ref) ->
    case hackney:body(Ref) of
      {ok, RespBody} -> {ok, Status, RespHeaders, RespBody};
      Err -> Err
    end;
  {ok, Status, RespHeaders} ->
    {ok, Status, RespHeaders, []};
  {error, _} = Error ->
    Error
end

Adapter code
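
(For readers unfamiliar with the setup, the wiring looks roughly like the sketch below. The middleware list and client shape are assumptions; only the pool name and endpoint come from this issue, and the adapter options are expected to be passed through to hackney.)

# Sketch of a Tesla client backed by the hackney adapter and a named pool
# (middleware list is an assumption; adapter opts are forwarded to hackney).
client =
  Tesla.client(
    [Tesla.Middleware.JSON],
    {Tesla.Adapter.Hackney, pool: "custom_pool"}
  )

Tesla.post(client, "http://some_service:5060/endpoint", %{some_request: "some_data"})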

I traced the :hackney module while calling our lib, and this is the result (the data is fake, just to give you an overview):

12:59:06.308032	 {:trace, #PID<0.670.0>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
12:59:06.813365	 {:trace, #PID<0.670.0>, :return_from, {:hackney, :request, 5}, {:ok, 404, [{"Content-Type", "application/json; charset=utf-8"}, {"Content-Length", "158"}, {"Date", "Thu, 25 Mar 2021 12:59:06 GMT"}, {"Server", "some_server"}], #Reference<0.275978613.3853516803.200933>}}
12:59:06.814174	 {:trace, #PID<0.670.0>, :call, {:hackney, :body, [#Reference<0.275978613.3853516803.200933>]}, {Tesla.Adapter.Hackney, :handle, 1}}
12:59:06.814637	 {:trace, #PID<0.670.0>, :return_from, {:hackney, :body, 1}, {:ok, "{\"error\":\"some_error\"}"}}
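
(For reference, call/return traces like the one above can be captured with OTP's :dbg; a minimal sketch, not necessarily the exact setup used here:)

# Minimal :dbg setup producing call/return_from/exception_from events with
# timestamps, similar to the trace above (the original may have used another tool).
:dbg.tracer()
:dbg.p(:all, [:c, :timestamp])
:dbg.tpl(:hackney, :request, :x)  # :x also reports return values and exceptions
:dbg.tpl(:hackney, :body, :x)
# ...exercise the client...
:dbg.stop_clear()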

I did the same exercise on the node that has the exhausted pool (all connections in use), and I see:

13:16:35.501878	 {:trace, #PID<0.16252.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}],  "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:36.502299	 {:trace, #PID<0.16252.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
13:16:36.524614	 {:trace, #PID<0.16533.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}],  "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:37.525295	 {:trace, #PID<0.16533.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
13:16:37.556506	 {:trace, #PID<0.16487.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}],  "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:38.557300	 {:trace, #PID<0.16487.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}

We are using Tesla.Middleware.Timeout, which runs an async task in a separate process and terminates the request if it takes longer than a configurable timeout (default 1s), alongside Tesla.Middleware.Retry, which retries the request a configurable number of times with a configurable delay. Maybe that contributes to the problem, although the exit(shutdown) seemed almost immediate after the request was made (weird?). Actually, it is after 1s, so it comes from the Timeout middleware terminating the task.
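
(To make that interaction concrete, the timeout-middleware pattern described above looks roughly like the sketch below; it is simplified, not Tesla's actual implementation, and TimeoutSketch is a made-up module.)

defmodule TimeoutSketch do
  # Run the request function in a Task and shut the Task down if it does not
  # finish within `timeout` ms; Task.shutdown/1 exits the task with :shutdown,
  # which matches the {:exit, :shutdown} seen in the trace above.
  def call(request_fun, timeout \\ 1_000) do
    task = Task.async(request_fun)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> result
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end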

EDIT:
I just tried using hackney directly, without Tesla or our library wrapper, and it returns :checkout_timeout as expected:

iex(app@localhost)40> :hackney.request(:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"])
{:error, :checkout_timeout}

Looks like Tesla is messing something up; I will investigate further.

@sircinek
Author

sircinek commented Mar 25, 2021

Given how the Tesla middlewares (especially Tesla.Middleware.Timeout) interact with the hackney request, the snippet should be more like:

Retries = X,
RetryDelay = Y,
Timeout = Z,
Requester = self(),

Request = fun() ->
  Result = case hackney:request(post, <<"http://some_service:5060/endpoint">>, Headers, Body, [{pool, <<"custom_pool">>}]) of
    {ok, Status, RespHeaders, Ref} when is_reference(Ref) ->
      case hackney:body(Ref) of
        {ok, RespBody} -> {ok, Status, RespHeaders, RespBody};
        Err -> Err
      end;
    {ok, Status, RespHeaders} ->
      {ok, Status, RespHeaders, []};
    {error, _} = Error ->
      Error
  end,
  Requester ! {result, Result}
end,

HandleSingleRequest = fun() ->
  try
    Task = spawn_link(Request),
    receive
      {result, _} = Res ->
        throw(Res)
    after Timeout ->
      exit(Task, shutdown),
      {error, timeout}
    end
  catch
    exit:{timeout, _} ->
      {error, timeout};
    exit:shutdown ->
      {error, timeout}
  end
end,

%% A successful response escapes the retry loop via throw({result, ...}),
%% which is caught by this outer try.
try
  hd([begin R = HandleSingleRequest(), timer:sleep(RetryDelay), R end || _ <- lists:seq(0, Retries)])
catch
  throw:{result, Result} ->
    Result
end

forgive me the typos/syntax errors, haven't done Erlang in some time ;)

@sircinek
Author

sircinek commented Mar 26, 2021

If it helps with the investigation: it seems to always happen once requests start to be queued; I haven't seen it occur without any queued elements. So my wild guess would be that things start to slow down for us (number of requests vs. response time), we start to queue requests because no connections are available in the pool, and at some point we hit the unlikely race condition between is_process_alive/1 and erlang:send/2, since this is done in two steps and the Tasks are killed asynchronously by the Timeout middleware (exit(shutdown)). (This is just a wild guess, though; no hard proof of that.)
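
(To illustrate the suspected race, a purely hypothetical sketch; this is not hackney's code, and RaceSketch, hand_off/2 and the message shape are made up:)

defmodule RaceSketch do
  # Hypothetical two-step hand-off illustrating the suspected race.
  def hand_off(waiter, socket) when is_pid(waiter) do
    if Process.alive?(waiter) do
      # the waiter can be killed (exit(:shutdown) from the Timeout middleware)
      # right here, between the liveness check and the send; the connection
      # then stays checked out forever, because the dead waiter never checks
      # it back in
      send(waiter, {:checkout, socket})
      :handed_off
    else
      :waiter_dead
    end
  end
end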

To try to mitigate the problem on our side, I will make hackney's checkout_timeout shorter than Tesla.Middleware.Timeout by a reasonable amount (like 100ms or 200ms); if that solves the problem for us, it will basically confirm my guess. 🤷‍♂️
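
(A sketch of that mitigation, assuming the checkout timeout can be passed as a :checkout_timeout request option; the option name is inferred from the :checkout_timeout error above, and the values are examples:)

# Keep hackney's pool checkout timeout a bit below Tesla.Middleware.Timeout
# (1_000 ms here) so hackney gives up before the Task is killed.
# Assumption: :checkout_timeout is honoured as a request option in 1.17.x.
:hackney.request(
  :post,
  "http://some_service:5060/endpoint",
  [{"content-type", "application/json"}],
  ~s({"some_request":"some_data"}),
  pool: "custom_pool",
  checkout_timeout: 800
)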

@sircinek
Author

sircinek commented Mar 26, 2021

Another observation: on the node where the problem reproduced, if I increase the max_connections limit for the stuck pool by one, nothing changes (queued requests don't get dequeued).
When I make a fresh request using the pool (successful, because I now have capacity), then once my request returns I see one of the queued elements get removed from the queue (but the process that requested it is no longer alive), and the state of the pool changes to this:

iex(app@localhost)20> :hackney_pool.get_stats("auth_client_connection_pool")
[
  name: "auth_client_connection_pool",
  max: 51,
  in_use_count: 50,
  free_count: 0,
  queue_count: 26
]

When I repeat the exercise, I get another one out of the queue, but requests do not continue to get dequeued once connections are returned... 🤔
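
(For reference, the resize described above can be done at runtime; a sketch assuming hackney_pool exposes set_max_connections/2, with the pool name taken from the stats:)

# Bump the stuck pool's limit by one at runtime, then re-check the stats.
# Assumption: :hackney_pool.set_max_connections/2 is the resizing call.
:hackney_pool.set_max_connections("auth_client_connection_pool", 51)
:hackney_pool.get_stats("auth_client_connection_pool")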

@benoitc
Owner

benoitc commented Mar 26, 2021 via email

@sircinek
Author

sircinek commented Mar 29, 2021

> Thanks a lot for the feedback. I think I understand now what's happening.
> I will fix it tonight/this week-end :)

Could you elaborate on where the problem is? FYI, decreasing checkout_timeout didn't prevent the issue from happening :)

@benoitc
Owner

benoitc commented Mar 30, 2021

@sircinek better to talk over the code :)

That should finally land as a PR later today or tomorrow morning. Roughly speaking, spawning a process to check out the connection breaks the original design, which used the manager and expected the connection to be checked out synchronously by the requester.

The new design puts back more synchronicity. While I am here, I am also fixing the SSL handling. As I said, I will post the code ASAP today/tomorrow morning (CET) :)
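
(A conceptual sketch of the difference being described, with invented names; PoolManager and the message shapes are not hackney's actual API:)

requester = self()

# (a) synchronous checkout: the requester itself asks the pool, so the pool
# can tie ownership and monitoring to the real caller
{:ok, _conn} = GenServer.call(PoolManager, {:checkout, requester})

# (b) checkout through a spawned helper: if the requester is killed
# (e.g. exit(:shutdown) from a timeout middleware) while the helper is still
# waiting, the checked-out connection can end up owned by nobody
spawn(fn ->
  {:ok, conn} = GenServer.call(PoolManager, {:checkout, self()})
  send(requester, {:conn, conn})
end)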

@sircinek
Author

Wonderful. Thank you so much for looking into this 🙇

@sircinek
Author

sircinek commented Apr 1, 2021

Hey @benoitc, any updates on this? :)

@benoitc
Owner

benoitc commented Apr 1, 2021 via email

@sircinek
Author

sircinek commented Apr 7, 2021

@benoitc maybe you need a hand with this?

@benoitc
Owner

benoitc commented Apr 7, 2021 via email

@benoitc
Owner

benoitc commented Apr 13, 2021

Short update: a new branch will land tomorrow. I've been sidetracked in between.

@sircinek
Author

@benoitc how's it going? We are kind of stuck working around this issue; I guess we will downgrade to 1.15.2 in the meantime, or use another library.

@benoitc
Owner

benoitc commented Apr 21, 2021 via email

@Aduril

Aduril commented May 5, 2021

Hello @benoitc,
Hello @sircinek,

thanks for pointing out that error and thanks for your work so far. We're sitting in the same boat here, so I was wondering if there is any way to help you out. My team and I could volunteer to test it on our platform, if that helps you :)

Kind regards

@benoitc
Owner

benoitc commented May 5, 2021 via email

@Aduril

Aduril commented May 18, 2021

Hello @benoitc,

I don't want to annoy you, and I appreciate the work you are doing here, but is there any way to move some obstacles out of your way? If you prefer, you can also DM me (for example on Twitter).

Kind regards
Peter

@benoitc benoitc added the pool label May 19, 2021
@benoitc
Owner

benoitc commented May 19, 2021

@Aduril as I quickly told you online, I will describe separately what the roadmap is and a way to help the project, to ensure we get releases more often.

In the meantime, please try the following hot-fix, 592a007, which should fix this issue. Please let me know!

@benoitc
Owner

benoitc commented May 19, 2021

To all: I pushed the hot-fix 592a007, which should fix this issue. Please let me know!
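
(For anyone who wants to try the commit before a Hex release, it can be pinned as a git dependency; a sketch for mix.exs, to be adjusted per project:)

# mix.exs: pin hackney to the hot-fix commit instead of the Hex release.
defp deps do
  [
    {:hackney, github: "benoitc/hackney", ref: "592a007"}
  ]
end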

@benoitc
Owner

benoitc commented May 20, 2021

The patch above seems to fix it according to my tests. Closing the issue, then. Feel free to reopen it if needed.

@benoitc benoitc closed this as completed May 20, 2021
@ettomatic

Thanks a lot for this @benoitc! Are you planning a new release for this?

@benoitc
Owner

benoitc commented May 27, 2021 via email

@josealejandromu

Hello @benoitc! Is there going to be a release soon to solve this issue?

@benoitc
Owner

benoitc commented Sep 22, 2021

Yes, this is planned, along with other announcements, by Friday.
