hackney 1.17.4 connections stuck in a pool #683
Problem

Connections getting stuck and never being checked in again, exhausting the pool (making it unusable without a restart).

Version/usage

hackney 1.17.4 as an adapter to tesla 1.3.3, but we have seen similar problems with httpoison running hackney 1.17.4 too.
erlang 23.0.4
elixir 1.10.4-otp-23

Observations

```
iex(app@localhost)4> Application.loaded_applications() |> Enum.find(fn x -> elem(x, 0) == :hackney end)
{:hackney, 'simple HTTP client', '1.17.4'}
iex(app@localhost)5> :hackney_pool.get_stats "auth_client_connection_pool"
[
  name: "auth_client_connection_pool",
  max: 50,
  in_use_count: 50,
  free_count: 0,
  queue_count: 27
]
```

It looks like after ~24 hours of uptime we hit an issue similar to what #681 tries to fix. I am not sure it is the same problem, as we are unable to reproduce it consistently, nor do we understand all the circumstances around it. Looking at our metrics, I see some requests timing out/failing prior to this happening, but nothing conclusive.

All the processes that have connections checked out or are queued are no longer alive, and it has been like that for over 24 hours now.

Comments

Can you provide a snippet on how you make your requests?
As I said, we are using:

```erlang
case hackney:request(post, <<"http://some_service:5060/endpoint">>,
                     Headers, Body, [{pool, <<"custom_pool">>}]) of
    %% response headers/body are bound to fresh variables so they do not
    %% accidentally pattern-match against the already-bound request Headers/Body
    {ok, Status, RespHeaders, Ref} when is_reference(Ref) ->
        case hackney:body(Ref) of
            {ok, RespBody} -> {ok, Status, RespHeaders, RespBody};
            Err -> Err
        end;
    {ok, Status, RespHeaders} ->
        {ok, Status, RespHeaders, []};
    {error, _} = Error ->
        Error
end
```

I did a trace of the calls:

```
12:59:06.308032 {:trace, #PID<0.670.0>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
12:59:06.813365 {:trace, #PID<0.670.0>, :return_from, {:hackney, :request, 5}, {:ok, 404, [{"Content-Type", "application/json; charset=utf-8"}, {"Content-Length", "158"}, {"Date", "Thu, 25 Mar 2021 12:59:06 GMT"}, {"Server", "some_server"}], #Reference<0.275978613.3853516803.200933>}}
12:59:06.814174 {:trace, #PID<0.670.0>, :call, {:hackney, :body, [#Reference<0.275978613.3853516803.200933>]}, {Tesla.Adapter.Hackney, :handle, 1}}
12:59:06.814637 {:trace, #PID<0.670.0>, :return_from, {:hackney, :body, 1}, {:ok, "{\"error\":\"some_error\"}"}}
```

I did the same exercise on the node that has the exhausted pool (all connections in use) and I see:

```
13:16:35.501878 {:trace, #PID<0.16252.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:36.502299 {:trace, #PID<0.16252.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
13:16:36.524614 {:trace, #PID<0.16533.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:37.525295 {:trace, #PID<0.16533.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
13:16:37.556506 {:trace, #PID<0.16487.376>, :call, {:hackney, :request, [:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"]]}, {Tesla.Adapter.Hackney, :request, 5}}
13:16:38.557300 {:trace, #PID<0.16487.376>, :exception_from, {:hackney, :request, 5}, {:exit, :shutdown}}
```

We are using …

EDIT:

```
iex(app@localhost)40> :hackney.request(:post, "http://some_service:5060/endpoint", [{"content-type", "application/json"}], "{\"some_request\":\"some_data\"}", [pool: "custom_pool"])
{:error, :checkout_timeout}
```

Looks like Tesla is messing up with something, I will investigate further.
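For reference, traces like the ones above can be captured with OTP's built-in `:dbg` tracer; a minimal sketch follows (the original traces may have been produced with a different tool, so the exact output format will differ):

```elixir
# Minimal tracing sketch: watch hackney:request/5 and hackney:body/1,
# including exceptions, from an attached IEx shell.
:dbg.tracer()
:dbg.p(:all, :call)
:dbg.tpl(:hackney, :request, 5, [{:_, [], [{:exception_trace}]}])
:dbg.tpl(:hackney, :body, 1, [{:_, [], [{:exception_trace}]}])

# ... exercise the client, watch call / return_from / exception_from events ...

:dbg.stop()
```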
Given how Tesla middlewares (especially Tesla.Middleware.Timeout) interact with the hackney request, the snippet should be more like:

```erlang
Retries = X,
RetryDelay = Y,
Timeout = Z,
Requester = self(),
Request = fun() ->
    Result =
        case hackney:request(post, <<"http://some_service:5060/endpoint">>,
                             Headers, Body, [{pool, <<"custom_pool">>}]) of
            {ok, Status, RespHeaders, Ref} when is_reference(Ref) ->
                case hackney:body(Ref) of
                    {ok, RespBody} -> {ok, Status, RespHeaders, RespBody};
                    Err -> Err
                end;
            {ok, Status, RespHeaders} ->
                {ok, Status, RespHeaders, []};
            {error, _} = Error ->
                Error
        end,
    Requester ! {result, Result}
end,
HandleSingleRequest = fun() ->
    try
        %% the request runs in a separate, linked process...
        Task = spawn_link(Request),
        receive
            {result, _} = Res ->
                throw(Res)
        after Timeout ->
            %% ...which is killed on timeout, so the process that checked the
            %% connection out of the pool dies before it can check it back in
            exit(Task, shutdown),
            {error, timeout}
        end
    catch
        exit:{timeout, _} ->
            {error, timeout};
        exit:shutdown ->
            {error, timeout}
    end
end,
%% retry loop: the first successful result escapes through the throw above
try
    hd([begin
            R = HandleSingleRequest(),
            timer:sleep(RetryDelay),
            R
        end || _ <- lists:seq(0, Retries)])
catch
    throw:{result, Result} ->
        Result
end
```

Forgive me the typos/syntax errors, I haven't done Erlang in some time ;)
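In Elixir terms, the pattern above corresponds roughly to running the adapter call inside a task that gets killed on timeout. A hypothetical sketch (not Tesla's actual source; `TimeoutSketch` and the endpoint/pool names are made up) of why the process that owns the pool checkout dies before it can check the connection back in:

```elixir
defmodule TimeoutSketch do
  # The request runs in a task; the task process is the one that checks the
  # connection out of the hackney pool. If the deadline passes, the task is
  # killed, so its checkout is never returned by its owner.
  def run(fun, timeout_ms) do
    task = Task.async(fun)

    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> result
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end

# Usage sketch (endpoint and pool name are placeholders):
TimeoutSketch.run(
  fn ->
    :hackney.request(
      :post,
      "http://some_service:5060/endpoint",
      [{"content-type", "application/json"}],
      "{}",
      pool: "custom_pool"
    )
  end,
  1_000
)
```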
If it helps with the investigation, it seems this always happens once requests start to be queued; I haven't seen it occur without any queueing. To try to work around the problem on our side, I will make …
Another observation. On the node where the problem reproduced: if I increase the `max_connections` limit for the pool by one, nothing changes. When I make a request (successful, because I now have capacity in the pool), I see one of the queued elements get removed from the queue (but the process that was requesting is not alive anymore), and the state of the pool changes to this:

```
iex(app@localhost)20> :hackney_pool.get_stats("auth_client_connection_pool")
[
  name: "auth_client_connection_pool",
  max: 51,
  in_use_count: 50,
  free_count: 0,
  queue_count: 26
]
```

When I repeat the exercise I get another one out of the queue, but requests do not continue to get dequeued once connections are returned... 🤔
Thanks a lot for the feedback. I think I understand now what’s happening.
I will fix it tonight/this week-end :)
Could you elaborate on where the problem is? Decreasing …
@sircinek better to talk over the code :) That should finally land as a PR later today or tomorrow morning. Roughly speaking, spawning a process to check out the connection breaks the original design, which used the manager and expected the connection to be checked out synchronously by the requester. The new design puts back more synchronicity. While I am at it, I am also fixing SSL handling. As I said, I will post the code ASAP today/tomorrow morning (CET) :)
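To make that design point concrete, here is a hypothetical toy pool (invented names, not hackney's actual code): the pool monitors whichever process performs the checkout, so if an intermediate spawned process checks out the connection instead of the real requester, the monitor watches the wrong process and a dead requester never gets its connection reclaimed.

```elixir
defmodule ToyPool do
  use GenServer

  def start_link(sockets),
    do: GenServer.start_link(__MODULE__, sockets, name: __MODULE__)

  # Synchronous checkout: the *calling* process is the one the pool monitors,
  # so its death can be detected and the socket reclaimed.
  def checkout, do: GenServer.call(__MODULE__, :checkout)
  def checkin(socket), do: GenServer.call(__MODULE__, {:checkin, socket})

  @impl true
  def init(sockets), do: {:ok, %{free: sockets, in_use: %{}}}

  @impl true
  def handle_call(:checkout, {owner, _tag}, %{free: [socket | rest]} = state) do
    ref = Process.monitor(owner)
    {:reply, {:ok, socket}, %{state | free: rest, in_use: Map.put(state.in_use, ref, socket)}}
  end

  def handle_call(:checkout, _from, %{free: []} = state) do
    {:reply, {:error, :checkout_failure}, state}
  end

  def handle_call({:checkin, socket}, _from, state) do
    case Enum.find(state.in_use, fn {_ref, s} -> s == socket end) do
      {ref, ^socket} ->
        Process.demonitor(ref, [:flush])
        {:reply, :ok, %{state | free: [socket | state.free], in_use: Map.delete(state.in_use, ref)}}

      nil ->
        {:reply, {:error, :not_checked_out}, state}
    end
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # The owner died without checking the socket back in: reclaim it here.
    case Map.pop(state.in_use, ref) do
      {nil, _} -> {:noreply, state}
      {socket, in_use} -> {:noreply, %{state | free: [socket | state.free], in_use: in_use}}
    end
  end
end
```

With a short-lived helper process doing the checkout on the requester's behalf, the `Process.monitor/1` call above would watch that helper rather than the requester, which is roughly the kind of ownership mismatch described here.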
Wonderful. Thank you so much for looking into this 🙇
Hey @benoitc, any updates on this? :)
this is still in progress, I should be able to push a branch later today.
@benoitc maybe you need a hand with this?
Actually this will land at the end of the week. Got sidetracked by a refactoring. I wouldn't mind some review when it's out anyway :)
Short update: a new branch will land tomorrow. I've been sidetracked in between.
@benoitc how's it going? We are kind of stuck working around that issue, I guess we will downgrade to 1.15.2 in the meantime, or use another library.
Sorry, it is taking more time than expected. I have been testing a solution on the latest OTP/Erlang for a week... I will make it available by Friday.
One way to speed up its development is to allow me to dedicate full time to it, btw...
On Wed, May 5, 2021, Peter Melinat wrote:

Hello @benoitc,
Hello @sircinek,
thanks for pointing out that error and thanks for your work so far. We're sitting in the same boat here, so I was wondering if there is any way to help you out? Me and my team could volunteer to test it on our platform, if that helps you :)
Kind regards

I am slowly adding the tests. I have been sidetracked, and tests will be helpful. I will try to push it by Friday now.
@Aduril as I quickly told you online, I will describe in a separate process what the roadmap is and a way to help the project ensure we get a release more often. In the meantime, please try the following hot-fix, 592a007, which should fix this issue. Please let me know!
To all: I pushed the hot-fix 592a007, which should fix this issue. Please let me know!
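For anyone who wants to try that commit before a Hex release is cut, one option (a sketch; pin whichever ref you actually want to test) is a git dependency in `mix.exs`:

```elixir
defp deps do
  [
    # Pull hackney straight from the repository at the referenced commit.
    # `override: true` matters if hackney is already a transitive dependency
    # (e.g. via httpoison or a tesla adapter).
    {:hackney, github: "benoitc/hackney", ref: "592a007", override: true}
  ]
end
```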
The patch above seems to fix it according to my tests. Closing the issue then. Feel free to reopen it if needed.
Thanks a lot for this @benoitc! Are you planning a new release for this? |
Yes, it will land this week once the CI is fixed :)
Hello @benoitc! Is there going to be a release soon to solve this issue?
Yes, this is planned along with other announcements by Friday.