Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent NetworkError #378

Open
phillipadsmith opened this issue Mar 24, 2020 · 32 comments
Open

Intermittent NetworkError #378

phillipadsmith opened this issue Mar 24, 2020 · 32 comments
Labels
bug? V3 v2019-10-10 Client

Comments

@phillipadsmith
Copy link

phillipadsmith commented Mar 24, 2020

Warning: Github issues are not an official support channel for Recurly. If you have an urgent request we suggest you contact us through an official channel: support@recurly.com or https://recurly.zendesk.com

Describe the bug

I've built a simple integration using this library and I am seeing an intermittent recurly.NetworkError exception with the message "Remote end closed connection without response"

This appears to be happening most often for this API call: client.list_accounts()

To Reproduce

You can see the call in context here:
https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L56-L79

Expected behavior

For it to work consistently! :)

Your Environment

  • Which version of this library are you using?
    Recurly Python client 3.0

  • Which version of the language are you using?
    Python 3.8.2

@bhelx bhelx self-assigned this Mar 24, 2020
@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

Thanks for the issue @phillipadsmith . We are currently working on making the clients more resilient and pushed out some changes in ruby and node this week to address this problem. The client should be able to gracefully handle this by retrying on "open" timeouts. Apologies for the problems this is causing. We can prioritize fixing python next.

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

Any other details you are seeing the response? I want to make 100% sure we're gonna solve the right problem.

@phillipadsmith
Copy link
Author

phillipadsmith commented Mar 24, 2020

Any other details you are seeing the response? I want to make 100% sure we're gonna solve the right problem.

Hi @bhelx -- thanks for the quick reply. That's all of the information I'm getting back from the exception. If you have a suggestion for how I can get a more verbose error message, I'd be happy to try it out.

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

Sure, let me know if you have any more info. We'll keep moving forward with the expected fix in the mean time.

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

If you're just trying to get the first / only account that matches, this will help you when it lands: #367

@phillipadsmith
Copy link
Author

@phillipadsmith what's the purpose of limit=1 here? https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L57

I was hoping it would shorten the response time to only request one record. And, I'm only searching for one record (and only one should ever be returned).

If you're just trying to get the first / only account that matches, this will help you when it lands: #367

Ah-ha. Yes, that's what I'm trying to do. I'll keep an eye out for that release.

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

Great, just checking. That's an appropriate way to use it. There has been a little confusion over what limit means so i wanted to be sure.

@phillipadsmith
Copy link
Author

@bhelx What was the fix implemented in the Ruby and Node modules? Is there a similar tactic I could implement in my version of the module, or in the application itself, e.g., catching the exception and trying the request again some limited number of times?

@phillipadsmith
Copy link
Author

@bhelx And what is the root cause of the timeouts? Is it the Recurly API not responding? I've never experienced this timeout when using the API endpoints directly with Postman or CURL or similar.

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

@phillipadsmith I can't say what the root cause is just yet. At scale though, you're gonna have some random TCP-level blips. Most of the time, it's something that can be recovered from. Some http libraries handle this for you under the hood and postman may be doing that as well. We are trying our best to keep these clients dependency free for security and maintainability reasons so we are implementing some of this ourselves.
Here is the ruby change that went out: recurly/recurly-client-ruby#568
There isn't a clean way to do this with what we have in master, but keep in mind that you can safely retry GET requests as many times as you want. So you could catch a network error, sleep a few milliseconds, and retry some number of times.

@bhelx
Copy link
Contributor

bhelx commented Mar 24, 2020

Again sorry about this, I'll be working on implementing and testing this the next couple days.

@phillipadsmith
Copy link
Author

Again sorry about this, I'll be working on implementing and testing this the next couple days.

@bhelx No need to apologize. I appreciate the quick replies and it's helpful just to hear that I'm not crazy and that this is a known issue. :)

@phillipadsmith
Copy link
Author

phillipadsmith commented Mar 25, 2020

Couple of quick questions, @bhelx:

At scale though, you're gonna have some random TCP-level blips. Most of the time, it's something that can be recovered from.

True enough, but I've only got this on a test server at the moment, so we're not running any traffic against this yet -- it's just me and one other other person testing, and the exception happens quite often (almost too often, IMHO, to be a blip -- almost like there's another issue at play).

Some http libraries handle this for you under the hood and postman may be doing that as well.

Yes, interestingly, I've never had this happen when running the app from Flask, and I've never seen it in Postman.

And it never happens immediately after I start the app with Gunicorn, only after some time has passed, perhaps hours. Then it happens very frequently and very consistently until I restart the application.

Is there anything about running this library in a pre-fork WSGI server that might be introducing this network error?

@bhelx
Copy link
Contributor

bhelx commented Mar 25, 2020

it's just me and one other other person testing, and the exception happens quite often (almost too often, IMHO, to be a blip -- almost like there's another issue at play).

This is something I personally see more on the order of 1 out of ~10,000 requests. If you are able to reproduce, could you try to print out some more information from the underlying socket error?
From the message you pasted, it sounds like a RemoteDisconnected error but there may be some more info in the message we can use to diagnose.

Is there anything about running this library in a pre-fork WSGI server that might be introducing this network error?

So your webserver is multiprocess and not multithreading right? It is possible I suppose. I'm not sure how python would handle forking an instance an Client with an already open connection: https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L44

Since forking is copy-on-write, my gut would tell me that you want to create the connection after you fork. You could test this theory by trying to create the client inside of the accounts_search method. This wouldn't be the recommended way for production because you want to re-use the connection where possible, but if the problem goes away you'll have more evidence that forking is causing the issue.

@bhelx
Copy link
Contributor

bhelx commented Mar 25, 2020

@phillipadsmith i'm not sure if flask has a special way to deal with this, but also consider using a "post-fork" hook of some kind: https://docs.pylonsproject.org/projects/pyramid_cookbook/en/latest/deployment/forked_threaded_servers.html?highlight=sqlalchemy

Perhaps flask has a way of letting you run some code after a fork happens.

@phillipadsmith
Copy link
Author

Thanks for those thoughts, @bhelx, appreciated.

This is something I personally see more on the order of 1 out of ~10,000 requests.

Oh, yes, definitely able to get it 8/10 times right now.

could you try to print out some more information from the underlying socket error?

Is there an easy way to get more verbose output from the exception? My assumption was that it was coming from the Recurly Python client.

Many thanks in advance!

@bhelx
Copy link
Contributor

bhelx commented Mar 25, 2020

Oh, yes, definitely able to get it 8/10 times right now.

Oh, yeah that suggests a different problem entirely. Although I think the fix I'm working on might help. But I think we need to address the root problem.

Is there an easy way to get more verbose output from the exception? My assumption was that it was coming from the Recurly Python client.

A stack trace would be helpful. Let it explode. When i evoke one it looks like this:

Traceback (most recent call last):
  File "./clients/python/recurly/base_client.py", line 42, in _make_request
    http_resp = self.__conn.request(method, path, body, headers=headers).getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/usr/lib/python3.7/http/client.py", line 1399, in connect
    super().connect()
  File "/usr/lib/python3.7/http/client.py", line 938, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.7/socket.py", line 707, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/v2019-10-10/python/create_account.py", line 33, in <module>
    account = client.create_account(account_create)
  File "./clients/python/recurly/client.py", line 133, in create_account
    return self._make_request("POST", path, body, None)
  File "./clients/python/recurly/base_client.py", line 71, in _make_request
    raise NetworkError(e)
recurly.NetworkError: [Errno -2] Name or service not known

@phillipadsmith
Copy link
Author

👍Let me see if I can get a stack trace.

@phillipadsmith
Copy link
Author

I've imported traceback and I'm trying to get the error again 🤞

@phillipadsmith
Copy link
Author

@bhelx Hey, I'm not getting any output from traceback when an exception occurs: https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py#L79-L80

Any idea what I've got wrong here?

@bhelx
Copy link
Contributor

bhelx commented Mar 25, 2020

I'm not really familiar with it's usage, what happens if you just remove the whole rescue block for network exceptions? that's what i did in my script. python dumps the stack trace for you.

@phillipadsmith
Copy link
Author

🤦‍♂ Didn't even think of that.

@bhelx
Copy link
Contributor

bhelx commented Mar 25, 2020

@phillipadsmith did you look into my earlier comment about creating the client after fork? I'm reading some people recommend you use the before_first_request decorator to make sure that a database or http client connection is created in each fork:

I'm not sure how to reproduce your problem so I can't test this assumption on your app.

@phillipadsmith
Copy link
Author

did you look into my earlier comment about creating the client after fork? I'm reading some people recommend you use the before_first_request decorator to make sure that a database or http client connection is created in each fork:

I will read up on those and give them a try also. I'm still trying to get a useful stack trace!

Thanks @bhelx, appreciated.

@phillipadsmith
Copy link
Author

did you look into my earlier comment about creating the client after fork? I'm reading some people recommend you use the before_first_request decorator to make sure that a database or http client connection is created in each fork:

Thanks for you help, @bhelx.

I've made an update to the app so that Gunicorn will receive a post_fork hook that creates a Recurly client.

I haven't tested extensively, but I'm currently not able to trigger the error. I'll keep you updated.

@phillipadsmith
Copy link
Author

Hey @bhelx!

Quick update: So the post_fork didn't fix the issue, I continued to get the network error Remote end closed connection without response error messages.

I'm now trying the approach of creating the client inside the account_search method.

I'll read up on the before_first_request hook also.

🤞

@phillipadsmith
Copy link
Author

Happy Saturday @bhelx,

Just a quick update on this:

  • As crazy as it seemed, I tested your idea of [creating the client inside of each method].(https://github.com/TheTyee/manage-account.thetyee.ca/blob/master/app.py)
  • I then ran automated tests overnight, testing the complete flow that triggers each method.
  • After 12 hours of testing, ~600 tests, no errors have been reported. So, it seems like that resolved the intermitted network error that we were experiencing.

As you mention, this is not an ideal way to have something set-up for a production application, and yet this is a very low use application so I believe we're going to try it out in production.

I was never able to trigger a useful stack trace in part because Flask doesn't surface its errors up to Gunicorn, and the network errors only happened when running the application using Gunicorn, so it was a bit of a rabbit hole trying to figure out how to work around that while also hoping for the error to surface again.

All that said, I hope it's useful to know that there appears to be some kind of a intermitted network error being thrown by the Recurly client in this rather common configuration of Flask + Gunicorn.

Let me know if you have any questions. Know that we've got a work around that I can deploy for this project, I'm happy to go back and re-test things in my development environment.

Many thanks for your help & have a great weekend,
Phillip.

@bhelx
Copy link
Contributor

bhelx commented Apr 4, 2020

@phillipadsmith I appreciate you digging so deeply into it. It does sound like there is some kind of issue with Gunicorn and the open connection. I'd like to keep this issue open until we fix it. When my team gets a free moment, we'll try to make a minimal working example and see if we can reproduce. If it does turn out that there is a change that people will need to make in their Gunicorn apps, then we'll document it.

@joannasese joannasese added the V3 v2019-10-10 Client label Feb 11, 2021
@declaresub
Copy link

I encountered this same problem. In my case, the problem was using a single recurly.Client instance for multiple connections. Probably the recurly requests were stomping on each other. Using a recurly.Client instance per connection resolved the "Remote end closed connection without response" problems.

@bhelx bhelx removed their assignment Jun 11, 2021
@ridersofrohan
Copy link

@bhelx Can you fix this using the requests library and just creating a session?

@bhelx
Copy link
Contributor

bhelx commented Aug 21, 2022

@ridersofrohan I'm no longer at Recurly. I think the best person to contact would be @douglasmiller .

Regarding this code though. It doesn't use requests so you can't just set a session anywhere. It uses the native http.client module. I suspect a problem here is that we are creating one connection per client:

self.__conn = http.client.HTTPSConnection(

It might make more sense to use a connection pool of some kind in the base client rather than a single connection like we do in the other clients.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug? V3 v2019-10-10 Client
Projects
None yet
Development

No branches or pull requests

5 participants