-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Client doesn't recover from remote connection resets #2129
Fix Client doesn't recover from remote connection resets #2129
Conversation
…om-remote-connection-resets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way of trying to reproduce it is to have a long running pipeline in a remote orchestrator
Yeah, @AlexejPenner gave me that hint, but I was not lucky in catching this still. |
self._session.adapters["http://"].max_retries = 5 | ||
self._session.adapters["https://"].max_retries = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is ok ? From what I can tell based on the Adapter constructor, this field needs to be set to a urllib.util.retry.Retry
object instead of an integer:
def __init__(
self,
pool_connections=DEFAULT_POOLSIZE,
pool_maxsize=DEFAULT_POOLSIZE,
max_retries=DEFAULT_RETRIES,
pool_block=DEFAULT_POOLBLOCK,
):
if max_retries == DEFAULT_RETRIES:
self.max_retries = Retry(0, read=False)
else:
self.max_retries = Retry.from_int(max_retries)
The second question is what exactly is being retried ? Does an HTTP error like 404 Not Found or 403 Forbidden also trigger the client to retry the request ? All HTTP errors are treated separately by ZenML code, and therefore this retry mechanism should only be triggered by TCP errors, such as a connection being closed unexpectedly or timing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented something similar here. The Retry
object also allows you to specify in which cases it retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it works also with int
, but then sets it as total
. I updated it to retry only on connect
issues.
E2E template updates in |
…om-remote-connection-resets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha dont know why im tagged but ill approve it as it looks good to me
Cause you voluntarily commented before 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright !
…om-remote-connection-resets
E2E template updates in |
…om-remote-connection-resets
NLP template updates in |
Describe changes
I tried to fix
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')
error.This bug is hardly reproducible, so I didn't come up with any test case to cover it. I would appreciate some community testing before merging this.
Pre-requisites
Please ensure you have done the following:
develop
and the open PR is targetingdevelop
. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.Types of changes