Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Client doesn't recover from remote connection resets #2129

Conversation

avishniakov
Copy link
Contributor

@avishniakov avishniakov commented Dec 11, 2023

Describe changes

I tried to fix
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response') error.

This bug is hardly reproducible, so I didn't come up with any test case to cover it. I would appreciate some community testing before merging this.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • If I have added an integration, I have updated the integrations table and the corresponding website section.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@github-actions github-actions bot added internal To filter out internal PRs and issues bug Something isn't working labels Dec 11, 2023
Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way of trying to reproduce it is to have a long running pipeline in a remote orchestrator

@avishniakov
Copy link
Contributor Author

One way of trying to reproduce it is to have a long running pipeline in a remote orchestrator

Yeah, @AlexejPenner gave me that hint, but I was not lucky in catching this still.

Comment on lines 3186 to 3187
self._session.adapters["http://"].max_retries = 5
self._session.adapters["https://"].max_retries = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is ok ? From what I can tell based on the Adapter constructor, this field needs to be set to a urllib.util.retry.Retry object instead of an integer:

    def __init__(
        self,
        pool_connections=DEFAULT_POOLSIZE,
        pool_maxsize=DEFAULT_POOLSIZE,
        max_retries=DEFAULT_RETRIES,
        pool_block=DEFAULT_POOLBLOCK,
    ):
        if max_retries == DEFAULT_RETRIES:
            self.max_retries = Retry(0, read=False)
        else:
            self.max_retries = Retry.from_int(max_retries)

The second question is what exactly is being retried ? Does an HTTP error like 404 Not Found or 403 Forbidden also trigger the client to retry the request ? All HTTP errors are treated separately by ZenML code, and therefore this retry mechanism should only be triggered by TCP errors, such as a connection being closed unexpectedly or timing out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented something similar here. The Retry object also allows you to specify in which cases it retries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it works also with int, but then sets it as total. I updated it to retry only on connect issues.

Copy link
Contributor

E2E template updates in examples/e2e have been pushed.

Copy link
Contributor

@htahir1 htahir1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha dont know why im tagged but ill approve it as it looks good to me

@avishniakov
Copy link
Contributor Author

Haha dont know why im tagged but ill approve it as it looks good to me

Cause you voluntarily commented before 🙂

Copy link
Contributor

@stefannica stefannica left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright !

Copy link
Contributor

E2E template updates in examples/e2e have been pushed.

Copy link
Contributor

NLP template updates in examples/nlp-case have been pushed.

@avishniakov avishniakov merged commit 3cb30fb into develop Dec 16, 2023
4 checks passed
@avishniakov avishniakov deleted the bugfix/OSS-2662-client-doesn-t-recover-from-remote-connection-resets branch December 16, 2023 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working internal To filter out internal PRs and issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants