Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set the Connection close header #1154

Merged
merged 1 commit into from
May 9, 2018
Merged

set the Connection close header #1154

merged 1 commit into from
May 9, 2018

Conversation

boumenot
Copy link
Member

@boumenot boumenot commented May 8, 2018

There is a slow-ish memory leak in the agent's HTTP request library. The leak has been confirmed on OpenSuSE and Debian thus far. I have not been able to reproduce it on Ubuntu, RHEL, or CentOS.

I created a branch with tracemalloc instrumentation to help track down the issue. The leak appeared with the following file and line context.

Top 50 lines
#1: python3.5/socket.py:647: 13609.4 KiB
    self._sock = None

This was just useful enough to identify that the issue had to do with sockets. (tracemalloc supports grabbing a larger stacktrace around the memory allocation source, but I was not able to get that additional context. I have not debugged why.) The couple of bare socket calls that the agent makes are infrequent. The agent does make a lot of HTTP requests, so I started with the HTTP library.

The first thing I noticed is that the agent never explicitly cleans up an HTTP connection when it is done with it. Note that the following method never calls resp.connection.close().

    def put_block_blob(self, url, data):
        logger.verbose("Put block blob")
        headers = self.get_block_blob_headers(len(data))
        resp = self.client.call_storage_service(restutil.http_put, url, data, headers)
        if resp.status != httpclient.CREATED:
            raise UploadError(
                "Failed to upload block blob: {0}".format(resp.status))

I thought this was fine. I would expect Python to clean up once all references were removed, but this does not appear to be true at least in the case of Debian/OpenSuSE. I could fix the code to explicitly close the connection, but there are many places and branches where the agent makes HTTP requests. The HTTP library returns the HTTPResponse and not the connection, which is generally fine because you can usually reach the connection via the response. Another fix, that appears to be working, is to explicitly set the HTTP Connection header to close.

In my travels, I found that the Python requests library also had this issue. Their fix was to explicitly close the connection after use. I think that we should make that fix too, but that is a bigger and more invasive change. I would prefer to ship this fix as a hotfix, and leave the bigger fix for another release.

Copy link
Member

@hglkrijger hglkrijger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, have you verifies this reduces the leak?

@jasonzio
Copy link
Member

jasonzio commented May 8, 2018

I think there's a close() missing in restutil.http_request (restutil.py line 324 right before the final raise HttpError(...) at the end of the function). If a retryable HTTP request hits the maximum number of attempts, it looks to me like the connection leaks.

@boumenot
Copy link
Member Author

boumenot commented May 8, 2018

There are close() statements missing all over the code. My leaks were not caused by HTTP errors, they were caused by normal operations. I think we should go back and fix all of that code, but that's a bigger change than I want to make for a hotfix.

Copy link
Member

@jasonzio jasonzio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes things substantially better, does not make anything worse, and is therefore either beneficial or benign; I'm good with that for now. Longer-term, as you point out, we need to change things to ensure close gets called every dang time.

@boumenot
Copy link
Member Author

boumenot commented May 9, 2018

I edit my original comment and removed the Closes keyword. This change appears to address the memory leak, but I am still observing a slow upward trend after 12 hours. It is unclear if it is reaching a steady state, or memory has been lost and outside of tracemalloc's view.

The real fix is #1156, but that change is bigger than I want for a hotfix.
This PR addresses #1092 to an extent.

@boumenot boumenot merged commit 423dc18 into master May 9, 2018
@boumenot boumenot deleted the pr-memory-leak branch May 9, 2018 19:10
boumenot added a commit that referenced this pull request May 9, 2018
set the Connection close header
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants