Memory leaking from version 7? #4087

Open · norrs opened this issue Aug 20, 2019 · 9 comments

Comments

@norrs

norrs commented Aug 20, 2019

Issue Summary

After upgrading from Redash 5 to Redash 7, we noticed that the Redash web server leaks memory linearly over time, which it did not do on Redash 5.

Steps to Reproduce

Deployed Redash 7 in Kubernetes and watched the memory metrics.

Technical details:

  • Redash Version: redash/redash:7.0.0.b18042
  • How did you install Redash: Helm

See helm/charts#5071 (comment) for my upgrade notes on the existing merge request for a public Redash Helm chart for Kubernetes.

Last hours:
[screenshot: 2019-08-20-09-29-27-window-select]

Last 30 days (you can see when we did the upgrade on August 6; the restarts are pods being restarted because Kubernetes evicted them for using more memory than requested):

[screenshot: 2019-08-20-10-23-56-window-select]

@norrs
Author

norrs commented Aug 20, 2019

Oh, I just noticed that f165168 is already committed to master.

I'll try the latest Docker image and redeploy our Redash 7 to see whether setting max requests and max request jitter has an effect. Even if it fixes the symptom, I would say Redash 7 still has a memory leak issue.

But this is a "decent" work-around for now; I'll post back on how it goes.
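For context, the work-around leans on gunicorn's built-in worker recycling. Here is a minimal sketch of the equivalent settings in a plain gunicorn.conf.py; these are standard gunicorn setting names, but the values and how the Redash Docker image actually wires them through its entrypoint are my assumptions, not something taken from f165168:

```python
# gunicorn.conf.py -- minimal sketch of the worker-recycling work-around.
# Standard gunicorn settings; the values below are illustrative only.

# Recycle a worker after it has served this many requests, so whatever
# memory it has accumulated is released when the process exits.
max_requests = 1000

# Randomize each worker's restart threshold by up to this many requests,
# so all workers don't restart at the same moment.
max_requests_jitter = 100
```

With this in place each worker process is periodically replaced, which caps the growth from a slow leak without fixing its cause.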

@norrs
Author

norrs commented Aug 20, 2019

Hm, https://hub.docker.com/r/redash/redash/tags doesn't seem to have any newer 7.x builds that contain f165168, as I understand it.

Is there anything special required for this upgrade, apart from maybe database migrations?
I see nothing in https://version.redash.io/ or https://github.com/getredash/redash/blob/master/CHANGELOG.md for Redash 8.

@norrs
Author

norrs commented Aug 20, 2019

The max requests and max request jitter feature comes from #4013.

@norrs
Author

norrs commented Aug 21, 2019

[screenshot: 2019-08-21-08-17-46-window-select]

This seems to work around the memory leak, at least.

@arikfr
Member

arikfr commented Sep 28, 2019

> Last 30 days (you can see when we did the upgrade on August 6 …)

Was it the only change that happened around this time?

Btw, for reference, Gunicorn can accept additional command line options via an env var:

Settings can be specified by using environment variable GUNICORN_CMD_ARGS. All available command line arguments can be used. For example, to specify the bind address and number of workers:

$ GUNICORN_CMD_ARGS="--bind=127.0.0.1 --workers=3" gunicorn app:app

So you didn't have to upgrade to benefit from the new options.

@norrs
Author

norrs commented Sep 30, 2019

@arikfr: It could be that I didn't need to upgrade to get GUNICORN_CMD_ARGS, but what I found at the time was f165168.

And having workers restart clearly indicates something is leaking somewhere; where, I don't know.

We are happy enough with the workers being restarted.

Regarding whether other things happened around this time: the only thing I can say is that
there were probably a few "start the migration over again" rounds because I made a silly mistake or two while upgrading. But the end result is what matters here, and it clearly shows pods increasing in memory usage until they were killed due to out-of-memory.

Graphs are directly from our production environment.

ATM we are on some alpha build of Redash 8, I think, which had a Docker image with f165168 in it, and it has been working perfectly. Many thanks for a great product!

@arikfr
Member

arikfr commented Nov 20, 2019

> And having workers restart clearly indicates something is leaking somewhere.

The reason we added auto restarts is that Python doesn't release memory. Query results are loaded into memory before being served, so serving a large query result can make memory usage go up.

@norrs
Author

norrs commented Nov 27, 2019

@arikfr That doesn't explain the memory usage in the graphs. Earlier versions also had to load result sets into memory before serving them, and we didn't see memory usage grow over the application's lifetime in Redash 5. So yes, Python doesn't release the memory, but I believe that is because its garbage collector cannot do its job.

I suspect something is holding references, so refcounts never drop and the garbage collector never gets to kick in and do its job. Circular references come to mind (https://stackoverflow.com/questions/2428301/should-i-worry-about-circular-references-in-python).

The best way to hunt down the cause would be to bisect between the two major versions, but you would need a reproducible test first.

sys.getrefcount() can maybe be a tool in that search.
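As a rough illustration of that kind of hunt (generic Python, not Redash code), you could snapshot live object counts with the gc module around the suspect code path and watch which types only ever grow; sys.getrefcount() then tells you how many references keep a given object alive:

```python
import gc
import sys
from collections import Counter

def object_counts():
    """Count live objects by type, after letting the GC free what it can."""
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())

before = object_counts()
# ... exercise the suspect code path here, e.g. serve a large query result ...
after = object_counts()

# Types whose live-object count grew the most across the exercised path.
for name, growth in (after - before).most_common(10):
    print(name, growth)

# For a single suspect object, sys.getrefcount() reports how many references
# keep it alive (the count includes the temporary reference made by the call).
suspect = object()
print(sys.getrefcount(suspect))
```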

Whoever wants to devote the time to finding the exact cause is a superhero! This work is both tricky and cumbersome.

For others finding this issue: restarting the gunicorn workers is a work-around that works just fine.

(Again, thanks for a superb product.)

@arikfr
Member

arikfr commented Nov 27, 2019

> That doesn't explain the memory usage in the graphs.

Sure. I just mentioned that to give context for why we implemented the restart...

Thank you for the tips on how to hunt this down. We will probably hold off until the Python 3 migration and some other ongoing work are done, but we would definitely like to address it.
