Memory used during parsing never reclaimed #850
Comments
This is most likely due to threading. What happens if you use 1 thread? (You can pass a kwarg to force using only one thread.)
I tried using a single thread by starting julia with
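For context, a minimal sketch of the two usual ways to force single-threaded parsing; the exact keyword depends on the CSV.jl version (older releases used `threaded=false`, newer ones `ntasks`), so treat the names as illustrative.

```julia
# Sketch only: restrict CSV.jl parsing to a single thread/task.
# Keyword names depend on the CSV.jl version; `ntasks` is used here as an example.

using CSV, DataFrames

# Option 1: start Julia itself with a single thread from the shell:
#   julia -t 1          (or set JULIA_NUM_THREADS=1 before starting)

# Option 2: ask CSV.jl for a single parsing task regardless of thread count:
df = CSV.read("large_file.csv", DataFrame; ntasks = 1)
```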
@baggepinnen, I'd appreciate it if you could share the file with me and I can take a look. Could you also share how you're tracking memory usage? I've found it difficult to make meaningful progress on "memory issues" in the past because it was too difficult to see matching results on different OS/versions/etc.
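For comparison, a minimal sketch of one way such numbers can be gathered from inside Julia; the thread does not say which method was actually used, so this is only an illustration. `Sys.maxrss` reports the peak resident set size of the process and `Base.gc_live_bytes` the bytes the GC currently considers live.

```julia
# Sketch only: report process and GC memory at a given point.
function report_memory(label::AbstractString)
    GC.gc(); GC.gc()                           # encourage a full collection first
    rss_gb  = Sys.maxrss() / 1024^3            # peak resident set size (never shrinks)
    live_gb = Base.gc_live_bytes() / 1024^3    # bytes the GC currently considers live
    println(label, ": maxrss ≈ ", round(rss_gb; digits=2), " GB, ",
            "gc live ≈ ", round(live_gb; digits=2), " GB")
end

report_memory("before read")
```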
The file is here.
As an update, I can reproduce the memory seeming to not be reclaimed. After spending an hour or two digging around, I can't find any way the original data would still be referenced, but the Julia GC seems to have a reference somehow. I've discussed the matter with core devs and it sounds like it will require doing some deeper GC debugging to try and figure out where/why the data is being referenced and thus not reclaimed. I'll dig into this at some point, but I don't want it to hold up the next 0.9 release.
I'm thrilled to hear that you got the attention of the core developers on this; it has bugged me for years. The problem is not specific to CSV.jl, but appears any time I read large data, really. Thanks!
Is there a JuliaLang issue tracking this? I believe that I've had similar issues with high memory usage not being reclaimed.
Could this be related to JuliaLang/julia#42566 (comment)?
That's my suspicion, because we certainly involve lots of "small vectors of vectors" in the internal parsing work. If someone has access to a Windows machine, it seems like we could confirm that it's not an issue here. I think I have an old Windows machine somewhere in my basement I could try and dust off if no one else does.
On Win10:
and
Thanks @bkamins; so it does seem that this is indeed the same glibc issue linked above.
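For anyone hitting the same behaviour on Linux, a minimal sketch of the workaround commonly mentioned around the linked glibc issue: after a full collection, ask glibc to hand freed heap pages back to the OS. This is an illustration of that idea rather than something proposed in this thread, and it only applies to glibc-based Linux systems.

```julia
# Sketch: glibc-only workaround for freed heap pages not being returned to the OS.
GC.gc(); GC.gc()                        # free unreachable Julia objects first
ccall(:malloc_trim, Cint, (Cint,), 0)   # glibc malloc_trim(0): release free pages to the OS
```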
I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces my issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call
`GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();`
I get some of it back, but not all. If I continue to call the `CSV.read` another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC again, I get even less memory back and the julia process now holds on to more of it. I can continue this process until I run out of RAM.

Some concrete numbers:
With julia just started and CSV and DataFrames loaded, julia uses 167 MB; after the first read, the figure is 1.7 GB. Running GC multiple times brings this to 1.2 GB.
Repeating the reading 10 times brings the memory usage to 5.7 GB, and triggering GC brings it down to 2.1 GB. Why does it not go back down to 1.2 GB here?
The csv-file I'm using is about 1 GB (160 MB zipped); I'd be happy to share it if someone wants to reproduce this issue.
Edit: It's here https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing
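The snippet itself is not included in this excerpt; below is a minimal sketch of the kind of repeated read being described, assuming the linked file has been saved locally under the hypothetical name data.csv.

```julia
# Sketch only: repeatedly read a ~1 GB CSV file and watch process memory grow.
using CSV, DataFrames

for _ in 1:10
    df = CSV.read("data.csv", DataFrame)   # hypothetical local copy of the linked file
    df = nothing                           # drop the only reference to the parsed data
end

# Forced collections, as described above, reclaim only part of the memory:
for _ in 1:6
    GC.gc()
end
```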
I'm running
Edit: I've noticed that each read says something like
i.e., it always claims a positive compilation time. Is it perhaps the compiled code that eventually eats up memory?
Edit 2: The compilation time appears to be due to my `drop` function. Removing this removes the compilation time, but does not change the memory issues.
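The actual `drop` function is not shown in the issue. Assuming it was passed to the `drop` keyword of `CSV.read`, a hypothetical illustration of why it would report compilation time on every run: an anonymous function is a new closure type each time its definition is evaluated, so the specialized code is recompiled.

```julia
# Hypothetical illustration; the poster's real `drop` function is not shown.
using CSV, DataFrames

# Re-evaluating an inline anonymous function creates a fresh closure type,
# so each run of this line reports some compilation time:
df = CSV.read("data.csv", DataFrame; drop = (i, name) -> startswith(String(name), "tmp"))

# Defining the predicate once avoids the repeated compilation (though, as the
# edit above notes, it does not change the memory behaviour):
const drop_tmp = (i, name) -> startswith(String(name), "tmp")
df = CSV.read("data.csv", DataFrame; drop = drop_tmp)
```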