Memory used during parsing never reclaimed #850

Closed
baggepinnen opened this issue Jul 1, 2021 · 11 comments

@baggepinnen

baggepinnen commented Jul 1, 2021

I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces my issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();, I get some of it back, but not all. If I then call CSV.read another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC again I get even less memory back; the julia process now holds on to more of it. I can continue this process until I run out of RAM.

using CSV, DataFrames
logfile = "my_1GB_file.csv"
@time CSV.read(
    logfile,
    DataFrame;
    header = 15,
    datarow = 26,
    drop = (i, name) ->
        startswith(string(name), "Name") || startswith(string(name), "SymbolName"),
    delim = '\t',
    footerskip = 1,
);
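
For completeness, a minimal sketch of the repetition described above (the drop function is omitted here since, as noted in the edit below, it only affects compilation time, not the memory behaviour):

for i in 1:10
    df = CSV.read(logfile, DataFrame; header = 15, datarow = 26, delim = '\t', footerskip = 1)
    df = nothing   # drop the only reference to the parsed data
end
GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc()   # memory reported by the OS stays high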

Some concrete numbers:

With julia just started and CSV and DataFrames loaded, julia uses 167 MB; after the first read, the figure is 1.7 GB. Running GC multiple times brings this down to 1.2 GB.
Repeating the read 10 times brings the memory usage to 5.7 GB, and triggering GC brings it down to 2.1 GB. Why does it not go back down to 1.2 GB here?

The csv-file I'm using is about 1 GB (160 MB zipped); I'd be happy to share it if someone wants to reproduce this issue.
Edit: It's here https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing

I'm running

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

(@v1.6) pkg> st CSV
      Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.5

Edit: I've noticed that each read says something like

1.540763 seconds (793.73 k allocations: 1.238 GiB, 2.47% gc time, 19.43% compilation time)

i.e., it always claims a positive compilation time. Is it perhaps the compiled code that eventually eats up memory?


Edit2: The compilation time appears to be due to my drop function. Removing this removes the compilation time, but does not change the memory issues.

@bkamins
Member

bkamins commented Jul 1, 2021

This is most likely due to threading. What happens if you use 1 thread? (You can pass a kwarg to force using only one thread.)
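
For reference, a sketch of what I mean; the exact keyword depends on the installed CSV.jl version (0.8.x uses threaded, 0.9+ uses ntasks), so check the docstring of CSV.read:

CSV.read(logfile, DataFrame; threaded = false)   # CSV.jl 0.8.x
CSV.read(logfile, DataFrame; ntasks = 1)         # CSV.jl 0.9+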

@baggepinnen
Author

I tried using a single thread by starting julia with -t 1, and this did not solve the issue. (The kwarg to turn off threading did not work for me, see #852.)

@quinnj
Member

quinnj commented Jul 6, 2021

@baggepinnen, I'd appreciate it if you could share the file with me so I can take a look. Could you also share how you're tracking memory usage? I've found it difficult to make meaningful progress on "memory issues" in the past because it was too difficult to get matching results across different OSes, versions, etc.

@baggepinnen
Author

The file is here
https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing
I was previously looking manually at how much memory the Julia process consumed in the system monitor, but I could add a call to Sys.free_memory (or something like that) to get an exact number if that's helpful.
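
For example, something along these lines could report the current resident set size of the process on Linux (Sys.maxrss() only reports the peak, so it would not show memory going back down after GC); just a sketch that parses /proc/self/status:

# Current resident set size of the Julia process in kB (Linux-only).
function current_rss_kb()
    for line in eachline("/proc/self/status")
        startswith(line, "VmRSS:") && return parse(Int, split(line)[2])
    end
    return -1
end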

@quinnj quinnj added the bug label Aug 20, 2021
@quinnj
Member

quinnj commented Aug 20, 2021

As an update, I can reproduce the memory seeming not to be reclaimed. After spending an hour or two digging around, I can't find any way the original data would still be referenced, but the Julia GC seems to hold a reference somehow. I've discussed the matter with the core devs, and it sounds like it will require some deeper GC debugging to figure out where and why the data is being referenced and thus not reclaimed. I'll dig into this at some point, but I don't want it to hold up the next 0.9 release.

@baggepinnen
Author

baggepinnen commented Aug 20, 2021

I'm thrilled to hear that you got the attention of the core developers on this; it has bugged me for years. The problem is not specific to CSV.jl, but appears any time I read large data. Thanks!

@robsmith11

Is there a JuliaLang issue tracking this? I believe that I've had similar issues with high memory usage not being reclaimed.

@baggepinnen
Author

Could this be related to JuliaLang/julia#42566 (comment)?

@quinnj
Member

quinnj commented Oct 23, 2021

That's my suspicion, because we certainly involve lots of "small vectors of vectors" in the internal parsing work. If someone has access to a Windows machine, it seems like we could confirm that it's not an issue here. I think I have an old Windows machine somewhere in my basement I could try to dust off if no one else does.

@bkamins
Member

bkamins commented Oct 23, 2021

On Win10:

  1. Started Julia 1.6.2 single-threaded: 120 MB footprint
  2. Loaded DataFrames.jl 1.2.2 and CSV.jl 0.9.4: 250 MB footprint
  3. Ran GC many times: 222 MB footprint
  4. Loaded the file the OP shared: 1838 MB footprint
  5. Ran GC many times: 418 MB footprint
  6. Repeating steps 4 and 5 yields roughly the same values

and

  1. Started Julia 1.6.2 with 4 threads: 125 MB footprint
  2. Loaded DataFrames.jl 1.2.2 and CSV.jl 0.9.4: 256 MB footprint
  3. Ran GC many times: 230 MB footprint
  4. Loaded the file the OP shared: 1603 MB footprint
  5. Ran GC many times: 1409 MB footprint
  6. Loaded the file the OP shared again: 2400 MB footprint
  7. Ran GC many times: 1413 MB footprint
  8. Repeating steps 6 and 7 yields roughly the same values

@quinnj
Member

quinnj commented Nov 16, 2021

Thanks @bkamins; so it does indeed seem that this is the same glibc issue linked above.
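
For anyone who hits this in the meantime: a possible workaround on glibc-based Linux (just a sketch, not something CSV.jl does itself) is to ask the allocator to return freed heap pages to the OS after a full GC:

GC.gc(); GC.gc()                          # make sure Julia has actually freed its buffers
ccall(:malloc_trim, Cint, (Csize_t,), 0)  # glibc-only; returns 1 if memory was released to the OS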
