Memory used during parsing never reclaimed #850

Closed
baggepinnen opened this issue Jul 1, 2021 · 11 comments

@baggepinnen

baggepinnen commented Jul 1, 2021

I have long been trying to find the source of what I suspected was a memory leak somewhere in my data pipeline, and I think I have found a small MWE that reproduces my issue. If the following snippet is run multiple times, the memory use of the julia process increases steadily. If I call GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();GC.gc();, I get some of it back, but not all. If I then call CSV.read another, say, 10 times, the memory claimed by the julia process jumps up again, and when I trigger the GC again I get even less memory back; the julia process now holds on to more of it. I can continue this process until I run out of RAM.

using CSV, DataFrames
logfile = "my_1GB_file.csv"
@time CSV.read(
    logfile,
    DataFrame;
    header = 15,
    datarow = 26,
    drop = (i, name) ->
        startswith(string(name), "Name") || startswith(string(name), "SymbolName"),
    delim = '\t',
    footerskip = 1,
);
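
For completeness, a minimal sketch of the repetition described above (the drop function is omitted here since, as noted in the edit below, it only affects compilation time, not the memory behaviour):

for i in 1:10
    df = CSV.read(logfile, DataFrame; header = 15, datarow = 26, delim = '\t', footerskip = 1)
    df = nothing   # drop the only reference to the parsed data
end
GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc(); GC.gc()   # memory reported by the OS stays high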

Some concrete numbers:

With julia just started and CSV and DataFrames loaded, julia uses 167 MB; after the first read, the figure is 1.7 GB. Running GC multiple times brings this down to 1.2 GB.
Repeating the read 10 times brings the memory usage to 5.7 GB, and triggering GC brings it down to 2.1 GB. Why does it not go back down to 1.2 GB here?

The csv-file I'm using is about 1 GB (160 MB zipped); I'd be happy to share it if someone wants to reproduce this issue.
Edit: It's here https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing

I'm running

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

(@v1.6) pkg> st CSV
      Status `~/.julia/environments/v1.6/Project.toml`
  [336ed68f] CSV v0.8.5

Edit: I've noticed that each read says something like

1.540763 seconds (793.73 k allocations: 1.238 GiB, 2.47% gc time, 19.43% compilation time)

i.e., it always claims a positive compilation time. Is it perhaps the compiled code that eventually eats up memory?


Edit2: The compilation time appears to be due to my drop function. Removing this removes the compilation time, but does not change the memory issues.

@bkamins
Member

bkamins commented Jul 1, 2021

This is most likely due to threading. What happens if you use 1 thread? (You can pass a kwarg to force using only one thread.)
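
For reference, a sketch of what I mean; the exact keyword depends on the installed CSV.jl version (0.8.x uses threaded, 0.9+ uses ntasks), so check the docstring of CSV.read:

CSV.read(logfile, DataFrame; threaded = false)   # CSV.jl 0.8.x
CSV.read(logfile, DataFrame; ntasks = 1)         # CSV.jl 0.9+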

@baggepinnen
Author

I tried using a single thread by starting julia with -t 1, and this did not solve the issue. (The kwarg to turn off threading did not work for me, see #852.)

@quinnj
Member

quinnj commented Jul 6, 2021

@baggepinnen, I'd appreciate it if you could share the file with me so I can take a look. Could you also share how you're tracking memory usage? I've found it difficult to make meaningful progress on "memory issues" in the past because it was too difficult to get matching results across different OSes, versions, etc.

@baggepinnen
Author

The file is here
https://drive.google.com/file/d/1LQSKbDIYHb_N8NqnD40Xw-13V1uTCSOk/view?usp=sharing
I was previously looking manually at how much memory the Julia process consumed in the system monitor, but I could add a call to Sys.free_memory (or something like that) to get an exact number if that's helpful.
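
For example, something along these lines could report the current resident set size of the process on Linux (Sys.maxrss() only reports the peak, so it would not show memory going back down after GC); just a sketch that parses /proc/self/status:

# Current resident set size of the Julia process in kB (Linux-only).
function current_rss_kb()
    for line in eachline("/proc/self/status")
        startswith(line, "VmRSS:") && return parse(Int, split(line)[2])
    end
    return -1
end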

@quinnj quinnj added the bug label Aug 20, 2021
@quinnj
Member

quinnj commented Aug 20, 2021

As an update, I can reproduce the memory seeming not to be reclaimed. After spending an hour or two digging around, I can't find any way the original data would still be referenced, but the Julia GC seems to hold a reference somehow. I've discussed the matter with the core devs, and it sounds like it will require some deeper GC debugging to figure out where and why the data is being referenced and thus not reclaimed. I'll dig into this at some point, but I don't want it to hold up the next 0.9 release.

@baggepinnen
Author

baggepinnen commented Aug 20, 2021

I'm thrilled to hear that you got the attention of the core developers on this; it has bugged me for years. The problem is not specific to CSV.jl, but appears any time I read large data. Thanks!

@robsmith11

Is there a JuliaLang issue tracking this? I believe that I've had similar issues with high memory usage not being reclaimed.

@baggepinnen
Author

Could this be related to JuliaLang/julia#42566 (comment)?

@quinnj
Member

quinnj commented Oct 23, 2021

That's my suspicion, because we certainly involve lots of "small vectors of vectors" in the internal parsing work. If someone has access to a Windows machine, it seems like we could confirm that it's not an issue here. I think I have an old Windows machine somewhere in my basement I could try to dust off if no one else does.

@bkamins
Member

bkamins commented Oct 23, 2021

On Win10:

  1. Started Julia 1.6.2 single-threaded: 120 MB footprint
  2. Loaded DataFrames.jl 1.2.2 and CSV.jl 0.9.4: 250 MB footprint
  3. Ran GC many times: 222 MB footprint
  4. Loaded the file the OP shared: 1838 MB footprint
  5. Ran GC many times: 418 MB footprint
  6. Repeating steps 4 and 5 yields roughly the same values

and

  1. Started Julia 1.6.2 with 4 threads: 125 MB footprint
  2. Loaded DataFrames.jl 1.2.2 and CSV.jl 0.9.4: 256 MB footprint
  3. Ran GC many times: 230 MB footprint
  4. Loaded the file the OP shared: 1603 MB footprint
  5. Ran GC many times: 1409 MB footprint
  6. Loaded the file the OP shared again: 2400 MB footprint
  7. Ran GC many times: 1413 MB footprint
  8. Repeating steps 6 and 7 yields roughly the same values

@quinnj
Member

quinnj commented Nov 16, 2021

Thanks @bkamins; so it does indeed seem that this is the same glibc issue linked above.
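
For anyone who hits this in the meantime: a possible workaround on glibc-based Linux (just a sketch, not something CSV.jl does itself) is to ask the allocator to return freed heap pages to the OS after a full GC:

GC.gc(); GC.gc()                          # make sure Julia has actually freed its buffers
ccall(:malloc_trim, Cint, (Csize_t,), 0)  # glibc-only; returns 1 if memory was released to the OS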
