Retaining chunks while reading from fs stream appears to leak memory rapidly #21967

Closed · ChALkeR opened this issue on Jul 25, 2018 · 12 comments
Labels: buffer, fs, memory

ChALkeR (Member) commented on Jul 25, 2018

  • Version: v10.7.0, v8.11.3
  • Platform: Linux yoga 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC 2018 x86_64 GNU/Linux
  • Subsystem: fs, buffer

While trying to resolve the Gzemnid memory problems at nodejs/Gzemnid#18, I eventually reduced them to the following testcase. It seems to me that it's not node-lz4's fault; something is wrong on the Node.js side.

Testcase:

'use strict';

const fs = require('fs');

fs.writeFileSync('test-small.bin', Buffer.alloc(100));
fs.writeFileSync('test-large.bin', Buffer.alloc(100000));

const maxLength = 4 * 1024 * 1024; // 4 MiB
let built = 0, buf = [], length = 0;

function tick() {
  built++;
  if (built % 1000 === 0) {
    console.log(`RSS [${built}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  const stream = fs.createReadStream('./test-small.bin');
  stream.on('data', function (data) {
    //data = Buffer.from(data); // WARNING: uncommenting this line fixes things somehow
    buf.push(data)
    length += data.length
    if (length >= maxLength) {
      buf = []
      length = 0
    }
  });
  stream.once('end', tock);
}

function tock() {
  fs.readFileSync('./test-large.bin');
  setImmediate(tick);
  /* Alternatively:
  const stream = fs.createReadStream('./test-large.bin');
  stream.on('data', function() {});
  stream.once('end', tick);
  */
}

tick();

Adding data = Buffer.from(data); fixes things somehow. The problems start when the exact same chunks from the stream are retained for some time while reading of some larger file goes on.
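For clarity, here is the workaround in isolation (a sketch; the assumption is that the copy gets its own allocation sized to the chunk, so the original chunk and whatever larger pool backs it can be collected):

'use strict';
const fs = require('fs');

// Workaround sketch: copy each chunk before retaining it, so retaining
// the copy does not pin whatever larger allocation backed the original.
const retained = [];
fs.createReadStream('./test-small.bin').on('data', (data) => {
  retained.push(Buffer.from(data)); // fresh allocation sized to the chunk
});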

gc-ing manually does not help — this looks like a memory leak.
All that memory is allocated through node::ArrayBufferAllocator::Allocate.

/cc @addaleax @nodejs/buffer

ChALkeR added the buffer, fs, memory labels on Jul 25, 2018
ChALkeR (Member, Author) commented on Jul 25, 2018

It might be caused by the fact that every n-th pooled buffer chunk allocated in this scenario is retained, effectively retaining the whole underlying pool buffers while actually needing only a small chunk of each.
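To illustrate the hypothesis (a sketch, not the actual fs internals): Buffer#slice returns a view, so retaining even a tiny slice keeps the entire backing allocation alive.

'use strict';

// A 64 KiB buffer standing in for a stream pool:
const pool = Buffer.allocUnsafe(64 * 1024);
// A 100-byte view into it, like a retained stream chunk:
const chunk = pool.slice(0, 100);
console.log(chunk.buffer === pool.buffer); // true: same backing memory
console.log(chunk.buffer.byteLength);      // 65536, pinned by `chunk` alone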

addaleax (Member) commented:

@ChALkeR Yes, looking at Buffer pooling would be my first guess, too. I’m not sure how they would be retained indefinitely, though, because you do reset buf = []?

ChALkeR (Member, Author) commented on Jul 25, 2018

@addaleax It shouldn't be caused by JS-side buffer pooling, as the second file has chunks larger than the pool size. Also, I have so far been unable to remove the dependency on fs — trying to allocate the buffers manually (or even passing the one obtained from fs through Buffer.from) makes the leak disappear.
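A quick sketch of the JS-side pooling boundary, based on my reading of the Buffer internals (only allocations smaller than half of Buffer.poolSize come out of the shared pool):

'use strict';

console.log(Buffer.poolSize); // 8192 by default
// Small allocations are carved out of a shared 8 KiB pool:
console.log(Buffer.allocUnsafe(100).buffer.byteLength);   // 8192
// 64 KiB fs stream chunks exceed that and get their own allocation:
console.log(Buffer.allocUnsafe(65536).buffer.byteLength); // 65536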

Updated testcase which removes the second file:

'use strict';

const fs = require('fs');

fs.writeFileSync('test-small.bin', Buffer.alloc(100));

const maxLength = 4 * 1024 * 1024; // 4 MiB
let built = 0, buf = [], length = 0;

function tick() {
  built++;
  if (built % 1000 === 0) {
    console.log(`RSS [${built}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  const stream = fs.createReadStream('./test-small.bin');
  stream.on('data', function (data) {
    //data = Buffer.from(data); // WARNING: uncommenting this line fixes things somehow
    buf.push(data)
    length += data.length
    if (length >= maxLength) {
      buf = []
      length = 0
    }
  });
  stream.once('end', tock);
}

function tock() {
  Buffer.alloc(65536);
  setImmediate(tick);
}

tick();

addaleax (Member) commented:

@ChALkeR I think it's not about Buffer's pooling but about the fs stream's buffer pooling – that would be consistent with what I can tell from inspecting a heap dump, where every single chunk retains its own backing pool.

In particular, this line seems less than ideal:

pool.used += toRead;

This advances the pool offset by the amount we planned to read (which could be the whole pool size), not by the amount we actually read.
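Schematically, the suspected mechanism looks like this (a simplified, self-contained sketch under my assumptions, not the actual lib/internal/fs/streams.js code):

'use strict';
const fs = require('fs');

const pool = Buffer.allocUnsafe(64 * 1024);
pool.used = 0;

const fd = fs.openSync('./test-small.bin', 'r');
const start = pool.used;
const toRead = pool.length - start; // plan to fill the rest of the pool
pool.used += toRead;                // reserve the *planned* size up front

fs.read(fd, pool, start, toRead, 0, (err, bytesRead) => {
  if (err) throw err;
  // A 100-byte file yields bytesRead === 100, yet pool.used already sits
  // at 64 KiB, so the next read allocates a brand-new pool while this
  // slice pins the old, almost-empty one in memory.
  const chunk = pool.slice(start, start + bytesRead);
  console.log(chunk.length, pool.used); // 100 65536
  fs.closeSync(fd);
});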

addaleax (Member) commented:

I think it’s a good sign that this seems to “solve” the problem:

diff --git a/lib/internal/fs/streams.js b/lib/internal/fs/streams.js
index f527b1de4b84..f2cda91c6873 100644
--- a/lib/internal/fs/streams.js
+++ b/lib/internal/fs/streams.js
@@ -171,6 +171,8 @@ ReadStream.prototype._read = function(n) {
       this.emit('error', er);
     } else {
       let b = null;
+      if (start + toRead === thisPool.used)
+        thisPool.used += bytesRead - toRead;
       if (bytesRead > 0) {
         this.bytesRead += bytesRead;
         b = thisPool.slice(start, start + bytesRead);

It only works if different streams don’t concurrently use the same pool, though, so I’m not sure it’s something we’d want to go with.

ChALkeR (Member, Author) commented on Jul 25, 2018

More details about the use case: I needed to concatenate a large number of files into a small number of files with compression. Something like this schematic shell pipeline:

cat a*.txt | lz4 > a.txt.lz4
cat b*.txt | lz4 > b.txt.lz4

In Gzemnid, I created several lz4 encoder streams (using node-lz4), piped them to output files, and upon encountering a file to read, piped it into one of those lz4 streams (depending on the file).

Now, the thing is that node-lz4 uses block compression and retains everything thrown into it until a sufficient input size is reached (4 MiB by default) — and in this scenario that data effectively comes from fs.createReadStream, which gets hit by this issue.
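In JS, the pipeline looked roughly like this (a sketch assuming node-lz4's createEncoderStream() API; file names are illustrative):

'use strict';
const fs = require('fs');
const lz4 = require('lz4'); // node-lz4

// One long-lived encoder per output file; node-lz4 buffers its input
// until a full block (4 MiB by default) is accumulated.
const encoder = lz4.createEncoderStream();
encoder.pipe(fs.createWriteStream('a.txt.lz4'));

// Concatenate inputs sequentially; every chunk pushed into the encoder
// comes straight from fs.createReadStream, which is the retention
// pattern that triggers this issue.
const files = ['a1.txt', 'a2.txt'];
(function next() {
  const file = files.shift();
  if (!file) return encoder.end();
  const src = fs.createReadStream(file);
  src.pipe(encoder, { end: false });
  src.once('end', next);
})();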

ChALkeR (Member, Author) commented on Jul 25, 2018

@addaleax What is also interesting is how exactly Buffer.alloc(65536) at the bottom affects the memory usage. This is the reduced testcase without fs:

'use strict';

const maxLength = 4 * 1024 * 1024; // 4 MiB
let built = 0, buf = [], length = 0;

function tick() {
  built++;
  if (built % 1000 === 0) {
    console.log(`RSS [${built}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  let data = Buffer.allocUnsafe(65536).slice(0, 128).fill(0);
  //data = Buffer.from(data); // uncommenting this line fixes things
  buf.push(data)
  length += data.length
  if (length >= maxLength) {
    buf = []
    length = 0
  }
  Buffer.alloc(65536); // commenting this line also significantly reduces memory usage
  setImmediate(tick);
}

tick();

High memory usage is expected here, though. This is probably something related to memory management.

ChALkeR (Member, Author) commented on Jul 25, 2018

@addaleax I guess that this is what's going on (re: how Buffer.alloc() affects things):

  1. When we have just the first part, Buffer.allocUnsafe (for the pool) allocates memory, but does not use it yet. Only a small chunk is used and this does not hit RSS immediately.
  2. When simultaneous Buffer usage is going on, it allocates, fills and uses a memory chunk, but quickly disposes of it. That chunk gets reused for the fs pool allocation (Buffer.allocUnsafe), which makes the underlying pool retain the actually claimed memory.

ChALkeR (Member, Author) commented on Jul 25, 2018

Testcase for that:

'use strict';

const bs = 1024 * 1024; // 1 MiB
const retained = [];
let i = 0, flag = false;

function tick() {
  i++;
  if (i % 1000 === 0) {
    console.log(`RSS [${i}]: ${process.memoryUsage().rss / 1024 / 1024} MiB`);
  }
  const buf = Buffer.allocUnsafe(bs);
  retained.push(buf);
  if (i === 20000) {
    console.log('Clearing retained and enabling alloc');
    retained.length = 0;
    flag = true;
  }
  if (flag) Buffer.alloc(bs); // Even Buffer.alloc(bs - 10) seems to be fine here
  if (i < 40000) setImmediate(tick);
}

tick();

ChALkeR (Member, Author) commented on Jul 25, 2018

The above (alloc affecting allocUnsafe memory usage) seems to be the underlying malloc behavior:

// Compile with -O0

#include <iostream>
#include <cstdlib>
#include <unistd.h>

using namespace std;

int main() {
  const int bs = 1024 * 1024; // 1 MiB
  const int count = 1000;
  void * x[count];
  for (int i = 0; i < count; i++) {
    free(calloc(bs * 2, 1)); // Commenting out this line reduces memory usage
    x[i] = malloc(bs);
  }
  cout << "Allocated" << endl;
  cout << "Sleeping..." << endl;
  sleep(20);
  cout << "Freeing" << endl;
  for (int i = 0; i < count; i++) {
    free(x[i]);
  }
  cout << "Hello" << endl;
  return 0;
}
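To reproduce, compile without optimizations and check RSS during the sleep (the file name here is made up):

g++ -O0 malloc-test.cc -o malloc-test
./malloc-test &
grep VmRSS /proc/$!/status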

ChALkeR (Member, Author) commented on Jul 25, 2018

@addaleax Btw, the testcases from #21967 (comment), #21967 (comment) and #21967 (comment) (but not the initial one) show significantly lower memory usage when LD_PRELOAD=/usr/lib64/libjemalloc.so is used.

Why don't we use jemalloc, btw?

addaleax (Member) commented:

@ChALkeR The question has come up a few times, but I don’t know that we’ve ever done a full investigation into the benefits/downsides. I don’t think the examples here are something that we should base a decision on; that the OS may lazily populate memory pages is not really something for Node to care about, imo.

targos pushed a commit that referenced this issue Jul 31, 2018
Fixes: #21967

PR-URL: #21968
Reviewed-By: Сковорода Никита Андреевич <chalkerx@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
ChALkeR added a commit to nodejs/Gzemnid that referenced this issue Aug 1, 2018
This hack should be reverted once
nodejs/node#21967 gets included into a
Node.js v8.x release.

Fixes: #18