
Fix file descriptor exhaustion when streaming many files #121

Merged 3 commits into gulpjs:master on Jan 12, 2016

Conversation

Klowner
Contributor

@Klowner Klowner commented Oct 31, 2015

It appears that streamFile was creating duplicate ReadStreams,
which somehow caused the process to never free WriteStreams. I've also
changed the removeListeners calls to unlisten from complete rather
than cb, since they're never told to listen to cb in the first
place.

I suspect this is related to #56?

@Klowner
Contributor Author

Klowner commented Oct 31, 2015

var vfs = require('vinyl-fs'),
    exhaust = require('stream-exhaust');

exhaust(vfs.src('/dir/with/ton/of/felix-the-cat-images/honest/*.png', {buffer: false})
    .pipe(vfs.dest('/elsewhere/')));

Without this change, the process dies about 4085 files into the copy; with the change, I can copy all 9000 images.

@yocontra
Member

@Klowner nice catch on the 1st part - I'm a little unclear on the 2nd part, maybe we should trace that back further to find the root cause why files are passed to that function more than once?

@Klowner
Contributor Author

Klowner commented Oct 31, 2015

It should have been two PRs, I suppose. I'd split it out but I'm at a wedding
right now.

It is strange that it happens twice, but the file.path at that point is
the destination, so creating a read stream there makes no sense, does it? I
got turned around a few times going through that :)


@yocontra
Member

yocontra commented Nov 1, 2015

@Klowner I think it's creating a new read stream every time it goes through dest, maybe?

@Klowner Klowner force-pushed the file-descriptor-exhaustion-issue branch from b1032b3 to 4c7837f on November 1, 2015 15:41
@Klowner
Contributor Author

Klowner commented Nov 1, 2015

It looks like streamFile() is called initially via src() inside getContents(), assuming options.read is not equal to false. Then later during dest() it's called via writeStream()'s success() handler. I assume it's important to have the check in both places in cases where stream files are injected into the stream via means other than vinyl.src().

I suppose the BOM stripping should only happen once though?

@yocontra
Member

yocontra commented Nov 1, 2015

@Klowner the idea behind that is that when dest consumes the stream, it should restart it before passing the file back out (by creating a new read stream as it writes). I guess the problem is that it doesn't close the old stream before creating the new one, which means every dest call doubles the number of open streams for a short period.

@Klowner
Contributor Author

Klowner commented Nov 1, 2015

@contra ahh, in retrospect that makes a lot of sense. So dest should be responsible for placing the stream's content onto the disk and then it should immediately create matching read streams to stream back off the disk and onto whatever else occurs later in the pipe?

Those presumably just get garbage collected? Or do they have to go into some sort of "sink"?

@Klowner
Contributor Author

Klowner commented Nov 2, 2015

My apologies if I'm being dumb, but I'm pretty sure I at least understand why there's this issue of file descriptor exhaustion. With dest consuming files from the stream, writing them to disk, and then creating new readable streams back into the stream, those readable stream files (and associated file descriptors) build up at the end of the stream until all the incoming files have passed through dest.

Once the stream is finished, all of those can be garbage collected and their file descriptors freed, but the issue I'm seeing is that when enough files pass through the stream to exhaust the ulimit, the node process runs out of file descriptors and dies (albeit very silently, thanks to graceful-fs?).

I implemented a simple sink that handles closing file streams at the end of a stream, and that seems to solve the issue. Another idea I had but haven't tried testing is to try to make readStream lazy, so it only allocates a file descriptor when something attempts to read from it.

@Klowner Klowner mentioned this pull request Nov 2, 2015
@Klowner Klowner force-pushed the file-descriptor-exhaustion-issue branch 2 times, most recently from 717952c to ea6cbec on November 2, 2015 18:22
@yocontra
Member

yocontra commented Nov 2, 2015

@Klowner maybe explore making dest not re-create the stream from the file system at all? in dest fork the file.contents stream into two read streams, start writing one to the hd then replace file.contents with the forked tail and pass it down the stream

@yocontra
Member

yocontra commented Nov 3, 2015

@Klowner just did some digging, this could be interesting to play with https://www.npmjs.com/package/graceful-fs-stream

@Klowner Klowner force-pushed the file-descriptor-exhaustion-issue branch from ea6cbec to f079c90 on November 6, 2015 14:02
@Klowner
Contributor Author

Klowner commented Nov 9, 2015

@contra using graceful-fs-stream seems to work great, no more file descriptor exhaustion. Unfortunately it introduces an issue with the futime stuff, since that expects the stream's fd to be available. I think I'll wait for the futime tickets to get closed and then see what I can do.

@phated
Member

phated commented Nov 9, 2015

@Klowner the implementation inside graceful-fs-stream looks pretty straightforward; however, it monkeypatches the fs module, which we should probably avoid. If you implemented those as standalone helpers in vfs, we would have access to the file descriptors, etc.

@Klowner
Contributor Author

Klowner commented Nov 9, 2015

@phated eep, thank you for pointing that out 👍

@phated
Member

phated commented Dec 10, 2015

@Klowner will you have a chance to update this PR to avoid using graceful-fs-stream?

@yocontra
Member

Could also fix graceful-fs-stream to stop monkeypatching globals, your choice on which will be easiest though 🌴

@Klowner
Contributor Author

Klowner commented Dec 11, 2015

I hope to get to it soon; December is always insanely busy for me.

@phated
Member

phated commented Dec 16, 2015

I'd actually love for this to land in graceful-fs proper with tests

@phated
Member

phated commented Dec 16, 2015

graceful-fs seems to have tests for this at https://github.com/isaacs/node-graceful-fs/blob/master/test/read-write-stream.js#L14-L31

I changed the num to 9000 and the tests still pass. Is this maybe something we are doing wrong?

@Klowner
Contributor Author

Klowner commented Dec 16, 2015

I don't think we're doing anything wrong here. I'm pretty sure the difference is that in those tests the file streams are being opened, written to, and closed, as opposed to vinyl-fs's dest(), which writes and closes but then immediately re-opens the written file and passes it on to the next stream operation. When buffering is enabled, the files are read immediately and I believe the file descriptor is closed shortly thereafter, but when buffer: false is set, it just opens a ton of file descriptors and passes them down the pipe. Unless something closes those file descriptors, they have to wait for garbage collection in order to be closed.

Using something like graceful-fs-stream prevents the file streams from being opened until something needs to read the file contents, so we avoid the whole file descriptor pile-up issue. I'm going to dig into it this evening if all goes as planned. 👍

@yocontra
Member

@Klowner Thanks for your time helping look into this, btw 🙌

@Klowner
Contributor Author

Klowner commented Dec 20, 2015

The sad realization: my initial "solution" using graceful-fs-stream only worked because it was monkeypatching the global fs. It looks like the ReadableStreams that are holding file descriptors are being constructed somewhere deep within through2. 😭

@Klowner Klowner closed this Dec 21, 2015
@Klowner Klowner force-pushed the file-descriptor-exhaustion-issue branch from f079c90 to 8fc3dd2 on December 21, 2015 20:50
@Klowner Klowner deleted the file-descriptor-exhaustion-issue branch December 21, 2015 20:53
@Klowner Klowner restored the file-descriptor-exhaustion-issue branch December 21, 2015 20:53
@Klowner Klowner reopened this Dec 21, 2015
@Klowner
Contributor Author

Klowner commented Dec 21, 2015

lazystream seems to work very well, I was able to stream 9000 files through two dest()s.

@phated
Member

phated commented Dec 21, 2015

@Klowner that looks super cool. A couple of things I'd like to get the maintainer to do: rely only on readable-stream, and bump to 1.0. Could you open an issue over there asking if those are possible?

@phated
Member

phated commented Dec 21, 2015

Also, do we need to update the Writable to use this too?

@Klowner
Contributor Author

Klowner commented Dec 21, 2015

I'll check with their maintainer. I don't think the lazy writable needs to be
used, as it would just prevent the output stream from allocating the output fd
until data is written to it.


@Klowner
Contributor Author

Klowner commented Dec 22, 2015

@phated just to clarify, you'd like lazystream to depend on any version of readable-stream rather than ~1.0.2?

@Klowner
Contributor Author

Klowner commented Dec 22, 2015

relevant issue jpommerening/node-lazystream#2

@Klowner
Contributor Author

Klowner commented Dec 22, 2015

aaaaaand I grossly misunderstood 😝

@Klowner Klowner force-pushed the file-descriptor-exhaustion-issue branch 2 times, most recently from e044e85 to 4b51102 on December 29, 2015 16:44
@Klowner Klowner force-pushed the file-descriptor-exhaustion-issue branch from 4b51102 to c63a797 on December 29, 2015 16:48
@Klowner
Contributor Author

Klowner commented Dec 29, 2015

Strange, looks like the wipeOut() portion of the tests failed on Node stable.

@phated
Member

phated commented Dec 29, 2015

@Klowner you can increase the timeout with this.timeout(8000) or if we don't want a timeout, just use 0

@phated
Member

phated commented Dec 29, 2015

@Klowner Is there a way you can write a test for this or would it make the test suite insanely long running?

@Klowner
Contributor Author

Klowner commented Dec 29, 2015

The manual test I performed with 9000 144KB files (~1.3GB) actually runs pretty quickly, but that's a ton of disk space. I suppose if we could get src() to create ~5000 readable streams from the same file... it's not immediately obvious how that could be done, but I can definitely look into it.

@Klowner
Contributor Author

Klowner commented Dec 30, 2015

I added a test. The downside is that it's a little slow and consumes about 25MB of disk space, but that doesn't seem too unreasonable to me.

Without lazystream, npm test just silently dies with a return code of 1. I assume Travis would pick that up as a failure?

@Klowner
Contributor Author

Klowner commented Dec 30, 2015

According to Travis docs the non-zero return code should signal a test failure, so this test should work! Let me know if that's cool for you, @phated 👍

@phated
Member

phated commented Jan 12, 2016

@Klowner you, sir, are awesome. Thanks for all the work you put into this patch.

phated added a commit that referenced this pull request Jan 12, 2016
Fix file descriptor exhaustion when streaming many files
@phated phated merged commit 9d276f1 into gulpjs:master Jan 12, 2016
@Klowner
Contributor Author

Klowner commented Jan 12, 2016

You bet, sorry it took me so long :)

phated added a commit that referenced this pull request Dec 5, 2017
Fix file descriptor exhaustion when streaming many files