Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

child_process: measure buffer length in bytes #6764

Closed
wants to merge 12 commits into from
Closed

Conversation

Trott
Copy link
Member

@Trott Trott commented May 14, 2016

Checklist

child_process test

  • tests and code linting passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)

child_process test

Description of change

This change fixes a known issue where maxBuffer limits by characters
rather than bytes.

Fixes: #1901

Probably need to benchmark it against the current implementation. It's entirely possible no one has done it this way because of performance?

@Trott Trott added child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests. known issue test labels May 14, 2016
@Trott
Copy link
Member Author

Trott commented May 14, 2016

I've added a benchmark to check performance and it shows that this is as performant as the current implementation. However, the benchmark could definitely use a review as it's very easy to do benchmarks wrong and I have no particular expertise. @nodejs/benchmarking

@Trott Trott force-pushed the bytes branch 2 times, most recently from 5765246 to 6060a64 Compare May 14, 2016 23:32
@jasnell
Copy link
Member

jasnell commented May 15, 2016

@trevnorris @mscdex

const dur = +conf.dur;
const len = +conf.len;

const msg = '"' + Array(len).join('.') + '"';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use '.'.repeat(len) for this kind of thing nowadays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also run msg.test(/./) to flatten the string. Creates more reliable benchmark results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex @trevnorris So I should do this?

const msg = `"${'.'.repeat(len)}"`;
msg.match(/./);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. that looks good. i'm not sure what the internal mechanics of repeat() do, but if it's similar to doing += then should take care of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done! The benchmark shows frighteningly similar results for Node.js 6.2.0 vs. this PR.

6.2.0:

$ node benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40873.45059
child_process/child-process-exec-stdout.js len=256 dur=5: 40853.93567
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.88575
child_process/child-process-exec-stdout.js len=4096 dur=5: 40851.36544
child_process/child-process-exec-stdout.js len=32768 dur=5: 40876.38241
$

This PR:

$ ./node benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40873.08798
child_process/child-process-exec-stdout.js len=256 dur=5: 40860.52815
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.13205
child_process/child-process-exec-stdout.js len=4096 dur=5: 40869.97254
child_process/child-process-exec-stdout.js len=32768 dur=5: 40863.11006
$

@mscdex
Copy link
Contributor

mscdex commented May 15, 2016

I did some benchmarking outside of node recently that was related to this and IIRC concatenating strings was faster (rather than creating the string at the end)? This would need some double checking though.

@@ -277,7 +271,7 @@ exports.execFile = function(file /*, args, options, callback*/) {
if (!encoding)
_stderr.push(chunk);
else
_stderr += chunk;
_stderr += chunk.toString(encoding);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this yield incorrect results if a variable-width encoded character is split across two chunks?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is true. That's why setEncoding() was being used initially. This will need to be solved differently I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the solution is to always store the data as Buffers and concat/toString them just before being emitted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that what StringDecoder is for? (didn't look at this in detail so may be missing something)

Copy link
Contributor

@mscdex mscdex May 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One alternative would be to keep setEncoding(), etc. and only change the length increment lines to something like:

stderrLen += (this.encoding ? Buffer.byteLength(chunk, this.encoding) : chunk.byteLength);

I haven't benchmarked the difference between that solution and only converting to string at the end though.

EDIT: or another take on this alternative solution would be to add another data handler first that depends on whether encoding is set that increments the length appropriately, if you want to avoid doing the encoding check on every data event (maybe the performance difference is negligble, I haven't tested):

var onStderrData;
if (encoding) {
  child.stderr.setEncoding(encoding);
  onStderrData = function(chunk) {
    stderrLen += Buffer.byteLength(chunk, this.encoding);
  };
} else {
  onStderrData = function(chunk) {
    stderrLen += chunk.byteLength;
  };
}
child.stderr.addListener('data', onStderrData);
child.stderr.addListener('data', function(chunk) {
  if (stderrLen > options.maxBuffer) {
  // ...
});

EDIT 2: Actually the above solution wouldn't work since the encoding could be changed at any time, so you would need to do what I originally suggested (checking this.encoding on each chunk).

Copy link
Contributor

@mscdex mscdex May 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasnell The problem this PR is trying to solve is that the units for maxBuffer are in bytes, but once you set up a string decoder, there is no way to know how many bytes went into making the string that gets emitted on data. That means you can no longer rely on chunk.length since you could have multi-byte characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mscdex ... +1 gotcha.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, does it make sense to emit on byte count and not character count if encoding is set? Only because the string passed to 'data' can vary noticeably if reading in utf8 characters.

@cjihrig
Copy link
Contributor

cjihrig commented May 16, 2016

By the way, the discussion in #1902 is probably relevant. The PR changed over time and some of the commits seem to be lost, but check out the comments by @piscisaureus.

@mscdex
Copy link
Contributor

mscdex commented May 16, 2016

@cjihrig I personally don't see a problem with going over by at most 3 bytes if the stream ends on a partial character, but you could always peek into the StringDecoder state to see how many bytes are buffered at the end of the stream to account for those bytes.

@trevnorris
Copy link
Contributor

@mscdex While I'm not sure whether we can account for this, there are cases where the buffered character is larger. Here's an exaggerated case:

const s = 'ก็็็็็็็็็็็็็็็็็็็็';
console.log(s.length);  // 21
console.log(Buffer.from(s));  // 63

@mscdex
Copy link
Contributor

mscdex commented May 16, 2016

@trevnorris and Buffer.byteLength(s, 'utf8') === 63, so I still don't see what the problem is with the presented alternative solutions (mine or yours).

If you pass say Buffer.from(s).slice(0, 3) to a string decoder, a single character will be returned. Calling Buffer.byteLength() on that character will return 3 bytes. Concatenating the output from the string deocder still results in the same s value.

@trevnorris
Copy link
Contributor

trevnorris commented May 16, 2016

@mscdex Sorry, didn't convey my point. It's that while the character code may be complete it's also possible that the rendered character is incomplete because of missing diacritical marks. But also that this isn't a case we're able to handle.

@trevnorris
Copy link
Contributor

For example:

Buffer('Á');  // <Buffer c3 81>
Buffer('Á');  // <Buffer 41 cc 81>

Appear to be the same character, but the first is actually '\u00c1' and the other is 'A\u0301'. So the rendered output on the other end is incomplete.

@mscdex
Copy link
Contributor

mscdex commented May 17, 2016

@trevnorris I still don't quite follow. Are you referring to a situation where Buffer.from([0x41,0xcc]) is passed to the string decoder but the needed 0x81 is never received from the actual stderr stream? If so, there is nothing that node or anyone else can ever do about that, but what matters is that we can count the bytes received. So in that particular case we'd see the 'A' emitted by the decoder which would count for one byte and at stderr end we can stderrLen += child.stderr._readableState.decoder.charReceived to account for the received 0xcc byte.

Now that I think about it, maybe yet another (better) solution is to just manually instantiate a string decoder and pass that the data chunks, that way we avoid having to peer into _readableState and the string decoder's state variables?

@jasnell
Copy link
Member

jasnell commented May 17, 2016

maybe yet another (better) solution is to just manually instantiate
a string decoder and pass that the data chunks, that way we avoid
having to peer into _readableState and the string decoder's state variables?

That's where I was leading with #6764 (comment)

@mscdex
Copy link
Contributor

mscdex commented May 17, 2016

@jasnell Ah ok. The only difference performance-wise with that last proposed solution is that it may be still be possible for streams to internally concatenate buffered Buffers (with Buffer.concat()) vs concatenating plain strings (when using .setEncoding()) before data is emitted? I'm not sure what that cost difference is.

@jasnell
Copy link
Member

jasnell commented May 17, 2016

Yeah, not sure either. Would need to benchmark it to be certain
On May 16, 2016 6:10 PM, "Brian White" notifications@github.com wrote:

@jasnell https://github.com/jasnell Ah ok. The only difference
performance-wise with that last proposed solution is that it may be still
be possible for streams to internally concatenate buffered Buffers (with
Buffer.concat()) vs concatenating plain strings (when using .setEncoding())
before data is emitted? I'm not sure what that cost difference is.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#6764 (comment)

@trevnorris
Copy link
Contributor

@mscdex This isn't about the string decoder. It's about the (very unlikely) possibility of data coming in, then say be logged. Where it would basically be:

console.log('A');
console.log('\u0301');

The rendered output would be wrong, even though each character is technically correct. I'm not saying we should consider supporting this properly. Just wanted to make the point that there are cases where even having the full utf8 characters won't render properly.

@jasnell
Copy link
Member

jasnell commented May 17, 2016

If someone was truly worried about that case they could buffer and wait for a new line to be received before calling console.log(). Not all that worried about that possibility

@mscdex
Copy link
Contributor

mscdex commented May 17, 2016

@trevnorris That's a separate issue though and node can't do anything about that anyway. Being able to count the bytes is the issue here, but even that separate issue isn't an issue for the majority of people that use exec() with a callback since the entire (buffered) string would be available for each stream (so it's not possible for users to console.log() multiple times unless they explicitly slice the string or also add their own stream event handlers, but that is a user problem at those points).

@trevnorris
Copy link
Contributor

@mscdex sure thing. there was another comment that prompted me to make mention of this, and clarify that node doesn't have any intention of handling it.

@Trott
Copy link
Member Author

Trott commented May 19, 2016

I've changed the implementation to only run toString() on the buffer in the exit handler. My interpretation of the discussion above is that this still doesn't account for some unlikely edge cases, but that we probably can't do much about those anyway. PTAL.

@mscdex
Copy link
Contributor

mscdex commented May 19, 2016

@Trott You'd need to do it in the close event since data can still be received after exit. Nevermind, I see that exithandler is actually used for the close event.

@Trott
Copy link
Member Author

Trott commented May 24, 2016

CI seems to be experiencing less heartburn now than 2 days ago. Let's try again. CI: https://ci.nodejs.org/job/node-test-pull-request/2754/


child.stdout.addListener('data', function(chunk) {
stdoutLen += chunk.length;
// If `child.stdout.setEncoding('utf8')` happened in userland, convert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean any setEncoding() call, or specifically w/ 'utf8' passed?

Copy link
Member Author

@Trott Trott May 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean any setEncoding(). Will change the comment...

@Trott
Copy link
Member Author

Trott commented May 24, 2016

OK... CI looks good... https://ci.nodejs.org/job/node-test-pull-request/2754/

Benchmarks are alarmingly on par:

$ ./node-pr-6764 benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40871.77387
child_process/child-process-exec-stdout.js len=256 dur=5: 40870.66593
child_process/child-process-exec-stdout.js len=1024 dur=5: 40905.69407
child_process/child-process-exec-stdout.js len=4096 dur=5: 40880.96963
child_process/child-process-exec-stdout.js len=32768 dur=5: 40856.23579
$ node-6.2.0 benchmark/child_process/child-process-exec-stdout.js 
child_process/child-process-exec-stdout.js len=64 dur=5: 40872.87520
child_process/child-process-exec-stdout.js len=256 dur=5: 40868.80411
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.62745
child_process/child-process-exec-stdout.js len=4096 dur=5: 40876.45268
child_process/child-process-exec-stdout.js len=32768 dur=5: 40866.85730
$

Do we feel good about this as a solution to the problem? If so, can I get an LGTM or two? If not, what are the deficiencies?

@mscdex
Copy link
Contributor

mscdex commented May 24, 2016

CI is green, LGTM.

Trott added a commit to Trott/io.js that referenced this pull request May 25, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: nodejs#6764
Fixes: nodejs#1901
Reviewed-By: Brian White <mscdex@mscdex.net>
@Trott
Copy link
Member Author

Trott commented May 25, 2016

Landed in c9a5990

@Trott Trott closed this May 25, 2016
Fishrock123 pushed a commit to Fishrock123/node that referenced this pull request May 30, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: nodejs#6764
Fixes: nodejs#1901
Reviewed-By: Brian White <mscdex@mscdex.net>
rvagg pushed a commit that referenced this pull request Jun 2, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: #6764
Fixes: #1901
Reviewed-By: Brian White <mscdex@mscdex.net>
Trott added a commit to Trott/io.js that referenced this pull request Jun 23, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

This necessarily changes default behavior of `stdout` and `stderr`
callbacks such that they receive buffers rather than strings. The
alternative would be a performance hit on the defaults.

Refs: nodejs#6764
Refs: nodejs#1901
@MylesBorins
Copy link
Contributor

@Trott lts?

@Trott
Copy link
Member Author

Trott commented Jul 11, 2016

@thealphanerd If it lands cleanly, yes.

MylesBorins pushed a commit that referenced this pull request Jul 11, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: #6764
Fixes: #1901
Reviewed-By: Brian White <mscdex@mscdex.net>
@Trott Trott mentioned this pull request Jul 12, 2016
3 tasks
MylesBorins pushed a commit that referenced this pull request Jul 12, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: #6764
Fixes: #1901
Reviewed-By: Brian White <mscdex@mscdex.net>
MylesBorins pushed a commit that referenced this pull request Jul 12, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: #6764
Fixes: #1901
Reviewed-By: Brian White <mscdex@mscdex.net>
@MylesBorins MylesBorins mentioned this pull request Jul 12, 2016
MylesBorins pushed a commit that referenced this pull request Jul 14, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: #6764
Fixes: #1901
Reviewed-By: Brian White <mscdex@mscdex.net>
MylesBorins pushed a commit that referenced this pull request Jul 14, 2016
This change fixes a known issue where `maxBuffer` limits by characters
rather than bytes. Benchmark added to confirm no performance regression
occurs with this change.

PR-URL: #6764
Fixes: #1901
Reviewed-By: Brian White <mscdex@mscdex.net>
@Trott Trott deleted the bytes branch January 13, 2022 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
child_process Issues and PRs related to the child_process subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the maxBuffer isn't complete accurate in child_process
6 participants