Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize event handling with large arguments #9643

Closed
wants to merge 1 commit into from

Conversation

simonkcleung
Copy link

@simonkcleung simonkcleung commented Nov 16, 2016

Checklist
Affected core subsystem(s)
Description of change

In V8, it is faster to create an array by [] instead of new Array().

In V8, it is faster to create an array by ```[]``` instead of ``` new Array()'''.
@nodejs-github-bot nodejs-github-bot added the events Issues and PRs related to the events subsystem / EventEmitter. label Nov 16, 2016
@cjihrig
Copy link
Contributor

cjihrig commented Nov 16, 2016

Is it still faster taking into consideration that the current code creates an appropriately sized array from the start?

@mscdex
Copy link
Contributor

mscdex commented Nov 16, 2016

I'm a bit skeptical about this, can you provide a benchmark results comparison using benchmark/compare.R?

@simonkcleung
Copy link
Author

simonkcleung commented Nov 16, 2016

@cjihrig
yes, it is faster

@mscdex
Not familiar with benchmark. I tested it in https://github.com/nodejs/node/blob/master/benchmark/events/ee-emit-multi-args.js and line 17 changed to ee.emit('dummy',1,2,3,4,5,6);

@JacksonTian
Copy link
Contributor

-1 if without benchmark result.

@Trott
Copy link
Member

Trott commented Nov 16, 2016

I ran the events benchmarks and the statistically-significant results indicate that this new code is slightly slower.

Command I ran (after compiling the master branch and moving it to /var/tmp, then compiling the master branch with this patch applied and moving it to a different name in /var/tmp):

 node benchmark/compare.js --old /var/tmp/node-master --new /var/tmp/node-simonkcleung events | tee /var/tmp/events.csv

Command I then ran to process the data:

cat /var/tmp/events.csv | Rscript benchmark/compare.R

Results:

                                                     improvement significant    p.value
 events/ee-add-remove.js n=250000                         0.76 %             0.07922860
 events/ee-emit-multi-args.js n=2000000                  -1.30 %           * 0.04312630
 events/ee-emit.js n=2000000                             -0.63 %             0.63112687
 events/ee-listener-count-on-prototype.js n=50000000     -0.26 %             0.52637424
 events/ee-listeners-many.js n=5000000                   -1.25 %           * 0.03766378
 events/ee-listeners.js n=5000000                         0.19 %             0.75764889

@mscdex
Copy link
Contributor

mscdex commented Nov 16, 2016

Yeah I don't see how calling push() on a non-pre-allocated array could be faster than a pre-allocated array. I could maybe understand if it performed the same if V8 implicitly pre-allocates some space and the benchmark only happens to push() a number of items less than that initial internal allocation. Otherwise V8 has to perform some kind of additional allocation on push(), whereas specifying a size upfront would never (or at least should not) cause any additional allocations (provided you don't try to append more than that specified size of course).

@Trott
Copy link
Member

Trott commented Nov 16, 2016

ALTHOUGH...I changed the benchmark file as described by @simonkcleung in #9643 (comment) and the improvement if I do that is downright astonishing:

events/ee-emit-multi-args.js n=2000000                1017.58 %         *** 3.928048e-40

@Fishrock123
Copy link
Contributor

@Trott are you sure you changed that right?

@Trott
Copy link
Member

Trott commented Nov 17, 2016

@Fishrock123 Yes, I'm pretty certain, but if you or someone else (@mscdex?) want to see if you can or can't replicate my results, that would be great.

@targos
Copy link
Member

targos commented Nov 17, 2016

I also get a similar result:

events/ee-emit-multi-args.js n=2000000                 991.02 %         *** 4.679818e-72

It still feels terribly wrong to me...

@bnoordhuis
Copy link
Member

Can you post the diff? I'd like to give it a try.

@targos
Copy link
Member

targos commented Nov 17, 2016

You mean this diff ?

diff --git a/benchmark/events/ee-emit-multi-args.js b/benchmark/events/ee-emit-multi-args.js
index b423c21..0ca3026 100644
--- a/benchmark/events/ee-emit-multi-args.js
+++ b/benchmark/events/ee-emit-multi-args.js
@@ -14,7 +14,7 @@ function main(conf) {

   bench.start();
   for (var i = 0; i < n; i += 1) {
-    ee.emit('dummy', 5, true);
+    ee.emit('dummy', 1, 2, 3, 4, 5, 6);
   }
   bench.end(n);
 }

@targos
Copy link
Member

targos commented Nov 18, 2016

So I profiled a run of this modified benchmark with current master and the result shows that it passes a significant amount of time in Runtime_CreateListFromArrayLike :

 [C++ entry points]:
   ticks    cpp   total   name
   4285   99.1%   80.7%  v8::internal::Runtime_CreateListFromArrayLike(int, v8::internal::Object**, v8::internal::Isolate*)
     20    0.5%    0.4%  v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)
...

 [Bottom up (heavy) profile]:
...
   ticks parent  name
    542   10.2%  void v8::internal::LookupIterator::Start<true>()
    542  100.0%    v8::internal::Runtime_CreateListFromArrayLike(int, v8::internal::Object**, v8::internal::Isolate*)
    542  100.0%      LazyCompile: *emit events.js:136:44
    535   98.7%        LazyCompile: *main /home/mzasso/git/forks/node/benchmark/events/ee-emit-multi-args.js:7:14
    535  100.0%          LazyCompile: ~Benchmark.process.nextTick /home/mzasso/git/forks/node/benchmark/common.js:24:22
    535  100.0%            LazyCompile: ~_combinedTickCallback internal/process/next_tick.js:65:33

This doesn't happen when this PR is applied (Runtime_CreateListFromArrayLike doesn't appear in the profile).

master.txt
pr-9643.txt

@mscdex
Copy link
Contributor

mscdex commented Nov 18, 2016

Ok so the regression happened between node v5.x and v6.0.0, which is V8 v4.6 and v5.1 respectively.

I've also found out that the performance regression actually has nothing to do with either the value types being assigned or the act of assigning values to the array in general, but it seems to be the array allocation causing the slowdown. If I remove the array assignment loop completely and just do:

  • new Array(6): it's still as slow as it is currently
  • new Array(1): ... same ...
  • new Array(0), new Array(), or []: now really fast (perhaps obviously)

So it's not related to explicit constructor use or the supplying of a length argument to the constructor, but it's whenever a positive, non-zero length is passed to the constructor.

Without bisecting, there is one particular V8 commit that was landed during V8 4.9 where array construction logic was changed (and these changes still exist in master) that might be a good candidate...

@targos
Copy link
Member

targos commented Nov 18, 2016

cc @bmeurer ?

@bmeurer
Copy link
Member

bmeurer commented Nov 18, 2016

W/o looking further into the benchmark, it seems that there's some Function.prototype.apply or Reflect.apply/Reflect.construct call involved, that would also explain the performance difference.

If you do

var a = [];
for (var i = 0; i < len; ++i) {
  a[i] = some value;
}

that creates a non-holey array (i.e. V8 knows that there are no holes in the elements and thus we don't need to fall back to lookups on prototypes). However

var a = new Array(len);
for (var i = 0; i < len; ++i) {
  a[i] = some value;
}

creates a holey array from the get go (thanks to EcmaScripts wonderful "length" property magic).

So if you now use such an array with Function.prototype.apply (or the Reflect.apply/Reflect.construct ES6 additions), you'll hit the fast path for non-holey arrays where we just push all elements onto the stack and call through, while for holey arrays, we first need to turn them into a list (represented as FixedArray internally) and can then push them onto the stack.

There's an open issue to extend this fast case at some point to also cover a bunch of holey array cases, but so far we didn't get back to that.

@mscdex
Copy link
Contributor

mscdex commented Nov 18, 2016

Here is a simpler repro, it uses a constant length value (which I think V8 typically optimizes for?) but it still seems to exhibit the same slowdown:

function foo() {
  return 1 + 1;
}

var n = 2e6;

console.time('fn.apply');
for (var i = 0; i < n; i += 1) {
  //var args = new Array(6);
  //args[0] = args[1] = args[2] = args[3] = args[4] = args[5] = 1;
  var args = [];
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  args.push(1);
  foo.apply(this, args);
}
console.timeEnd('fn.apply');

Replace the push()s with the commented section to see the difference.

  • node v5.12.0 / V8 4.6.85.32
    • new Array(6): fn.apply: 231.184ms
    • sequential push()s: fn.apply: 303.116ms
  • node v6.0.0 / V8 5.0.71.35
    • new Array(6): fn.apply: 913.341ms
    • sequential push()s: fn.apply: 156.452ms
  • node master / V8 5.4.500.41
    • new Array(6): fn.apply: 972.002ms
    • sequential push()s: fn.apply: 148.616ms

EDIT: oops, didn't see the responses above before I posted this....

@bmeurer
Copy link
Member

bmeurer commented Nov 18, 2016

Yap, so it's the missing fast path for holey arrays in Function.prototype.apply then.

@bmeurer
Copy link
Member

bmeurer commented Nov 18, 2016

This is related to https://bugs.chromium.org/p/v8/issues/detail?id=4826, although that particular one is about the double arrays.

@bnoordhuis
Copy link
Member

Yes, that's indeed what happens.

$ node --allow_natives_syntax -e '%DebugPrint([1,2,3])' | grep 'elements kind'
 - elements kind: FAST_SMI_ELEMENTS

$ node --allow_natives_syntax -e '%DebugPrint([5,true])' | grep 'elements kind'
 - elements kind: FAST_ELEMENTS

$ node --allow_natives_syntax -e '%DebugPrint(new Array(2))' | grep 'elements kind'
 - elements kind: FAST_HOLEY_SMI_ELEMENTS

@bmeurer
Copy link
Member

bmeurer commented Nov 18, 2016

I'll see if I can cook a quickfix for the FAST_HOLEY_SMI_ELEMENTS and FAST_HOLEY_ELEMENTS cases.

@mscdex
Copy link
Contributor

mscdex commented Nov 18, 2016

Even

var args = [];
args[0] = args[1] = args[2] = args[3] = args[4] = args[5] = 1;
fn.apply(this, args);

is just as slow FWIW.

@bmeurer
Copy link
Member

bmeurer commented Nov 18, 2016

Sure, you store to args[5] first, meaning you turn args into a holey array with holes at elements 0,...,4.

@mscdex
Copy link
Contributor

mscdex commented Nov 18, 2016

Sure, you store to args[5] first, meaning you turn args into a holey array with holes at elements 0,...,4.

Ah oops you're right :-) That's what I get for trying to make a one-liner... Reversing the indices is fast though as expected now.

@bmeurer
Copy link
Member

bmeurer commented Nov 18, 2016

I have a fix here (intel only): https://codereview.chromium.org/2510043004

Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm -1 on this. Using the preallocated array is preferable when possible.

@simonkcleung
Copy link
Author

As @bmeurer pointed out, using a holy array in apply() causes the slowdown. If it is fixed (in V8), nothing have to be changed.

@fhinkel
Copy link
Member

fhinkel commented Dec 6, 2016

I think we're still waiting for the S390 and PPC port of that V8 fix /cc @jasnell

@mscdex mscdex added the v8 engine Issues and PRs related to the V8 dependency. label Dec 6, 2016
@jbajwa
Copy link
Contributor

jbajwa commented Dec 10, 2016

@fhinkel Sorry for the delay, I somehow missed this CL. Just uploaded the port for PPC/s390.

@bmeurer
Copy link
Member

bmeurer commented Dec 11, 2016

@jbajwa Thanks. So with your port, we have Intel, ARM, MIPS and PPC/s390 ports for the regression.

@jasnell Pre-allocating an array is not generally beneficial, depending on how the array is used. For example, load/store access to non-holey array is generally faster. So if you can create an array by pushing elements to it, that can be performance-win if the potential overhead from copying during setup is not dominating.

@mscdex
Copy link
Contributor

mscdex commented Jan 17, 2017

Any word on if/when the relevant upstream V8 changes will be backported to V8 5.4 (assuming V8 5.5 won't make it into node v7)?

@addaleax
Copy link
Member

@mscdex Not sure if that’s even a partial answer to your question, but according to #9730 (comment) V8 5.4 is not maintained anymore, which I guess means that we’d have to take care of any backports ourselves?

@mscdex
Copy link
Contributor

mscdex commented Jan 17, 2017

@addaleax If that's the case, I thought there was still supposed to be some sort of special coordination going on to get them backported on V8's side so we wouldn't have to float patches ourselves? Or is that only for V8 versions used by node LTS releases?

@addaleax
Copy link
Member

Sorry, I don’t really know what our current process for these things is. @nodejs/v8 probably knows?

@bmeurer
Copy link
Member

bmeurer commented Jan 17, 2017

As far as I know we only backmerge fixes for Node LTS versions (and Chrome versions in the wild). @fhinkel might know more.

These CLs might be quite easy to backmerge AFAIR.

@natorion
Copy link

natorion commented Jan 17, 2017

EDIT: Node has its own V8 fork where patches are merged to. See https://github.com/nodejs/node/blob/master/doc/guides/maintaining-V8.md for more information on this.

Seems like the separate fork is still a proposal. Please refer to https://github.com/nodejs/node/blob/master/doc/guides/maintaining-V8.md#backporting-to-abandoned-branches as a manual for backporting.

@ofrobots is often coordinating this.

@ofrobots
Copy link
Contributor

I'm +1 on backporting for 5.4, however this would also be need to be backported to V8 5.5 and 5.6 upstream. @natorion How likely is upstream to approve the merge request for 5.5 and 5.6?

@fhinkel
Copy link
Member

fhinkel commented Jan 23, 2017

@ofrobots 5.5 is abandoned as of this week. We'd need to backmerge ourselves, just like for 5.4.

@jasnell
Copy link
Member

jasnell commented Mar 24, 2017

Status update on this one?

@jasnell jasnell added the stalled Issues and PRs that are stalled. label Mar 24, 2017
@fhinkel
Copy link
Member

fhinkel commented Mar 26, 2017

I think we can close this PR because the performance problem is addressed upstream in V8.

@mscdex mscdex closed this Mar 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
events Issues and PRs related to the events subsystem / EventEmitter. stalled Issues and PRs that are stalled. v8 engine Issues and PRs related to the V8 dependency.
Projects
None yet
Development

Successfully merging this pull request may close these issues.