Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized waterfall, parallel, et al. #1395

Merged
merged 12 commits into from
Apr 7, 2017
Merged

Optimized waterfall, parallel, et al. #1395

merged 12 commits into from
Apr 7, 2017

Conversation

aearly
Copy link
Collaborator

@aearly aearly commented Apr 3, 2017

I ran some profiling, and notices that a lot of time is being spent in internal/rest. It turns out rest() is much slower than a straightforward arguments slice(). (I also tried the "inline slice" strategy with a loop that is sometimes recommended, it made absolutely no difference). The differences are significant, especially for parallel:

Benchmark Results (click to expand)
$ node perf/benchmark.js -g 'series|waterfall|auto|parallel'
Latest tag is  v2.3.0
Comparing v2.3.0 with current on Node v7.7.4
--------------------------------------
(node:29289) DeprecationWarning: Calling an asynchronous function without callback is deprecated.
eachSeries(10) v2.3.0 x 51,744 ops/sec ±1.47% (27 runs sampled), 0.0193ms per run
eachSeries(10) current x 54,845 ops/sec ±1.50% (29 runs sampled), 0.0182ms per run
current is faster
--------------------------------------
eachSeries(300) v2.3.0 x 2,718 ops/sec ±2.76% (29 runs sampled), 0.368ms per run
eachSeries(300) current x 3,108 ops/sec ±1.69% (25 runs sampled), 0.322ms per run
current is faster
--------------------------------------
eachSeries(10000) v2.3.0 x 87.15 ops/sec ±2.30% (29 runs sampled), 11.5ms per run
eachSeries(10000) current x 106 ops/sec ±0.96% (30 runs sampled), 9.42ms per run
current is faster
--------------------------------------
mapSeries(10) v2.3.0 x 51,520 ops/sec ±1.06% (31 runs sampled), 0.0194ms per run
mapSeries(10) current x 55,203 ops/sec ±1.58% (30 runs sampled), 0.0181ms per run
current is faster
--------------------------------------
mapSeries(300) v2.3.0 x 2,832 ops/sec ±0.96% (30 runs sampled), 0.353ms per run
mapSeries(300) current x 3,108 ops/sec ±1.62% (32 runs sampled), 0.322ms per run
current is faster
--------------------------------------
mapSeries(10000) v2.3.0 x 85.19 ops/sec ±2.45% (29 runs sampled), 11.7ms per run
mapSeries(10000) current x 95.28 ops/sec ±2.42% (27 runs sampled), 10.5ms per run
current is faster
--------------------------------------
eachOfSeries(10) v2.3.0 x 55,509 ops/sec ±1.51% (30 runs sampled), 0.0180ms per run
eachOfSeries(10) current x 58,832 ops/sec ±2.64% (29 runs sampled), 0.0170ms per run
current is faster
--------------------------------------
eachOfSeries(300) v2.3.0 x 3,000 ops/sec ±1.96% (28 runs sampled), 0.333ms per run
eachOfSeries(300) current x 3,369 ops/sec ±1.01% (31 runs sampled), 0.297ms per run
current is faster
--------------------------------------
eachOfSeries(10000) v2.3.0 x 93.05 ops/sec ±1.88% (32 runs sampled), 10.7ms per run
eachOfSeries(10000) current x 103 ops/sec ±2.38% (30 runs sampled), 9.67ms per run
current is faster
--------------------------------------
parallel(10) v2.3.0 x 79,183 ops/sec ±1.84% (29 runs sampled), 0.0126ms per run
parallel(10) current x 93,866 ops/sec ±2.27% (30 runs sampled), 0.0107ms per run
current is faster
--------------------------------------
parallel(100) v2.3.0 x 17,376 ops/sec ±1.95% (30 runs sampled), 0.0576ms per run
parallel(100) current x 25,385 ops/sec ±2.02% (29 runs sampled), 0.0394ms per run
current is faster
--------------------------------------
parallel(1000) v2.3.0 x 1,952 ops/sec ±1.56% (29 runs sampled), 0.512ms per run
parallel(1000) current x 3,011 ops/sec ±2.09% (29 runs sampled), 0.332ms per run
current is faster
--------------------------------------
series(10) v2.3.0 x 48,445 ops/sec ±2.92% (24 runs sampled), 0.0206ms per run
series(10) current x 57,840 ops/sec ±1.63% (30 runs sampled), 0.0173ms per run
current is faster
--------------------------------------
series(100) v2.3.0 x 7,491 ops/sec ±1.65% (30 runs sampled), 0.133ms per run
series(100) current x 8,525 ops/sec ±2.66% (28 runs sampled), 0.117ms per run
current is faster
--------------------------------------
series(1000) v2.3.0 x 814 ops/sec ±2.15% (31 runs sampled), 1.23ms per run
series(1000) current x 957 ops/sec ±2.25% (30 runs sampled), 1.05ms per run
current is faster
--------------------------------------
waterfall(10) v2.3.0 x 42,529 ops/sec ±1.92% (30 runs sampled), 0.0235ms per run
waterfall(10) current x 47,484 ops/sec ±2.51% (30 runs sampled), 0.0211ms per run
current is faster
--------------------------------------
waterfall(100) v2.3.0 x 6,180 ops/sec ±2.59% (31 runs sampled), 0.162ms per run
waterfall(100) current x 7,322 ops/sec ±1.33% (27 runs sampled), 0.137ms per run
current is faster
--------------------------------------
waterfall(1000) v2.3.0 x 656 ops/sec ±1.26% (31 runs sampled), 1.52ms per run
waterfall(1000) current x 771 ops/sec ±1.50% (30 runs sampled), 1.30ms per run
current is faster
--------------------------------------
auto(5) v2.3.0 x 36,391 ops/sec ±2.34% (30 runs sampled), 0.0275ms per run
auto(5) current x 39,932 ops/sec ±2.13% (29 runs sampled), 0.0250ms per run
current is faster
--------------------------------------
auto(10) v2.3.0 x 20,148 ops/sec ±2.08% (29 runs sampled), 0.0496ms per run
auto(10) current x 21,765 ops/sec ±1.94% (29 runs sampled), 0.0459ms per run
current is faster
--------------------------------------
auto(100) v2.3.0 x 866 ops/sec ±1.97% (30 runs sampled), 1.16ms per run
auto(100) current x 903 ops/sec ±1.99% (29 runs sampled), 1.11ms per run
current is faster
--------------------------------------
current faster overall (34.8ms total vs. 40ms total)
current won more benchmarks (21 vs. 0)

We're also competitive with Bluebird again:

Bluebird benchmarks (click to expand)
results for 10000 parallel executions, 1 ms per I/O op

file time(ms) memory(MB)
callbacks-baseline.js 247 71.74
callbacks-suguru03-neo-async-parallel.js 369 89.66
callbacks-caolan-async-parallel.js 500 122.13
promises-bluebird-generator.js 528 105.73
promises-lvivski-davy.js 554 163.70
promises-bluebird.js 620 103.58
promises-cujojs-when.js 736 166.95
//...

Platform info:
Linux 4.8.0-41-generic x64
Node.JS 7.7.4
V8 5.5.372.42
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz × 8


results for 10000 parallel executions, 1 ms per I/O op

file time(ms) memory(MB)
callbacks-baseline.js 113 26.87
callbacks-suguru03-neo-async-waterfall.js 147 36.47
promises-bluebird-generator.js 211 37.81
callbacks-caolan-async-waterfall.js 222 46.56
promises-bluebird.js 247 47.08
promises-then-promise.js 387 86.44
promises-lvivski-davy.js 399 99.09
promises-cujojs-when.js 403 72.13
promises-tildeio-rsvp.js 415 78.18
promises-dfilatov-vow.js 614 134.69
promises-calvinmetcalf-lie.js 677 162.16
generators-tj-co.js 713 112.79
//...

Platform info:
Linux 4.8.0-41-generic x64
Node.JS 7.7.4
V8 5.5.372.42
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz × 8


It might be worth getting rid of rest() completely, but the gains will be less.

Nothing beats native ...rest params in ES6, though...

$ node perf/benchmark.js -g 'waterfall'
Latest tag is  v2.3.0
Comparing v2.3.0 with current on Node v7.7.4
--------------------------------------
waterfall(10) v2.3.0 x 41,937 ops/sec ±1.54% (23 runs sampled), 0.0238ms per run
waterfall(10) current x 56,884 ops/sec ±2.12% (29 runs sampled), 0.0176ms per run
current is faster
--------------------------------------
waterfall(100) v2.3.0 x 6,061 ops/sec ±2.00% (30 runs sampled), 0.165ms per run
waterfall(100) current x 8,876 ops/sec ±2.65% (28 runs sampled), 0.113ms per run
current is faster
--------------------------------------
waterfall(1000) v2.3.0 x 638 ops/sec ±3.13% (29 runs sampled), 1.57ms per run
waterfall(1000) current x 1,024 ops/sec ±2.06% (30 runs sampled), 0.977ms per run
current is faster
--------------------------------------
current faster overall (1.11ms total vs. 1.76ms total)
current won more benchmarks (3 vs. 0)

-----------

results for 10000 parallel executions, 1 ms per I/O op

file                                       time(ms)  memory(MB)
callbacks-baseline.js                           118       24.98
callbacks-caolan-async-waterfall.js             202       45.82
callbacks-suguru03-neo-async-waterfall.js       203       36.24
promises-bluebird-generator.js                  218       37.35
promises-bluebird.js                            264       53.38

@megawac
Copy link
Collaborator

megawac commented Apr 3, 2017

Unless JavaScript engines have changed how they handle arguments I question these results and suggest you test in a couple different environments (different versions of node, different versions of chrome, firefox, safari, etc).

@jdalton, @jridgewell and I investigated this pretty closely about 2 years ago. Reference ticket jashkenas/underscore#2140.

@megawac
Copy link
Collaborator

megawac commented Apr 3, 2017

Also we changed from slice to rest about a year ago after observing similar results when profiling

@hargasinski
Copy link
Collaborator

@megawac v8, at least with TurboFan, appears to be optimizing the arguments object. There's this design doc from their wiki that goes a little into the detail of TurboFan's vs CrankShaft's handling of arguments and what they are working on.

From what I understand from petkaantonov's Optimization killers and what the design doc says about CrankShaft, this used to be deoptimized, due to passing arguments to another function while overRest's implementation creates an Array inline. If v8 now

always materializes the objects right after the prologue if they’re being used in the function.

then using slice with arguments is a quick substitute for a rest parameter, and we don't need the other functionality that overRest provides.

But yes, this will (most likely) be slower in other browsers/older versions of Node.

lib/auto.js Outdated
@@ -4,7 +4,7 @@ import indexOf from 'lodash/_baseIndexOf';
import isArray from 'lodash/isArray';
import okeys from 'lodash/keys';
import noop from 'lodash/noop';
import rest from './internal/rest';
import slice from 'lodash/slice';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use ./internal/slice here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, looks like my auto-import plugin picked up the wrong slice.

lib/seq.js Outdated
@@ -1,5 +1,6 @@
import noop from 'lodash/noop';
import rest from './internal/rest';
import slice from 'lodash/slice';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before, sorry, is it possible to use ./internal/slice here?

@aearly
Copy link
Collaborator Author

aearly commented Apr 4, 2017

I'm seeing a similar performance gain with node 0.12:

Latest tag is  v2.3.0
Comparing v2.3.0 with current on Node v0.12.18
--------------------------------------
parallel(10) v2.3.0 x 44,422 ops/sec ±2.65% (28 runs sampled), 0.0225ms per run
parallel(10) current x 53,069 ops/sec ±3.36% (29 runs sampled), 0.0188ms per run
current is faster
--------------------------------------
parallel(100) v2.3.0 x 11,965 ops/sec ±0.61% (32 runs sampled), 0.0836ms per run
parallel(100) current x 18,318 ops/sec ±3.63% (30 runs sampled), 0.0546ms per run
current is faster
--------------------------------------
parallel(1000) v2.3.0 x 1,375 ops/sec ±0.74% (31 runs sampled), 0.727ms per run
parallel(1000) current x 2,430 ops/sec ±3.29% (29 runs sampled), 0.412ms per run
current is faster
--------------------------------------
waterfall(10) v2.3.0 x 28,799 ops/sec ±1.66% (30 runs sampled), 0.0347ms per run
waterfall(10) current x 31,131 ops/sec ±4.33% (29 runs sampled), 0.0321ms per run
current is faster
--------------------------------------
waterfall(100) v2.3.0 x 4,649 ops/sec ±1.48% (31 runs sampled), 0.215ms per run
waterfall(100) current x 5,377 ops/sec ±1.88% (29 runs sampled), 0.186ms per run
current is faster
--------------------------------------
waterfall(1000) v2.3.0 x 492 ops/sec ±1.51% (32 runs sampled), 2.03ms per run
waterfall(1000) current x 580 ops/sec ±1.41% (31 runs sampled), 1.73ms per run
current is faster
--------------------------------------
current faster overall (2.43ms total vs. 3.12ms total)
current won more benchmarks (6 vs. 0)

(Benchmarks don't work in node 0.10). Currently working on a browser benchmark.

I looked at the benchmarks associated with that underscore ticket. The slice implementation uses arrayProto.slice.call(arguments) which will always be slower due to .call().

@megawac
Copy link
Collaborator

megawac commented Apr 4, 2017

In that case, there were also benchmarks created for lodash which was previously using _.slice before restParam was added; in that case IIRC the perf gain was slight but the biggest gain was readability.

lib/doDuring.js Outdated
@@ -28,11 +28,12 @@ export default function doDuring(fn, test, callback) {
var _fn = wrapAsync(fn);
var _test = wrapAsync(test);

var next = rest(function(err, args) {
if (err) return callback(err);
function next(err/*, args...*/) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mind changing comments to /*, ...args*/ (a bit better semantically)

@megawac
Copy link
Collaborator

megawac commented Apr 4, 2017

then using slice with arguments is a quick substitute for a rest parameter, and we don't need the other functionality that overRest provides.
But yes, this will (most likely) be slower in other browsers/older versions of Node.

I haven't looked into this in a while but it used to be debilitating in Firefox, okay in Chrome and IE, and pretty bad in Opera. Please make some browser tests before merging this

@@ -106,6 +105,37 @@ describe("waterfall", function () {
}).to.throw(/already called/);
});

it('multiple callback calls (trickier) @nodeonly', function(done){
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@suguru03 I think neo-async fails on this test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix it, thanks. :)

@aearly
Copy link
Collaborator Author

aearly commented Apr 4, 2017

We started using rest between v1.3.0 and v1.4.0.

Node 7 Benchmarks

$ node perf/benchmark.js -g 'waterfall|parallel' v1.3.0 v1.4.0
Comparing v1.3.0 with v1.4.0 on Node v7.7.4
--------------------------------------
parallel(10) v1.3.0 x 95,854 ops/sec ±0.91% (27 runs sampled), 0.0104ms per run
parallel(10) v1.4.0 x 84,751 ops/sec ±2.31% (27 runs sampled), 0.0118ms per run
v1.3.0 is faster
--------------------------------------
parallel(100) v1.3.0 x 24,968 ops/sec ±1.95% (27 runs sampled), 0.0401ms per run
parallel(100) v1.4.0 x 20,407 ops/sec ±1.51% (28 runs sampled), 0.0490ms per run
v1.3.0 is faster
--------------------------------------
parallel(1000) v1.3.0 x 3,037 ops/sec ±1.52% (28 runs sampled), 0.329ms per run
parallel(1000) v1.4.0 x 2,303 ops/sec ±1.63% (29 runs sampled), 0.434ms per run
v1.3.0 is faster
--------------------------------------
waterfall(10) v1.3.0 x 19,206 ops/sec ±3.37% (24 runs sampled), 0.0521ms per run
waterfall(10) v1.4.0 x 18,883 ops/sec ±2.88% (27 runs sampled), 0.0530ms per run
Tie
--------------------------------------
waterfall(100) v1.3.0 x 2,589 ops/sec ±2.26% (29 runs sampled), 0.386ms per run
waterfall(100) v1.4.0 x 2,427 ops/sec ±3.31% (28 runs sampled), 0.412ms per run
v1.3.0 is faster
--------------------------------------
waterfall(1000) v1.3.0 x 271 ops/sec ±3.85% (28 runs sampled), 3.69ms per run
waterfall(1000) v1.4.0 x 243 ops/sec ±2.90% (27 runs sampled), 4.11ms per run
v1.3.0 is faster
--------------------------------------
v1.3.0 faster overall (4.51ms total vs. 5.07ms total)
v1.3.0 won more benchmarks (5 vs. 0)


Node 0.10 Benchmarks

$ node perf/benchmark.js -g 'waterfall|parallel' v1.3.0 v1.4.0
Comparing v1.3.0 with v1.4.0 on Node v0.10.48
--------------------------------------
parallel(10) v1.3.0 x 36,982 ops/sec ±3.28% (26 runs sampled), 0.0270ms per run
parallel(10) v1.4.0 x 34,188 ops/sec ±2.63% (26 runs sampled), 0.0293ms per run
v1.3.0 is faster
--------------------------------------
parallel(100) v1.3.0 x 9,111 ops/sec ±2.07% (30 runs sampled), 0.110ms per run
parallel(100) v1.4.0 x 7,572 ops/sec ±2.34% (27 runs sampled), 0.132ms per run
v1.3.0 is faster
--------------------------------------
parallel(1000) v1.3.0 x 1,051 ops/sec ±3.07% (27 runs sampled), 0.952ms per run
parallel(1000) v1.4.0 x 782 ops/sec ±3.82% (28 runs sampled), 1.28ms per run
v1.3.0 is faster
--------------------------------------
waterfall(10) v1.3.0 x 20,187 ops/sec ±2.54% (27 runs sampled), 0.0495ms per run
waterfall(10) v1.4.0 x 17,901 ops/sec ±2.79% (26 runs sampled), 0.0559ms per run
v1.3.0 is faster
--------------------------------------
waterfall(100) v1.3.0 x 2,892 ops/sec ±2.27% (25 runs sampled), 0.346ms per run
waterfall(100) v1.4.0 x 2,662 ops/sec ±1.64% (27 runs sampled), 0.376ms per run
v1.3.0 is faster
--------------------------------------
waterfall(1000) v1.3.0 x 315 ops/sec ±2.22% (31 runs sampled), 3.17ms per run
waterfall(1000) v1.4.0 x 285 ops/sec ±3.53% (25 runs sampled), 3.51ms per run
v1.3.0 is faster
--------------------------------------
v1.3.0 faster overall (4.66ms total vs. 5.38ms total)
v1.3.0 won more benchmarks (6 vs. 0)


It was a 10-12% performance regression. This PR uses a more streamlined version of slice, so the gains are a bit more.

Still working on the browser benchmark, jsperf.com is being a PITA right now....

@aearly
Copy link
Collaborator Author

aearly commented Apr 4, 2017

Here's a preliminary benchmark: https://jsperf.com/async-rest-vs-slice8/1

It's really uninteresting, because the 1ms+ setImmediate hides all perf differences.

@aearly
Copy link
Collaborator Author

aearly commented Apr 4, 2017

Here's a more throughput-ey test: https://jsperf.com/async-rest-vs-slice9/1

The slice strategy seems to a small advantage here.

lib/apply.js Outdated
@@ -44,8 +45,10 @@ import rest from './internal/rest';
* two
* three
*/
export default rest(function(fn, args) {
return rest(function(callArgs) {
export default function(fn/*, args*/) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/, ...args/

wrapAsync(fn).apply(null, args.concat(rest(function (err, args) {
return function (fn/*, ...args*/) {
var args = slice(arguments, 1);
wrapAsync(fn).apply(null, args.concat(function (err/*, ...args*/) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: args.push here could avoid an extra array being created


export default function (fn) {
return rest(function (args/*..., callback*/) {
return function (/*...args, callback*/) {
var args = slice(arguments);
var callback = args.pop();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pop can be avoided, slice(arguments, 0, -1), callback = arguments[arguments.length - 1]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna leave as-is. The streamlined version of slice doesn't support the 3rd end param.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

lib/reflect.js Outdated
value = cbArgs[0];
} else if (cbArgs.length > 1) {
value = cbArgs;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer this condition, except removing the else if and replacing with an else

lib/waterfall.js Outdated
@@ -68,24 +68,23 @@ export default function(tasks, callback) {
if (!isArray(tasks)) return callback(new Error('First argument to waterfall must be an array of functions'));
if (!tasks.length) return callback();
var taskIndex = 0;
var args = [];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating this variable I would consider modifying nextTask to take args as a param (nextTasks(args)). In next, we would just call nextTask(slice(arguments, 1))

lib/whilst.js Outdated
if (err) return callback(err);
if (test()) return _iteratee(next);
callback.apply(null, [null].concat(args));
});
callback.apply(null, [null].concat(slice(arguments, 1)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I long for splats 💨

@megawac
Copy link
Collaborator

megawac commented Apr 4, 2017

Thanks for looking into the perf

@@ -1,7 +1,7 @@
import isArray from 'lodash/isArray';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice cleanup of waterfall ⭐

@aearly
Copy link
Collaborator Author

aearly commented Apr 5, 2017

This benchmark might shed some light into the performance numbers: https://jsperf.com/arguments-handling4/1

The inline slice strategy is the fastest (but ugliest), followed by a statically wrapped restParams function, then by a specialized slice function. The [].slice.call strategy and a dynamically wrapped restParams function trade off for last place, wether you're in in Chrome or Firefox.

We were essentially doing the dynamically wrapped version of restParams, by creating a wrapped function in each iteration of parallel or waterfall, leaving room for a specialized slice function.

return rest(function(callArgs) {
export default function(fn/*, ...args*/) {
var args = slice(arguments, 1);
return function(/*callArgs*/) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ...callArgs

Copy link
Collaborator

@hargasinski hargasinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

@aearly aearly merged commit b3679d5 into master Apr 7, 2017
@aearly aearly deleted the waterfall-optimization branch April 7, 2017 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants