
Process startup time on ARM #2138

Closed
rvagg opened this issue Jul 9, 2015 · 24 comments
Labels
arm Issues and PRs related to the ARM platform. performance Issues and PRs related to the performance of Node.js.

Comments

@rvagg
Member

rvagg commented Jul 9, 2015

@nodejs/build are talking about how to improve test time on the ARM machines and we're tackling it on multiple fronts. By far the biggest challenge we're facing with the test suite, however, is node process startup time, and that isn't something we can do much about from our end, so I'm wondering if others here (@bnoordhuis, @trevnorris, others?) can offer suggestions on whether and how we might be able to improve things.

On a Pi1 B+ (ARMv6):

$ time ./iojs -pe process.versions
{ http_parser: '2.5.0',
  node: '2.3.4-pre',
  v8: '4.2.77.20',
  uv: '1.6.1',
  zlib: '1.2.8',
  ares: '1.10.1-DEV',
  modules: '44',
  openssl: '1.0.2c' }

real    0m3.148s
user    0m2.950s
sys 0m0.160s

$ time ./iojs -pe true
true

real    0m3.172s
user    0m2.990s
sys 0m0.140s
$ time ./iojs -v
v2.3.4-pre

real    0m0.035s
user    0m0.010s
sys 0m0.010s

Which explains why we can't get below ~2.5h for build+test (we've managed to squeeze the build part of that down to ~12m with aggressive caching strategies).

Then this:

$ strace ./iojs -pe true 2>&1 | wc
   3690   23609  218443

vs this on an x64 machine:

$ strace ./iojs -pe true 2>&1 | wc
    327    2038   23875

Which seems like a huge discrepancy to me, even considering the architecture differences.
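
If anyone wants to dig into where those extra calls come from, a per-syscall summary is probably the quickest view. Untested sketch from memory: -c aggregates count and time per syscall, -f follows any threads the process spawns.

$ strace -c -f ./iojs -pe true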

I can't recall the details of what happened with snapshots on ARM, but they are currently on by default, even on ARM:

$ grep snapshot config.gypi 
                 'v8_use_snapshot': 1,

Anyone want to take a look or suggest ways to investigate this further?

@rvagg
Member Author

rvagg commented Jul 9, 2015

also /cc @wolfeidau

@trevnorris
Contributor

I'd recommend starting by getting times for every phase of startup.

@bnoordhuis Would that be possible using perf and setting the sampling interval super low? Other than that I can't think of anything other than patching the code to print the times.
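
A rough, no-patching approximation might be to look at strace's timestamps; long userland pauses (where startup work is actually happening) show up as big gaps between syscalls. Something like this, flags from memory and untested on the Pi:

$ strace -r -T -o startup.trace ./iojs -e 0   # -r: time since previous syscall, -T: time spent inside each syscall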

@wolfeidau

Have you tried something as simple as running it from /dev/shm, just to see how much effect I/O is having on loading the suite?

Pretty much rsync over the project and give it a whirl.

NOTE: This uses shared memory, so you probably wanna do it on a newer ARM device like the Pi2, which has more RAM.
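
Something along these lines should do it (paths are just an example):

$ rsync -a ~/io.js/ /dev/shm/io.js/   # copy the built tree into tmpfs
$ time /dev/shm/io.js/iojs -pe true   # tmpfs copy
$ time ~/io.js/iojs -pe true          # on-disk copy, for comparison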

@bnoordhuis
Member

@wolfeidau's suggestion is good.

@trevnorris perf stat -r 10 iojs -e 0 and perf record -F 1009 sh -c 'for i in {1..10}; do iojs -e 0; done' (followed by perf report) should do it. Maybe try alternating -F 1009 with -c 40000.

@rvagg I'd be interested to know what perf evlist -v prints on the devices.

@silverwind
Contributor

@wolfeidau doesn't seem like disk I/O is a limiting factor on an RPi2 (ARMv7):

/dev/shm/iojs -p process.versions  0.93s user 0.08s system 99% cpu 1.019 total
/dev/shm/iojs -p process.versions  1.00s user 0.07s system 99% cpu 1.074 total
/dev/shm/iojs -p process.versions  1.01s user 0.02s system 99% cpu 1.038 total
/dev/shm/iojs -p process.versions  0.96s user 0.02s system 99% cpu 0.988 total
/dev/shm/iojs -p process.versions  0.85s user 0.08s system 99% cpu 0.935 total
/dev/shm/iojs -p process.versions  1.00s user 0.03s system 99% cpu 1.040 total
/dev/shm/iojs -p process.versions  1.02s user 0.05s system 99% cpu 1.075 total
/usr/local/bin/iojs -p process.versions  1.00s user 0.05s system 99% cpu 1.051 total
/usr/local/bin/iojs -p process.versions  0.93s user 0.04s system 99% cpu 0.976 total
/usr/local/bin/iojs -p process.versions  1.01s user 0.06s system 99% cpu 1.080 total
/usr/local/bin/iojs -p process.versions  0.98s user 0.06s system 99% cpu 1.045 total
/usr/local/bin/iojs -p process.versions  0.96s user 0.04s system 99% cpu 1.004 total
/usr/local/bin/iojs -p process.versions  0.90s user 0.06s system 98% cpu 0.970 total
/usr/local/bin/iojs -p process.versions  0.99s user 0.06s system 99% cpu 1.052 total

$ dd if=/dev/zero of=test bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 44.3668 s, 9.2 MB/s
$ reboot # to avoid caching
$ dd if=test of=/dev/null bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 21.2704 s, 19.3 MB/s

@mathiask88
Contributor

What about clustering some Pis? Just an idea, I don't have any experience with that.

@rvagg
Member Author

rvagg commented Jul 10, 2015

The latest optimisation I've attempted is mounting the whole build directory for Jenkins via NFS (from a shared SSD). It's not as dramatic as I'd hoped, perhaps a marginal speed increase. Overall test run times do seem to have come down a little.

One hold-up, though, is that we're now getting consistent failures on pi2 builds (not pi1+, interestingly!) for two tests that sound NFS-related: test-fs-readfilesync-pipe-large.js and test-fs-readfile-pipe-large.js, e.g. https://jenkins-iojs.nodesource.com/job/iojs+pr+arm/nodes=pi2-raspbian-wheezy/124/console

I'm either going to have to figure that out or revert the NFS introduction on pi2. Any help there would be appreciated.

@mathiask88 the next optimisation we have underway is actually doing that, in a simple form: having test.py be able to (1) skip the first X tests and (2) run only every Y-th test. Then we can cluster 2 or 3 Pis together and have them all build, but each run only every 2nd or 3rd test. @orangemocha has the initial work for this on his TODO list.
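
In the meantime, something hand-rolled could emulate the striding. Rough, untested sketch, assuming the tests live under test/parallel/ and three Pis numbered 0-2:

$ w=0; n=3   # this Pi's index and the cluster size
$ ls test/parallel/test-*.js | awk -v w="$w" -v n="$n" 'NR % n == w' | xargs -n1 ./iojs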

But, given that the greatest cost is this process startup time, I'm hoping that someone can help work it out.

$ perf stat -r 10 ./iojs -e 0

 Performance counter stats for './iojs -e 0' (10 runs):

       2882.757600      task-clock (msec)         #    0.990 CPUs utilized            ( +-  0.27% )
               120      context-switches          #    0.042 K/sec                    ( +-  1.95% )
                 0      cpu-migrations            #    0.000 K/sec                  
             2,240      page-faults               #    0.777 K/sec                    ( +-  0.29% )
     1,986,814,275      cycles                    #    0.689 GHz                      ( +-  0.24% ) [49.99%]
     1,049,532,425      stalled-cycles-frontend   #   52.82% frontend cycles idle     ( +-  0.24% ) [50.14%]
         7,850,846      stalled-cycles-backend    #    0.40% backend  cycles idle     ( +-  1.67% ) [50.42%]
       354,572,049      instructions              #    0.18  insns per cycle        
                                                  #    2.96  stalled cycles per insn  ( +-  0.20% ) [33.96%]
        55,000,800      branches                  #   19.079 M/sec                    ( +-  0.33% ) [33.68%]
         8,304,929      branch-misses             #   15.10% of all branches          ( +-  0.28% ) [33.38%]

       2.912146671 seconds time elapsed                                          ( +-  0.41% )

When I do this:

$ perf record -F 1009 sh -c 'for i in {1..10}; do ./iojs -e 0; done'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.004 MB perf.data (~194 samples) ]

but perf report says there are no samples.

@rvagg
Member Author

rvagg commented Jul 10, 2015

Pretty sure that failure was because of sh vs bash and the {1..10} brace expansion, which I don't think dash supports.

$ perf record -v -F 1009 bash -c 'for i in {1..10}; do ./iojs -e 0; done'
mmap size 528384B
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.028 MB perf.data (~1212 samples) ]
failed to write feature 2
failed to write feature 9
failed to write feature 14

But I still get no samples from perf report.

$ perf evlist -v
cycles: sample_freq=1009, size: 96, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, comm_exec: 1, freq: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
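
(For the record, a seq loop would sidestep the brace expansion issue if we ever need this to run under plain sh/dash:)

$ perf record -F 1009 sh -c 'for i in $(seq 1 10); do ./iojs -e 0; done'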

@rvagg
Member Author

rvagg commented Jul 10, 2015

Playing with -F and -c doesn't seem to make a difference or get me any helpful samples. This is perf 3.18 btw.
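
One more thing I could try, in case the PMU on these boards isn't exposing the hardware cycles event to perf: sampling on the software clock event instead. Untested here:

$ perf record -e cpu-clock -F 1009 bash -c 'for i in {1..10}; do ./iojs -e 0; done'
$ perf report --stdio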

@brendanashworth brendanashworth added the arm Issues and PRs related to the ARM platform. label Jul 13, 2015
@brendanashworth
Contributor

This will probably help: #2248

@Fishrock123
Contributor

@rvagg any idea if this has improved at all?

@rvagg
Member Author

rvagg commented Aug 28, 2015

I think we've had a minor improvement via some recent changes, but it's still too slow and it's doing so much more on ARM than on x86.

@ronkorving
Contributor

With the latest commits that re-enable some previously broken V8 ARM optimizations, is there any improvement?

@rvagg
Member Author

rvagg commented Sep 4, 2015

Non-scientific assessment: process startup seems to have improved somewhat on ARM, and test run times are coming down with all of the various ways we are attacking this.

@ronkorving
Contributor

Awesome work. By the way, do we know how much of the slowness comes from V8, libuv or Node itself? I wonder what the ratio is there. (the strace discrepancy points in the direction of libuv or V8-level issues, does it not?)

@reqshark

reqshark commented Sep 4, 2015

That's a good question @ronkorving. I would think V8 costs more to start up than libuv because of the amount of memory it tries to provision for itself ahead of time. But remember there's always high overhead in process startup; even in pure C, forking carries a big overhead with all the extra work we're asking of the kernel.

@bnoordhuis
Member

@rvagg If you can give me access to an ARM machine, I'll try some profiling. I may need root access.

Just a hunch but I speculate that dynamic linker overhead, OpenSSL entropy pool seeding and V8 snapshot deserialization are the biggest cost centers.
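
The dynamic-linker part of that hunch is cheap to sanity-check on a glibc system before any profiling; the loader can print its own startup-time and relocation statistics to stderr:

$ LD_DEBUG=statistics ./iojs -e 0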

@rvagg
Member Author

rvagg commented Sep 4, 2015

@bnoordhuis: ssh pi@104.131.173.199 -p 2201. That user has sudo access, and the iojs user has all the ccache data in /home/iojs/.ccache if you compile with PATH=/usr/lib/ccache/:$PATH. I've taken it out of CI for now; let me know when you're done with it.
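
(i.e. roughly this, assuming the usual configure/make build:)

$ export PATH=/usr/lib/ccache/:$PATH
$ ./configure
$ make -j"$(nproc)"   # parallel build; ccache should make rebuilds fast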

@bnoordhuis
Member

@rvagg I can't connect. The address is pingable and the port is open but nothing is sent when I connect with nc.

@rvagg
Member Author

rvagg commented Sep 4, 2015

@bnoordhuis try again. Also maybe install screen/tmux on the pi if/when you get in, just in case it fails again .. last time I did this for someone it was flaky as well and I'm not sure why.

@bnoordhuis
Member

@rvagg Thanks, I'm done.

I speculate that dynamic linker overhead, OpenSSL entropy pool seeding and V8 snapshot deserialization are the biggest cost centers.

Interestingly enough, it's none of the above: the largest cost centers for a null program turn out to be parsing and code generation... our bootstrap code comes to 210 kB now.

I can think of ways of minimizing that - mostly by splitting up src/node.js and some of the files in lib/ into smaller files and only loading what is necessary - but that's a fair amount of work.

For the record, below are all the scripts that are loaded for an empty program:

4.0K    lib/_linklist.js
4.0K    lib/_stream_duplex.js
4.0K    lib/_stream_passthrough.js
24K     lib/_stream_readable.js
8.0K    lib/_stream_transform.js
16K     lib/_stream_writable.js
12K     lib/assert.js
24K     lib/buffer.js
4.0K    lib/constants.js
12K     lib/events.js
48K     lib/fs.js
4.0K    lib/internal/util.js
16K     lib/module.js
16K     lib/path.js
4.0K    lib/stream.js
16K     lib/timers.js
24K     lib/util.js
4.0K    lib/vm.js
28K     src/node.js

lib/assert.js is probably the lowest-hanging fruit; it's only used by lib/module.js and lib/timers.js.

@Fishrock123
Contributor

Hmm, I was under the impression that this should be less of an issue now that we've turned snapshots on.

@ofrobots / @domenic, any way we can improve this other than not loading code?

Can any of this be like, precompiled?

@domenic
Contributor

domenic commented Sep 5, 2015

Only skimmed the thread so far, but very possibly this might be a good place to apply my V8 extras work, by precompiling code into the snapshot. https://bit.ly/v8-extras

@bnoordhuis
Member

I think that could work. Modules are basically IIFEs and don't execute code until first use.

I played around with moving code and splitting up files, but the results are only so-so for the non-trivial case: #2714

@ChALkeR ChALkeR added the performance Issues and PRs related to the performance of Node.js. label Feb 16, 2016
@rvagg rvagg closed this as completed Jun 30, 2016