
Process startup time on ARM #2138

Closed
rvagg opened this issue Jul 9, 2015 · 24 comments
Labels
arm Issues and PRs related to the ARM platform. performance Issues and PRs related to the performance of Node.js.

Comments

@rvagg
Member

rvagg commented Jul 9, 2015

@nodejs/build are talking about how to improve test time on the ARM machines and we're tackling it on multiple fronts. By far the biggest challenge we're facing with the test suite, however, is node process startup time, and that isn't something we can do much about from our end, so I'm wondering if others here (@bnoordhuis, @trevnorris, others?) can offer suggestions on whether and how we might be able to improve things.

On a Pi1 B+ (ARMv6):

$ time ./iojs -pe process.versions
{ http_parser: '2.5.0',
  node: '2.3.4-pre',
  v8: '4.2.77.20',
  uv: '1.6.1',
  zlib: '1.2.8',
  ares: '1.10.1-DEV',
  modules: '44',
  openssl: '1.0.2c' }

real    0m3.148s
user    0m2.950s
sys 0m0.160s

$ time ./iojs -pe true
true

real    0m3.172s
user    0m2.990s
sys 0m0.140s
$ time ./iojs -v
v2.3.4-pre

real    0m0.035s
user    0m0.010s
sys 0m0.010s

Which explains why we can't get below ~2.5h for build+test (we've managed to squeeze the build part of that down to ~12m with aggressive caching strategies).

Then this:

$ strace ./iojs -pe true 2>&1 | wc
   3690   23609  218443

vs this on an x64 machine:

$ strace ./iojs -pe true 2>&1 | wc
    327    2038   23875

Which seems like a huge discrepancy to me, even considering the architecture differences.
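
If anyone wants to dig into where those extra calls come from, a per-syscall summary is probably the quickest view. Untested sketch from memory: -c aggregates count and time per syscall, -f follows any threads the process spawns.

$ strace -c -f ./iojs -pe true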

I can't recall the details of what happened with snapshots on ARM, but they are currently on by default, even on ARM:

$ grep snapshot config.gypi 
                 'v8_use_snapshot': 1,

Anyone want to take a look or suggest ways to investigate this further?

@rvagg
Member Author

rvagg commented Jul 9, 2015

also /cc @wolfeidau

@trevnorris
Contributor

I'd recommend starting by getting times for every phase of startup.

@bnoordhuis Would that be possible using perf and setting the sampling interval super low? Other than that I can't think of anything other than patching the code to print the times.
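
A rough, no-patching approximation might be to look at strace's timestamps; long userland pauses (where startup work is actually happening) show up as big gaps between syscalls. Something like this, flags from memory and untested on the Pi:

$ strace -r -T -o startup.trace ./iojs -e 0   # -r: time since previous syscall, -T: time spent inside each syscall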

@wolfeidau

Have you tried something as simple as running it from /dev/shm, just to see how much effect I/O is having on loading the suite?

Pretty much rsync over the project and give it a whirl.

NOTE: This uses shared memory, so you probably wanna do it on a newer ARM device like the Pi2, which has more RAM.
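
Something along these lines should do it (paths are just an example):

$ rsync -a ~/io.js/ /dev/shm/io.js/   # copy the built tree into tmpfs
$ time /dev/shm/io.js/iojs -pe true   # tmpfs copy
$ time ~/io.js/iojs -pe true          # on-disk copy, for comparison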

@bnoordhuis
Member

@wolfeidau's suggestion is good.

@trevnorris perf stat -r 10 iojs -e 0 and perf record -F 1009 sh -c 'for i in {1..10}; do iojs -e 0; done' (followed by perf report) should do it. Maybe try alternating -F 1009 with -c 40000.

@rvagg I'd be interested to know what perf evlist -v prints on the devices.

@silverwind
Contributor

@wolfeidau doesn't seem like disk I/O is a limiting factor on an RPi2 (ARMv7):

/dev/shm/iojs -p process.versions  0.93s user 0.08s system 99% cpu 1.019 total
/dev/shm/iojs -p process.versions  1.00s user 0.07s system 99% cpu 1.074 total
/dev/shm/iojs -p process.versions  1.01s user 0.02s system 99% cpu 1.038 total
/dev/shm/iojs -p process.versions  0.96s user 0.02s system 99% cpu 0.988 total
/dev/shm/iojs -p process.versions  0.85s user 0.08s system 99% cpu 0.935 total
/dev/shm/iojs -p process.versions  1.00s user 0.03s system 99% cpu 1.040 total
/dev/shm/iojs -p process.versions  1.02s user 0.05s system 99% cpu 1.075 total
/usr/local/bin/iojs -p process.versions  1.00s user 0.05s system 99% cpu 1.051 total
/usr/local/bin/iojs -p process.versions  0.93s user 0.04s system 99% cpu 0.976 total
/usr/local/bin/iojs -p process.versions  1.01s user 0.06s system 99% cpu 1.080 total
/usr/local/bin/iojs -p process.versions  0.98s user 0.06s system 99% cpu 1.045 total
/usr/local/bin/iojs -p process.versions  0.96s user 0.04s system 99% cpu 1.004 total
/usr/local/bin/iojs -p process.versions  0.90s user 0.06s system 98% cpu 0.970 total
/usr/local/bin/iojs -p process.versions  0.99s user 0.06s system 99% cpu 1.052 total

$ dd if=/dev/zero of=test bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 44.3668 s, 9.2 MB/s
$ reboot # to avoid caching
$ dd if=test of=/dev/null bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 21.2704 s, 19.3 MB/s

@mathiask88
Contributor

What about clustering some Pis? Just an idea, I don't have any experience with that.

@rvagg
Member Author

rvagg commented Jul 10, 2015

The latest optimisation I've attempted is mounting the whole build directory for Jenkins via NFS (from a shared SSD). It's not as dramatic as I'd hoped, perhaps a marginal speed increase. Overall test run times do seem to have come down a little.

One hold-up, though, is that we're now getting consistent failures on pi2 builds (not pi1+, interestingly!) for two tests that sound NFS-related: test-fs-readfilesync-pipe-large.js and test-fs-readfile-pipe-large.js, e.g. https://jenkins-iojs.nodesource.com/job/iojs+pr+arm/nodes=pi2-raspbian-wheezy/124/console

I'm either going to have to figure that out or revert the NFS introduction on pi2. Any help there would be appreciated.

@mathiask88 the next optimisation we have underway is actually doing that, in a simple form: having test.py be able to (1) skip the first X tests and (2) run only every Y-th test. Then we can cluster 2 or 3 Pis together and have them all build, but each run only every 2nd or 3rd test. @orangemocha has the initial work for this on his TODO list.
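
In the meantime, something hand-rolled could emulate the striding. Rough, untested sketch, assuming the tests live under test/parallel/ and three Pis numbered 0-2:

$ w=0; n=3   # this Pi's index and the cluster size
$ ls test/parallel/test-*.js | awk -v w="$w" -v n="$n" 'NR % n == w' | xargs -n1 ./iojs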

But, given that the greatest cost is this process startup time, I'm hoping that someone can help work it out.

$ perf stat -r 10 ./iojs -e 0

 Performance counter stats for './iojs -e 0' (10 runs):

       2882.757600      task-clock (msec)         #    0.990 CPUs utilized            ( +-  0.27% )
               120      context-switches          #    0.042 K/sec                    ( +-  1.95% )
                 0      cpu-migrations            #    0.000 K/sec                  
             2,240      page-faults               #    0.777 K/sec                    ( +-  0.29% )
     1,986,814,275      cycles                    #    0.689 GHz                      ( +-  0.24% ) [49.99%]
     1,049,532,425      stalled-cycles-frontend   #   52.82% frontend cycles idle     ( +-  0.24% ) [50.14%]
         7,850,846      stalled-cycles-backend    #    0.40% backend  cycles idle     ( +-  1.67% ) [50.42%]
       354,572,049      instructions              #    0.18  insns per cycle        
                                                  #    2.96  stalled cycles per insn  ( +-  0.20% ) [33.96%]
        55,000,800      branches                  #   19.079 M/sec                    ( +-  0.33% ) [33.68%]
         8,304,929      branch-misses             #   15.10% of all branches          ( +-  0.28% ) [33.38%]

       2.912146671 seconds time elapsed                                          ( +-  0.41% )

When I do this:

$ perf record -F 1009 sh -c 'for i in {1..10}; do ./iojs -e 0; done'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.004 MB perf.data (~194 samples) ]

but perf report says there are no samples.

@rvagg
Member Author

rvagg commented Jul 10, 2015

Pretty sure that failure was because of sh vs bash and the {1..10} brace expansion, which I don't think dash supports.

$ perf record -v -F 1009 bash -c 'for i in {1..10}; do ./iojs -e 0; done'
mmap size 528384B
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.028 MB perf.data (~1212 samples) ]
failed to write feature 2
failed to write feature 9
failed to write feature 14

But I still get no samples from perf report.

$ perf evlist -v
cycles: sample_freq=1009, size: 96, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, comm_exec: 1, freq: 1, enable_on_exec: 1, sample_id_all: 1, exclude_guest: 1
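
(For the record, a seq loop would sidestep the brace expansion issue if we ever need this to run under plain sh/dash:)

$ perf record -F 1009 sh -c 'for i in $(seq 1 10); do ./iojs -e 0; done'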

@rvagg
Member Author

rvagg commented Jul 10, 2015

Playing with -F and -c doesn't seem to make a difference or get me any helpful samples. This is perf 3.18 btw.
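
One more thing I could try, in case the PMU on these boards isn't exposing the hardware cycles event to perf: sampling on the software clock event instead. Untested here:

$ perf record -e cpu-clock -F 1009 bash -c 'for i in {1..10}; do ./iojs -e 0; done'
$ perf report --stdio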

@brendanashworth brendanashworth added the arm Issues and PRs related to the ARM platform. label Jul 13, 2015
@brendanashworth
Contributor

This will probably help: #2248

@Fishrock123
Contributor

@rvagg any idea if this has improved at all?

@rvagg
Member Author

rvagg commented Aug 28, 2015

I think we've had a minor improvement via some recent changes, but it's still too slow and it's doing so much more on ARM than on x86.

@ronkorving
Contributor

With the latest commits that re-enable some previously broken V8 ARM optimizations, is there any improvement?

@rvagg
Member Author

rvagg commented Sep 4, 2015

Non-scientific assessment: process startup seems to have improved somewhat on ARM, and test run times are coming down with all of the various ways we are attacking this.

@ronkorving
Contributor

Awesome work. By the way, do we know how much of the slowness comes from V8, libuv or Node itself? I wonder what the ratio is there. (the strace discrepancy points in the direction of libuv or V8-level issues, does it not?)

@reqshark

reqshark commented Sep 4, 2015

That's a good question @ronkorving. I would think V8 costs more to start up than libuv because of the amount of memory it tries to provision for itself ahead of time. But remember there's always high overhead in process startup; even in pure C, forking carries a big overhead with all the extra work we're asking of the kernel.

@bnoordhuis
Member

@rvagg If you can give me access to an ARM machine, I'll try some profiling. I may need root access.

Just a hunch but I speculate that dynamic linker overhead, OpenSSL entropy pool seeding and V8 snapshot deserialization are the biggest cost centers.
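
The dynamic-linker part of that hunch is cheap to sanity-check on a glibc system before any profiling; the loader can print its own startup-time and relocation statistics to stderr:

$ LD_DEBUG=statistics ./iojs -e 0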

@rvagg
Member Author

rvagg commented Sep 4, 2015

@bnoordhuis: ssh pi@104.131.173.199 -p 2201. That user has sudo access, and the iojs user has all the ccache data in /home/iojs/.ccache if you compile with PATH=/usr/lib/ccache/:$PATH. I've taken it out of CI for now; let me know when you're done with it.
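
(i.e. roughly this, assuming the usual configure/make build:)

$ export PATH=/usr/lib/ccache/:$PATH
$ ./configure
$ make -j"$(nproc)"   # parallel build; ccache should make rebuilds fast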

@bnoordhuis
Member

@rvagg I can't connect. The address is pingable and the port is open but nothing is sent when I connect with nc.

@rvagg
Member Author

rvagg commented Sep 4, 2015

@bnoordhuis try again. Also maybe install screen/tmux on the pi if/when you get in, just in case it fails again .. last time I did this for someone it was flaky as well and I'm not sure why.

@bnoordhuis
Member

@rvagg Thanks, I'm done.

I speculate that dynamic linker overhead, OpenSSL entropy pool seeding and V8 snapshot deserialization are the biggest cost centers.

Interestingly enough, it's none of the above: the largest cost centers for a null program turn out to be parsing and code generation... our bootstrap code comes to 210 kB now.

I can think of ways of minimizing that - mostly by splitting up src/node.js and some of the files in lib/ into smaller files and only loading what is necessary - but that's a fair amount of work.

For the record, below are all the scripts that are loaded for an empty program:

4.0K    lib/_linklist.js
4.0K    lib/_stream_duplex.js
4.0K    lib/_stream_passthrough.js
24K     lib/_stream_readable.js
8.0K    lib/_stream_transform.js
16K     lib/_stream_writable.js
12K     lib/assert.js
24K     lib/buffer.js
4.0K    lib/constants.js
12K     lib/events.js
48K     lib/fs.js
4.0K    lib/internal/util.js
16K     lib/module.js
16K     lib/path.js
4.0K    lib/stream.js
16K     lib/timers.js
24K     lib/util.js
4.0K    lib/vm.js
28K     src/node.js

lib/assert.js is probably the lowest-hanging fruit; it's only used by lib/module.js and lib/timers.js.

@Fishrock123
Contributor

Hmm, I was under the impression that this should be less of an issue now that we've turned snapshots on.

@ofrobots / @domenic, any way we can improve this other than not loading code?

Can any of this be like, precompiled?

@domenic
Contributor

domenic commented Sep 5, 2015

Only skimmed the thread so far, but very possibly this might be a good place to apply my V8 extras work, by precompiling code into the snapshot. https://bit.ly/v8-extras

@bnoordhuis
Member

I think that could work. Modules are basically IIFEs and don't execute code until first use.

I played around with moving code and splitting up files, but the results are only so-so for the non-trivial case: #2714

@ChALkeR ChALkeR added the performance Issues and PRs related to the performance of Node.js. label Feb 16, 2016
@rvagg rvagg closed this as completed Jun 30, 2016