Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP Move event loop into JavaScript. #1003

Closed
wants to merge 1 commit into from
Closed

Conversation

ry
Copy link
Member

@ry ry commented Oct 16, 2018

No description provided.

@ry ry requested a review from piscisaureus October 16, 2018 21:25
@ry
Copy link
Member Author

ry commented Oct 17, 2018

Before:

gpu ~/src/deno> ./tools/throughput_benchmark.py out/before/deno 5000
listening on 127.0.0.1:4544
head -c 5242880000 /dev/zero | nc 127.0.0.1 4544
34.2299828529 seconds

After:

gpu ~/src/deno> ./tools/throughput_benchmark.py out/after/deno 5000
listening on 127.0.0.1:4544
head -c 5242880000 /dev/zero | nc 127.0.0.1 4544
18.5639469624 seconds

So that's better - but when looking at the http benchmark it's worse:

gpu ~/src/deno> ./tools/http_benchmark.py out/before/deno
http_benchmark testing DENO.
Listening on 127.0.0.1:4544
third_party/wrk/linux/wrk -d 10s http://127.0.0.1:4544/
Running 10s test @ http://127.0.0.1:4544/
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   617.02us    0.97ms  17.04ms   95.02%
    Req/Sec    11.29k     1.50k   13.44k    69.31%
  226927 requests in 10.10s, 11.04MB read
Requests/sec:  22467.66
Transfer/sec:      1.09MB
[...]
gpu ~/src/deno> ./tools/http_benchmark.py out/after/deno
http_benchmark testing DENO.
Listening on 127.0.0.1:4544
third_party/wrk/linux/wrk -d 10s http://127.0.0.1:4544/
Running 10s test @ http://127.0.0.1:4544/
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   779.80us    1.09ms  17.86ms   94.88%
    Req/Sec     8.46k   746.00     9.39k    92.00%
  168428 requests in 10.00s, 8.19MB read
Requests/sec:  16840.40
Transfer/sec:    838.73KB

Here's what the strace looks like on the http server:

gpu ~/src/deno> strace -c out/before/deno tests/http_bench.ts --allow-net
Listening on 127.0.0.1:4500
^Cstrace: Process 21364 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.035290           0    381195       171 futex
  0.00    0.000000           0        18           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        14           open
  0.00    0.000000           0        25           close
  0.00    0.000000           0         2           stat
  0.00    0.000000           0        13           fstat
  0.00    0.000000           0       145           mmap
  0.00    0.000000           0       712           mprotect
  0.00    0.000000           0       112           munmap
  0.00    0.000000           0         2           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         1 ioctl
  0.00    0.000000           0         9         9 access
  0.00    0.000000           0       334        10 madvise
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         3           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         2         2 mkdir
  0.00    0.000000           0         1         1 readlink
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         2           sigaltstack
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        12           epoll_ctl
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         1           pipe2
  0.00    0.000000           0         3           getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    0.035290                382632       194 total

gpu ~/src/deno> strace -c out/after/deno tests/http_bench.ts --allow-net
Listening on 127.0.0.1:4500
^Cstrace: Process 21334 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.029291           0    467387      7144 futex
  0.00    0.000000           0        18           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        14           open
  0.00    0.000000           0        25           close
  0.00    0.000000           0         2           stat
  0.00    0.000000           0        13           fstat
  0.00    0.000000           0       146           mmap
  0.00    0.000000           0       723           mprotect
  0.00    0.000000           0       115           munmap
  0.00    0.000000           0         2           brk
  0.00    0.000000           0         5           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         1 ioctl
  0.00    0.000000           0         9         9 access
  0.00    0.000000           0       137        10 madvise
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         1           setsockopt
  0.00    0.000000           0         3           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2           fcntl
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         2         2 mkdir
  0.00    0.000000           0         1         1 readlink
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         2           sigaltstack
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3           sched_getaffinity
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        12           epoll_ctl
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           epoll_create1
  0.00    0.000000           0         1           pipe2
  0.00    0.000000           0         3           getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    0.029291                468642      7167 total

@kevinkassimo
Copy link
Contributor

kevinkassimo commented Oct 17, 2018

Probably we need a benchmark dedicated to futex. Looks pretty bad

@ztplz
Copy link
Contributor

ztplz commented Oct 17, 2018

@ry
Copy link
Member Author

ry commented Oct 17, 2018

@kevinkassimo futex is most likely from the std::sync::mpsc we use in the main thread. It looks bad, and I'm not sure it's the best design, but I doubt this is the bottleneck. (I'd rather do epoll the main thread - but it will require making a custom Tokio runtime.)
Likely the bottleneck is flatbuffer serialization on the JS side. It allocates a new ArrayBuffer for each message, and this patch makes each poll serialize a message. I'm going to try to validate this hypothesis with the profiler before trying to fix it. More updates soon...

@ry
Copy link
Member Author

ry commented Oct 17, 2018

FYI Here's how I am profiling:

# Make sure we're only building release.
export DENO_BUILD_MODE=release
# Build deno and V8's d8.
./tools/build.py d8 deno
# Start the program we want to benchmark with --prof
./out/release/deno tests/http_bench.ts --allow-net --prof &
# Exercise it.
third_party/wrk/linux/wrk http://localhost:4500/
kill `pgrep deno`
# When supplying --prof, V8 will write a file in the current directory that looks 
# like this isolate-0x7fad98242400-v8.log
# To examine this file
D8_PATH=out/release/ ./third_party/v8/tools/linux-tick-processor isolate-0x7fad98242400-v8.log

Here is are the outputs for before and after this patch: https://gist.github.com/ry/4b9d6f62b60ac8d7324ea624b7bd26e6

@ry
Copy link
Member Author

ry commented Oct 17, 2018

Counterintuitively this patch makes the program spend more time in C++. Look at the summary:

# BEFORE
 [Summary]:
   ticks  total  nonlib   name
   1780   20.3%   21.1%  JavaScript
   6606   75.3%   78.2%  C++
    373    4.3%    4.4%  GC
    325    3.7%          Shared libraries
     62    0.7%          Unaccounted

# AFTER
 [Summary]:
   ticks  total  nonlib   name
    985    9.6%    9.8%  JavaScript
   8981   87.1%   89.7%  C++
    391    3.8%    3.9%  GC
    296    2.9%          Shared libraries
     46    0.4%          Unaccounted

This might be because V8 is able to optimize better when the event loop is in JS.

Examining where time is spent in C++

# BEFORE
 [C++ entry points]:
   ticks    cpp   total   name
    895   19.2%   10.2%  v8::internal::Builtin_ArrayBufferConstructor(int, v8::internal::Object**, v8::internal::Isolate*)
    569   12.2%    6.5%  __pthread_cond_signal
    262    5.6%    3.0%  __lll_lock_wait
    252    5.4%    2.9%  Builtins_TypedArrayPrototypeSubArray

# AFTER
 [C++ entry points]:
   ticks    cpp   total   name
   3344   37.9%   32.4%  __pthread_cond_wait
    863    9.8%    8.4%  v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*)
    789    8.9%    7.7%  __pthread_cond_signal
    637    7.2%    6.2%  v8::internal::Builtin_ArrayBufferConstructor(int, v8::internal::Object**, v8::internal::Isolate*)
    259    2.9%    2.5%  Builtins_MapPrototypeSet

32% of the time is spent in __pthread_cond_wait after the patch. Surely this is the std::sync::mpsc in the main thread waiting for a response. A ridiculous place to spend time when we're performing non-blocking reads/writes. The read/writes should be performed on the main thread - there does not need to be cross-thread communication in this example. We need to move back to tokio::runtime::current_thread.

In both profiles v8::internal::Builtin_ArrayBufferConstructor is called very often. This from flatbuffer.Builder. I think a patch to adjust flatbuffer.Builder to reuse the same ArrayBuffer would alleviate this. Update: I've landed these optimizations in #1009 and #1007

@ry ry mentioned this pull request Oct 17, 2018
@ry
Copy link
Member Author

ry commented Oct 18, 2018

Here is what an strace on tests/http_bench.ts looks like currently (on master)
https://gist.github.com/ry/9f061a4daf0f76e218c191714a96561e
takeaway: we need to be doing the accept/recvfrom/sendfrom on the main thread. far far too many futex calls for what we're doing.
mprotect is V8. I don't think we can change that.

@ry
Copy link
Member Author

ry commented Oct 18, 2018

Here's a little script I'm using, in case this helps others look into this stuff.

> cat profile_http.sh
#!/bin/bash
deno_exe=$1
if [[ ! -x "$deno_exe" ]]; then
  echo $deno_exe is not executable.
  exit 1
fi
$deno_exe tests/http_bench.ts --allow-net --prof &
sleep 1
third_party/wrk/linux/wrk http://localhost:4500/
kill `pgrep deno`

libdeno.runMicrotasks();
}
}
return promiseErrorExaminer();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, does this mean that we will only handle promise uncaught rejects when deno is about to exit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just what passes the tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think with this impl deno would not promptly die on uncaught errors after running to the end.

Which one was the original failing test?
(quite busy in the coming weeks but might as well take a look)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/async_error.ts

}

export function eventLoop(): boolean {
for (;;) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always liked while (true) better for infinite loops, feels more intentional to me, but...

@ry
Copy link
Member Author

ry commented Oct 25, 2018

Closing this one. It resulted in profiling and ongoing perf fixes. The ultimate goal is to move the loop to JS - but we're not ready to do it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants