
benchmarking script to time the different asyncio write options #2179

Merged
merged 1 commit into aio-libs:master on Oct 19, 2019

Conversation

arthurdarcet
Contributor

As discussed in #2126, I tried putting together a small benchmarking script to test the different write options in http_writer and their impact on different workloads.

The script is in the PR (just a convenient way to link the file; I'm not sure at all it should be merged).
My results are below. I'm running a _UnixSelectorEventLoop: it would be interesting to get results from other configs.

The use cases I used are:

  • No body (only a few bytes representing the headers)
  • A very small body (10B) in 2 chunks (headers + body)
  • A 1KB body, in 2 or 10 chunks
  • A 1MB body, in 2 or 10 chunks
  • A 1GB body, in 2 or 10 chunks

The really useful numbers here are the 2-chunk cases: this is what happens when a server writes the headers first, then the body in one go.

The current implementation waits to join the headers with the first chunk of the body, so it corresponds to the b''.join lines.
With #2126, the headers would be sent in a separate Transport.write call, so that corresponds to the multiple writes lines.
The bytearray lines are just for comparison, to see where it would be preferable to b''.join. The three strategies are sketched below.
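For readers who don't want to open the script, the three strategies being compared boil down to something like this (a simplified sketch of my own, not the actual script in this PR; the DummyTransport and the example sizes are purely illustrative, and a no-op transport obviously won't show the syscall cost):

```python
import time

class DummyTransport:
    """Stand-in for asyncio.Transport that just swallows writes."""
    def write(self, data):
        pass  # a real run would write to a socket or a pipe

def join_write(transport, headers, chunks):
    # "b''.join" lines: concatenate everything, one Transport.write call
    transport.write(b''.join([headers] + chunks))

def bytearray_write(transport, headers, chunks):
    # "bytearray" lines: accumulate into a mutable buffer, then write once
    buf = bytearray(headers)
    for chunk in chunks:
        buf += chunk
    transport.write(bytes(buf))

def multiple_writes(transport, headers, chunks):
    # "multiple writes" lines (#2126): one Transport.write call per chunk
    transport.write(headers)
    for chunk in chunks:
        transport.write(chunk)

def bench(fn, headers, chunks, loops=10000):
    transport = DummyTransport()
    start = time.perf_counter()
    for _ in range(loops):
        fn(transport, headers, chunks)
    return (time.perf_counter() - start) / loops

# Example: a 1KB body split into 2 chunks after the headers
headers = b'HTTP/1.1 200 OK\r\nContent-Length: 1024\r\n\r\n'
chunks = [b'x' * 512, b'x' * 512]
for fn in (join_write, bytearray_write, multiple_writes):
    print(fn.__name__, bench(fn, headers, chunks))
```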

Lastly, the "10 chunks" lines are just to see how fast performance degrades. If I understand things correctly, the http_writer.PayloadWriter class only does buffering until the transport is set. So, leaving aside the headers chunk, we only get multiple chunks here when HTTP pipelining is happening and the user application does a lot of separate writes in the handler. So imo, the only use case that really matters is 2 chunks: one for the headers, and one for the body. @fafhrd91 can you confirm this?

Here is what I take from those results:

  • With a body size below 1MB, the tests are inconclusive; the std dev is way too high. I think we can assume the second syscall does not impact the results in any way, but I might be missing something.
  • Above 1MB, the multiple writes option is of course faster, since it avoids a ~large copy.
  • The 10-chunk lines show that the multiple writes option is slower until the body exceeds 1MB. For smaller bodies, this (very specific) use case is 3x slower with #2126 (Slow request body copy) than with the current implementation.
| Size / chunks | Write option    | Mean     | Std dev  | Loops | Variation |
|---------------|-----------------|----------|----------|-------|-----------|
| 0B / 0        | b''.join        | 5.45µs   | 7.94µs   | 36706 |           |
|               | bytearray       | 5.60µs   | 5.87µs   | 35695 | 2.83%     |
|               | multiple writes | 4.55µs   | 5.88µs   | 43916 | -16.42%   |
| 10B / 1       | b''.join        | 4.55µs   | 7.44µs   | 43920 |           |
|               | bytearray       | 5.76µs   | 8.92µs   | 34750 | 26.39%    |
|               | multiple writes | 6.54µs   | 8.94µs   | 30564 | 43.70%    |
| 1KB / 1       | b''.join        | 6.65µs   | 26.31µs  | 30092 |           |
|               | bytearray       | 7.91µs   | 24.50µs  | 25283 | 19.02%    |
|               | multiple writes | 7.45µs   | 7.64µs   | 26835 | 12.14%    |
| 1MB / 1       | b''.join        | 1.18ms   | 616.71µs | 170   |           |
|               | bytearray       | 1.15ms   | 503.26µs | 175   | -3.10%    |
|               | multiple writes | 984.20µs | 441.12µs | 204   | -16.76%   |
| 1GB / 1       | b''.join        | 2.25s    | 0        | 1     |           |
|               | bytearray       | 1.80s    | 0        | 1     | -19.76%   |
|               | multiple writes | 665.09ms | 0        | 1     | -70.42%   |
| 6GB / 1       | b''.join        | 21.27s   | 0        | 1     |           |
|               | bytearray       | 19.76s   | 0        | 1     | -7.09%    |
|               | multiple writes | 8.87s    | 0        | 1     | -58.30%   |
| 10B / 10      | b''.join        | 5.60µs   | 8.49µs   | 35704 |           |
|               | bytearray       | 7.32µs   | 24.33µs  | 27317 | 30.71%    |
|               | multiple writes | 22.90µs  | 4.34µs   | 8733  | 308.88%   |
| 1KB / 10      | b''.join        | 6.84µs   | 8.46µs   | 29240 |           |
|               | bytearray       | 9.92µs   | 8.33µs   | 20160 | 45.04%    |
|               | multiple writes | 25.34µs  | 21.29µs  | 7894  | 270.46%   |
| 1MB / 10      | b''.join        | 1.17ms   | 619.23µs | 171   |           |
|               | bytearray       | 1.79ms   | 540.51µs | 112   | 52.41%    |
|               | multiple writes | 1.05ms   | 441.37µs | 192   | -10.81%   |
| 1GB / 10      | b''.join        | 3.25s    | 0        | 1     |           |
|               | bytearray       | 1.93s    | 0        | 1     | -40.65%   |
|               | multiple writes | 689.35ms | 0        | 1     | -78.77%   |
| 6GB / 5       | b''.join        | 22.15s   | 0        | 1     |           |
|               | bytearray       | 19.55s   | 0        | 1     | -11.77%   |
|               | multiple writes | 8.59s    | 0        | 1     | -61.22%   |
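(How to read the Variation column: it appears to be the relative change in mean time against the b''.join baseline for the same size/chunk count, i.e. (mean − join_mean) / join_mean. For example, for 10B / 1: (5.76µs − 4.55µs) / 4.55µs ≈ 26.6%; the small discrepancy with the 26.39% shown comes from rounding of the displayed means.)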

@codecov-io

codecov-io commented Aug 8, 2017

Codecov Report

Merging #2179 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2179   +/-   ##
=======================================
  Coverage   97.12%   97.12%           
=======================================
  Files          39       39           
  Lines        7884     7884           
  Branches     1366     1366           
=======================================
  Hits         7657     7657           
  Misses        101      101           
  Partials      126      126


@fafhrd91
Member

fafhrd91 commented Aug 8, 2017

That is awesome! We can optimize the writer according to these benchmarks.

Btw, I have similar results for async-tokio, and it is one of the reasons why uvloop is faster.

@fafhrd91
Member

fafhrd91 commented Aug 8, 2017 via email

@kxepal
Member

kxepal commented Nov 6, 2017

Interesting. So, ideally, we can choose the best algorithm depending on data size to avoid unwanted overhead, right?

@arthurdarcet
Contributor Author

In theory, probably. But in practice, imo, it would make the code way too complicated for almost no improvement.

The use cases are as follows:

  1. Transport is set, headers are sent, body is sent (in one chunk).
  2. Headers are sent, transport is set, body is sent.
  3. Headers are sent, body is sent, transport is set.

Use case 3 is probably way more rare than the other two.
And the body can range from 0 bytes to …

So [everywhere in this I am using "body" to really mean "first chunk of the body written by the user", but I think in most cases it's actually the only chunk, so the difference does not matter]:

  1. The current implementation always buffers the headers, then waits for the body, then joins the headers and the body and waits for the transport, then sends everything in one syscall.

  2. The simplest solution is buffering headers and body until the transport is set, and writing each with a different syscall (either as soon as they are sent or when the transport is set, depending on which comes first). It keeps the code in PayloadWriter very simple, but it does mean doing two syscalls for every use case, which probably wastes a few nano/microseconds (we could be buffering instead, b''.join(…)-ing the buffer, and calling transport.write only once).

  3. We could also always buffer the headers until both the body is sent and the transport is set; then, if the body is small enough, we join the headers with the body Python-side and do only one syscall, or we do two syscalls directly (a sketch follows below).
    This option feels a bit too much like bypassing the asyncio/kernel optimisations to me: how much data should be buffered/sent in one chunk should be decided by the kernel imo.

Option 1 is the current implementation; option 2 is #2126; option 3 could be added to #2126 if you think it's a necessary optimisation.
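For completeness, option 3 would boil down to something like this (a rough sketch only; the threshold value and the flush helper are purely illustrative and not the actual PayloadWriter code):

```python
SMALL_BODY_THRESHOLD = 64 * 1024  # illustrative cut-off, would need benchmarking

def flush(transport, headers, body):
    """Called once both the first body chunk and the transport are available."""
    if len(body) <= SMALL_BODY_THRESHOLD:
        # small body: pay the Python-side copy to save the second syscall
        transport.write(headers + body)
    else:
        # large body: two syscalls, but no large copy
        transport.write(headers)
        transport.write(body)
```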

@webknjaz webknjaz closed this Jul 29, 2019
@webknjaz webknjaz reopened this Jul 29, 2019
@asvetlov asvetlov merged commit b3cab9e into aio-libs:master Oct 19, 2019
@asvetlov
Member

Worth having it; the benchmark is still relevant.
