
benchmarking script to time the different asyncio write options #2179

Merged
merged 1 commit into aio-libs:master on Oct 19, 2019

Conversation

arthurdarcet
Contributor

As discussed in #2126, I tried putting together a small benchmarking script to test the different write options in http_writer and their impact on different workloads.

The script is in the PR (just a convenient way to link the file; I'm not sure at all it should be merged).
My results are below. I'm running a _UnixSelectorEventLoop: it would be interesting to get results from other configs.

The use cases I used are:

  • No body (only a few bytes representing the headers)
  • A very small body (10B) in 2 chunks (headers + body)
  • A 1KB body, in 2 or 10 chunks
  • A 1MB body, in 2 or 10 chunks
  • A 1GB body, in 2 or 10 chunks

The really useful numbers here are the 2-chunk cases: this is what happens when a server writes the headers first, then the body in one go.

The current implementation waits to join the headers with the first chunk of the body, so it corresponds to the b''.join lines.
With #2126, the headers would be sent in a separate Transport.write call, so that corresponds to the multiple writes lines.
The bytearray lines are just for comparison, to see where it would be preferable to b''.join. The three strategies are sketched below.
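For readers who don't want to open the script, the three strategies being compared boil down to something like this (a simplified sketch of my own, not the actual script in this PR; the DummyTransport and the example sizes are purely illustrative, and a no-op transport obviously won't show the syscall cost):

```python
import time

class DummyTransport:
    """Stand-in for asyncio.Transport that just swallows writes."""
    def write(self, data):
        pass  # a real run would write to a socket or a pipe

def join_write(transport, headers, chunks):
    # "b''.join" lines: concatenate everything, one Transport.write call
    transport.write(b''.join([headers] + chunks))

def bytearray_write(transport, headers, chunks):
    # "bytearray" lines: accumulate into a mutable buffer, then write once
    buf = bytearray(headers)
    for chunk in chunks:
        buf += chunk
    transport.write(bytes(buf))

def multiple_writes(transport, headers, chunks):
    # "multiple writes" lines (#2126): one Transport.write call per chunk
    transport.write(headers)
    for chunk in chunks:
        transport.write(chunk)

def bench(fn, headers, chunks, loops=10000):
    transport = DummyTransport()
    start = time.perf_counter()
    for _ in range(loops):
        fn(transport, headers, chunks)
    return (time.perf_counter() - start) / loops

# Example: a 1KB body split into 2 chunks after the headers
headers = b'HTTP/1.1 200 OK\r\nContent-Length: 1024\r\n\r\n'
chunks = [b'x' * 512, b'x' * 512]
for fn in (join_write, bytearray_write, multiple_writes):
    print(fn.__name__, bench(fn, headers, chunks))
```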

Lastly, the "10 chunks" lines are just to see how fast performance degrades. If I understand things correctly, the http_writer.PayloadWriter class only does buffering until the transport is set. So, leaving aside the headers chunk, we only get multiple chunks here when HTTP pipelining is happening and the user application does a lot of separate writes in the handler. So imo, the only use case that really matters is 2 chunks: one for the headers, and one for the body. @fafhrd91 can you confirm this?

Here is what I take from those results:

  • With a body size below 1MB, the tests are inconclusive; the std dev is way too high. I think we can assume the second syscall does not impact the results in any way, but I might be missing something.
  • Above 1MB, the multiple writes option is of course faster, since it avoids a ~large copy.
  • The 10-chunk lines show that the multiple writes option is slower until the body exceeds 1MB. For smaller bodies, this (very specific) use case is 3x slower with #2126 (Slow request body copy) than with the current implementation.
| Size / chunks | Write option    | Mean     | Std dev  | Loops | Variation |
|---------------|-----------------|----------|----------|-------|-----------|
| 0B / 0        | b''.join        | 5.45µs   | 7.94µs   | 36706 |           |
|               | bytearray       | 5.60µs   | 5.87µs   | 35695 | 2.83%     |
|               | multiple writes | 4.55µs   | 5.88µs   | 43916 | -16.42%   |
| 10B / 1       | b''.join        | 4.55µs   | 7.44µs   | 43920 |           |
|               | bytearray       | 5.76µs   | 8.92µs   | 34750 | 26.39%    |
|               | multiple writes | 6.54µs   | 8.94µs   | 30564 | 43.70%    |
| 1KB / 1       | b''.join        | 6.65µs   | 26.31µs  | 30092 |           |
|               | bytearray       | 7.91µs   | 24.50µs  | 25283 | 19.02%    |
|               | multiple writes | 7.45µs   | 7.64µs   | 26835 | 12.14%    |
| 1MB / 1       | b''.join        | 1.18ms   | 616.71µs | 170   |           |
|               | bytearray       | 1.15ms   | 503.26µs | 175   | -3.10%    |
|               | multiple writes | 984.20µs | 441.12µs | 204   | -16.76%   |
| 1GB / 1       | b''.join        | 2.25s    | 0        | 1     |           |
|               | bytearray       | 1.80s    | 0        | 1     | -19.76%   |
|               | multiple writes | 665.09ms | 0        | 1     | -70.42%   |
| 6GB / 1       | b''.join        | 21.27s   | 0        | 1     |           |
|               | bytearray       | 19.76s   | 0        | 1     | -7.09%    |
|               | multiple writes | 8.87s    | 0        | 1     | -58.30%   |
| 10B / 10      | b''.join        | 5.60µs   | 8.49µs   | 35704 |           |
|               | bytearray       | 7.32µs   | 24.33µs  | 27317 | 30.71%    |
|               | multiple writes | 22.90µs  | 4.34µs   | 8733  | 308.88%   |
| 1KB / 10      | b''.join        | 6.84µs   | 8.46µs   | 29240 |           |
|               | bytearray       | 9.92µs   | 8.33µs   | 20160 | 45.04%    |
|               | multiple writes | 25.34µs  | 21.29µs  | 7894  | 270.46%   |
| 1MB / 10      | b''.join        | 1.17ms   | 619.23µs | 171   |           |
|               | bytearray       | 1.79ms   | 540.51µs | 112   | 52.41%    |
|               | multiple writes | 1.05ms   | 441.37µs | 192   | -10.81%   |
| 1GB / 10      | b''.join        | 3.25s    | 0        | 1     |           |
|               | bytearray       | 1.93s    | 0        | 1     | -40.65%   |
|               | multiple writes | 689.35ms | 0        | 1     | -78.77%   |
| 6GB / 5       | b''.join        | 22.15s   | 0        | 1     |           |
|               | bytearray       | 19.55s   | 0        | 1     | -11.77%   |
|               | multiple writes | 8.59s    | 0        | 1     | -61.22%   |
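(How to read the Variation column: it appears to be the relative change in mean time against the b''.join baseline for the same size/chunk count, i.e. (mean − join_mean) / join_mean. For example, for 10B / 1: (5.76µs − 4.55µs) / 4.55µs ≈ 26.6%; the small discrepancy with the 26.39% shown comes from rounding of the displayed means.)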

@codecov-io

codecov-io commented Aug 8, 2017

Codecov Report

Merging #2179 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2179   +/-   ##
=======================================
  Coverage   97.12%   97.12%           
=======================================
  Files          39       39           
  Lines        7884     7884           
  Branches     1366     1366           
=======================================
  Hits         7657     7657           
  Misses        101      101           
  Partials      126      126


@fafhrd91
Member

fafhrd91 commented Aug 8, 2017

That is awesome! We can optimize the writer according to these benchmarks.

Btw, I have similar results for async-tokio, and it is one of the reasons why uvloop is faster.

@fafhrd91
Member

fafhrd91 commented Aug 8, 2017 via email

@kxepal
Member

kxepal commented Nov 6, 2017

Interesting. So, ideally, we can choose the best algorithm depending on data size to avoid unwanted overhead, right?

@arthurdarcet
Contributor Author

In theory, probably. But in practice, imo, it would make the code way too complicated for almost no improvement.

The use cases are as follows:

  1. Transport is set, headers are sent, body is sent (in one chunk).
  2. Headers are sent, transport is set, body is sent.
  3. Headers are sent, body is sent, transport is set.

Use case 3 is probably way more rare than the other two.
And the body can range from 0 bytes to …

So [everywhere in this I am using "body" to really mean "first chunk of the body written by the user", but I think in most cases it's actually the only chunk, so the difference does not matter]:

  1. The current implementation always buffers the headers, then waits for the body, then joins the headers and the body and waits for the transport, then sends everything in one syscall.

  2. The simplest solution is buffering headers and body until the transport is set, and writing each with a different syscall (either as soon as they are sent or when the transport is set, depending on which comes first). It keeps the code in PayloadWriter very simple, but it does mean doing two syscalls for every use case, which probably wastes a few nano/microseconds (we could be buffering instead, b''.join(…)-ing the buffer, and calling transport.write only once).

  3. We could also always buffer the headers until both the body is sent and the transport is set; then, if the body is small enough, we join the headers with the body Python-side and do only one syscall, or we do two syscalls directly (a sketch follows below).
    This option feels a bit too much like bypassing the asyncio/kernel optimisations to me: how much data should be buffered/sent in one chunk should be decided by the kernel imo.

Option 1 is the current implementation; option 2 is #2126; option 3 could be added to #2126 if you think it's a necessary optimisation.
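For completeness, option 3 would boil down to something like this (a rough sketch only; the threshold value and the flush helper are purely illustrative and not the actual PayloadWriter code):

```python
SMALL_BODY_THRESHOLD = 64 * 1024  # illustrative cut-off, would need benchmarking

def flush(transport, headers, body):
    """Called once both the first body chunk and the transport are available."""
    if len(body) <= SMALL_BODY_THRESHOLD:
        # small body: pay the Python-side copy to save the second syscall
        transport.write(headers + body)
    else:
        # large body: two syscalls, but no large copy
        transport.write(headers)
        transport.write(body)
```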

@webknjaz webknjaz closed this Jul 29, 2019
@webknjaz webknjaz reopened this Jul 29, 2019
@asvetlov asvetlov merged commit b3cab9e into aio-libs:master Oct 19, 2019
@asvetlov
Member

Worth having it; the benchmark is still relevant.
