
Summary of changes in this PR: #1034

Merged 6 commits on Apr 27, 2023
Conversation

@quinnj (Member) commented Apr 11, 2023

Summary of changes:

* Finally achieving my dream of moving the connection pool out of HTTP; it's going to live in the [ConcurrentUtilities.jl](JuliaServices/ConcurrentUtilities.jl#18) package instead. In short, it had no unit tests, scant documentation, and was generally a pain to deal with in HTTP. We also discovered at least 2 major issues with the current implementation during a deep dive into performance and issue diagnosis, including:
  * If a `ConnectError` was thrown while creating a connection, a "permit" to the pool was permanently lost; get enough of them and the entire connection pool grinds to a halt :grimace:
  * Being a threadsafe structure, the pool had the poor choice of holding the pool lock for the duration of trying to get a permit _and making new connections_. This meant that in the case of a connection taking a long time to be made, we were essentially holding the rest of the pool hostage. This is totally unnecessary and can cause big performance issues in really heavy workloads where there's lots of contention on the pool for managing requests.

  The new implementation in ConcurrentUtilities.jl solves both these problems while being about 1/4th the LOC of the previous implementation. And it has unit tests! yay! All in all, I think #1033 (connection_limit parameter ignored after request is called for the first time) and #1032 (Investigate single-threaded, async HTTP.request performance) should both be mostly resolved by these changes/updates.
* Relatedly, we're adjusting the API for connection pools to allow the user to pass in their _own_ connection pool to be used for that request (to check for a connection to reuse and to return the connection to after completion); see the first sketch after this list. A pool can be constructed like `HTTP.Pool(; max::Int)` and passed to any of the `HTTP.request` methods like `HTTP.get(...; pool=pool)`. HTTP has its own global default pool `HTTP.Connections.POOL` that it uses by default to manage connection reuse. `HTTP.set_default_connection_limit!` will still work as long as it is called before any requests are made; calling it _after_ requests have been made will be a no-op. The `connection_limit` keyword arg is now formally deprecated and will issue a warning if passed. I'm comfortable with a full deprecation here because it turns out it wasn't even really working before anyway (unless it was passed/used on _every_ request and never changed). So instead of "changing" things, we're really just doing a proper implementation that now actually works, has better behavior, and is actually controllable by the user.
* Added a try-finally in `keepalive!` around our global IO lock usage, just for good housekeeping.
* Refactored `try_with_timeout` to use a `Channel` instead of the non-threaded `@async`; it's much simpler and seems cleaner (a minimal sketch of the pattern follows this list).
* Refactored a few of the stream IO functions so that we always know the number of bytes downloaded, whether in memory or written to an IO, so we can log them and use them in verbose logging to give bit-rate calculations.
* Added a new `logerrors::Bool=false` keyword arg that allows doing `@error` logs on errors that may otherwise be "swallowed" when doing retries; it can be helpful to sometimes be able to at least see what kinds of errors are happening. Also cleaned up our error handling in general so we don't lose backtraces, which fixes #1003 (Store original error catch_backtrace() in RequestError).
* Added lots of metrics around time spent in various layers, read vs. write durations, etc. These can be enabled, and stored in the request context, by passing `observelayers=true`. This mostly resolves #1025 (Establish connection / TTFB / DNS lookup time information) and #1019 (Report number of retries when throwing StatusError that involved them).
* Fixed some missing keyword args that either weren't correct in the inline docs or weren't included in the client.md docs.
* Removed a few client-side layers (BasicAuth, DebugRequest, Canonicalize, etc.) since their implementations were _really_ trivial, and moved their functionality into a single HeadersRequest layer where we now do all the header-related work during the request. This has the effect of _drastically_ reducing the exploding backtraces we now have when doing requests because of our "functional style" client-side layering.
* Fixed #612 (TCP keepalive is not respected when re-using connections) by ensuring `keepalive` is included in our `connectionkey` computation, so only connections that specified `keepalive` will be reused if it's passed the same value on subsequent requests.
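To make the new pool API concrete, here is a minimal sketch based on the API described above (the URL and the `max` value are illustrative, not from the PR):

```julia
using HTTP

# A dedicated pool; `max` caps how many connections this pool manages
# concurrently (10 is an arbitrary example value).
pool = HTTP.Pool(; max = 10)

# Pass the pool to any request method: a reusable connection is checked out
# of (and returned to) this pool instead of the global HTTP.Connections.POOL.
# logerrors/observelayers are the new keyword args described above.
resp = HTTP.get("https://example.com"; pool = pool, logerrors = true, observelayers = true)

# Per the PR, with observelayers=true the timing metrics are stored in the
# request context.
@show resp.request.context
```

And a minimal sketch of the Channel-based timeout idea (an illustration of the pattern only; the function name, error type, and capacity-2 channel are stand-ins, not HTTP.jl's actual `try_with_timeout`):

```julia
# Race the work against a timer: whichever finishes first puts a value on
# the channel, and take! returns it. Capacity 2 means the loser's put!
# never blocks.
function try_with_timeout_sketch(f, timeout_s)
    ch = Channel{Any}(2)
    Threads.@spawn try
        put!(ch, f())
    catch e
        # wrap so the original backtrace isn't lost
        put!(ch, CapturedException(e, catch_backtrace()))
    end
    timer = Timer(_ -> put!(ch, ErrorException("timed out after $(timeout_s)s")), timeout_s)
    try
        res = take!(ch)
        res isa Exception && throw(res)
        return res
    finally
        close(timer)  # cancel the timer if the work won the race
    end
end

try_with_timeout_sketch(() -> (sleep(0.1); :done), 1.0)  # => :done
```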

@nickrobinson251 (Collaborator) commented:

The test1mb benchmark downloads 5gb of data via 1mb files; test64 is 80 64mb files, test256 is 20 256mb files, test1 is 5 1gb files, and test5 is 1 5gb file.

Looking at the numbers in the Notion doc (taking the peak reported Gbps for the sake of having a single-number comparison), it looks like MbedTLS might be slowed down by these changes (on test64, test256, test1) -- any ideas why that might be?

| test | MbedTLS-pre (Gbps) | MbedTLS-post (Gbps) | Relative change |
|------|--------------------|---------------------|-----------------|
| 1mb  | 5.033982621 | 6.555191426 | 1.302187934  |
| 64   | 5.533806929 | 5.144007198 | 0.9295602944 |
| 256  | 5.823905782 | 4.068797269 | 0.6986372069 |
| 1    | 5.441129421 | 4.570638241 | 0.8400164538 |
| 5    | 3.975655472 | 5.974447343 | 1.502757818  |

| test | OpenSSL-pre (Gbps) | OpenSSL-post (Gbps) | Relative change |
|------|--------------------|---------------------|-----------------|
| 1mb  | 7.593859359 | 9.984355663 | 1.314793334 |
| 64   | 10.35584208 | 11.54993619 | 1.115306327 |
| 256  | 9.672000774 | 11.38522636 | 1.177132491 |
| 1    | 8.29096817  | 11.18302301 | 1.348819918 |
| 5    | 5.380764443 | 10.73849381 | 1.995718995 |

@codecov-commenter commented Apr 11, 2023

Codecov Report

Merging #1034 (309f8dd) into master (75c1b25) will increase coverage by 0.04%.
The diff coverage is 82.35%.

@@            Coverage Diff             @@
##           master    #1034      +/-   ##
==========================================
+ Coverage   85.43%   85.48%   +0.04%     
==========================================
  Files          37       32       -5     
  Lines        3063     3018      -45     
==========================================
- Hits         2617     2580      -37     
+ Misses        446      438       -8     
| Impacted Files | Coverage Δ |
|---|---|
| src/Messages.jl | 85.47% <ø> (ø) |
| src/Servers.jl | 80.20% <ø> (ø) |
| src/WebSockets.jl | 87.82% <ø> (ø) |
| src/HTTP.jl | 67.05% <56.25%> (-2.95%) ⬇️ |
| src/clientlayers/HeadersRequest.jl | 85.36% <66.66%> (ø) |
| src/Connections.jl | 84.98% <77.94%> (ø) |
| src/clientlayers/ConnectionRequest.jl | 79.78% <80.00%> (+0.77%) ⬆️ |
| src/clientlayers/ExceptionRequest.jl | 90.90% <87.50%> (-9.10%) ⬇️ |
| src/clientlayers/TimeoutRequest.jl | 92.30% <90.90%> (-7.70%) ⬇️ |
| src/clientlayers/StreamRequest.jl | 96.55% <93.54%> (+1.61%) ⬆️ |

... and 6 more


@Drvi (Collaborator) left a comment:

Just a couple of notes for now; I think this looks good so far! I think the change to use LIFO in the connection pools will serve us really well under heavy loads, but I wonder if there are circumstances in which it will struggle a bit, with some connections getting stale/underutilized. Not an expert in this area by any means...

Review threads on src/clientlayers/RetryRequest.jl, src/clientlayers/StreamRequest.jl, and src/connectionpools.jl (resolved).
@Drvi (Collaborator) left a comment:

I think this looks good :) But I'd like to review it again once we get the CI green and the ConcurrentUtilities.jl dependency updated.

Review threads on src/Connections.jl, src/HTTP.jl, and src/Streams.jl (resolved).
@quinnj (Member, Author) commented Apr 27, 2023

Ok, I've done a bunch of additional benchmarking, comparing lots of different variables. As a high-level summary, the net effect of this (+ other webstack PRs: OpenSSL, CloudBase, CloudStore, ConcurrentUtilities) is:

(benchmark summary image)

with more details broken out by specific cloud and file sizes:

(per-cloud and per-file-size benchmark image)

So all in all, I'm feeling confident in how all these changes have settled.

@nickrobinson251 (Collaborator) left a comment:

lovely stuff -- left some mostly cosmetic comments

Review threads on src/Connections.jl (resolved).

On the `Pool` docstring, which changed from "A dict of global connection pools keeping track of active connections, split by their IO type." to "Connection pool for managing the reuse of HTTP connections.":
Collaborator:

should we maybe name this `ConnectionPool` (rather than `Pool`)?

@quinnj (Member, Author):

It's a good thought, but this module actually used to be named ConnectionPool, and I changed it to Connections. I was then worried about possible breakage if people were using the HTTP.ConnectionPool submodule, so I put back an alias `const ConnectionPool = Connections`, since a JuliaHub search showed that there are some packages out there directly referencing it.

Collaborator:

oh i meant the type, not the module

@quinnj (Member, Author):

Well, that's the problem: the module is already named ConnectionPool, so we can't name the type that.

Review threads on docs/src/client.md, src/HTTP.jl, src/clientlayers/MessageRequest.jl, and src/clientlayers/StreamRequest.jl (resolved).
@Drvi (Collaborator) left a comment:

Exciting improvement! Any results to share about the PrefetchedDownloadStream?

Comment on lines 167 to +170 (the `retry_delay` keyword doc was renamed to `retry_delays`):

- `retry = true`, retry idempotent requests in case of error.
- `retries = 4`, number of times to retry.
- `retry_non_idempotent = false`, retry non-idempotent requests too. e.g. POST.
- `retry_delays = ExponentialBackOff(n = retries)`, provide a custom `ExponentialBackOff` object to control the delay between retries.
Collaborator:

Do we want to force `length(retry_delays) == retries`?

@quinnj (Member, Author):

Yeah, that sounds like a good idea. Or if you provide `retry_delays`, then that will override the `retries` keyword arg?
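For reference, a sketch of how that would look from the caller's side (`ExponentialBackOff` is the Base iterator of delay values; the URL and parameter values here are illustrative):

```julia
using HTTP

# Base.ExponentialBackOff yields n delays (in seconds), growing by `factor`
# from first_delay up to max_delay, with multiplicative jitter.
delays = Base.ExponentialBackOff(n = 4, first_delay = 0.25, max_delay = 10.0, factor = 2.0)

# Hypothetically, length(delays) == 4 would then determine (or be validated
# against) the number of retries.
resp = HTTP.get("https://example.com"; retry = true, retry_delays = delays)
```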

Quoted diff context:

    req.context[:write_duration_ms] = get(req.context, :write_duration_ms, 0.0) + ((time() - write_start) * 1000)
    end
    read_start = time()
    @async try
Collaborator:

I assume the @asyncs here have to do with today's discussion about OpenSSL's thread-pinning requirement. Since we already specialize on the type of IO in the Stream, do we want to special-case the IO types that don't need @async here and could migrate?

Anyway, a comment about the @async usage would be helpful.

@quinnj (Member, Author):

Well, the @asyncs here are actually harmless; they're just meant to allow the writing and reading to happen simultaneously. It's a good question, though, whether these should be @spawn, or whether that makes a difference if we're already in an @async task.
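For what it's worth, here is a self-contained sketch of that overlap pattern, with a `Pipe` standing in for the connection IO (not HTTP.jl's actual stream code). An `@async` task stays sticky to its thread, while `Threads.@spawn` would let the task migrate, which is the concern for thread-pinned libraries like OpenSSL:

```julia
# A Pipe stands in for the connection's IO object.
pipe = Pipe()
Base.link_pipe!(pipe; reader_supports_async = true, writer_supports_async = true)

# The writer runs concurrently with the read below on a sticky @async task.
writer = @async begin
    write(pipe, "request body")
    close(pipe.in)  # signal EOF so read() below can complete
end

data = read(pipe)   # overlaps with the writer task
wait(writer)        # re-raise any error from the writer task
@assert String(data) == "request body"
```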
