Summary of changes in this PR: #1034
Conversation
Looking at the numbers in the Notion doc (taking the peak reported Gbps for the sake of having a single-number comparison), it looks like MbedTLS might be slowed down by these changes (on test64, test256, test1) -- any ideas why that might be?
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #1034      +/-   ##
==========================================
+ Coverage   85.43%   85.48%   +0.04%
==========================================
  Files          37       32       -5
  Lines        3063     3018      -45
==========================================
- Hits         2617     2580      -37
+ Misses        446      438       -8
```
Just a couple of notes for now; I think this looks good so far! I think the change to use LIFO in the connection pools will serve us really well under heavy loads, but I wonder if there are circumstances in which it will struggle a bit, with some connections getting stale/underutilized. Not an expert in this area by any means...
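To make the concern concrete, here's a generic LIFO pool sketch (illustrative only, not the ConcurrentUtilities.jl API): the most recently released connection is always handed out first, so under light load the entries at the bottom of the stack can sit idle long enough to go stale unless something closes them.

```julia
# Generic LIFO pool sketch (hypothetical, for illustration of the reuse order).
struct LifoPool{T}
    conns::Vector{T}   # treated as a stack
end
LifoPool{T}() where {T} = LifoPool{T}(T[])

# Take the most recently released ("hottest") connection, if any.
acquire!(pool::LifoPool) = isempty(pool.conns) ? nothing : pop!(pool.conns)

# Return a connection to the top of the stack; entries near the bottom are
# only reached when demand is high, so they may idle and go stale.
release!(pool::LifoPool{T}, conn::T) where {T} = push!(pool.conns, conn)
```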
* Finally achieving my dream of moving the connection pool out of HTTP; it's going to live in the [ConcurrentUtilities.jl](JuliaServices/ConcurrentUtilities.jl#18) package instead. In short, it had no unit tests, scant documentation, and was generally a pain to deal with in HTTP. We also discovered at least 2 major issues with the current implementation during a deep dive into performance and issue diagnosis, including:
  * If a ConnectError was thrown while creating a connection, a "permit" to the pool was permanently lost; get enough of them and the entire connection pool grinds to a halt :grimacing:
  * Being a threadsafe structure, the pool had the poor choice of holding the pool lock for the duration of trying to get a permit _and making new connections_. This meant that in the case of a connection taking a long time to be made, we were essentially holding the rest of the pool hostage. This is totally unnecessary and can cause big performance issues in really heavy workloads where there's lots of contention on the pool for managing requests.

  The new implementation in ConcurrentUtilities.jl solves both these problems while being about 1/4th the LOC of the previous implementation. And it has unit tests! yay! All in all, I think #1033 and #1032 should both be mostly resolved by these changes/updates.
* Relatedly, we're adjusting the API for connection pools to allow the user to pass in their _own_ connection pool to be used for that request (to check for a connection to reuse and to return the connection to after completion). A pool can be constructed like `HTTP.Pool(; max::Int)` and passed to any of the `HTTP.request` methods like `HTTP.get(...; pool=pool)` (see the sketch after this list). HTTP has its own global default pool `HTTP.Connections.POOL` that it uses by default to manage connection reuse. `HTTP.set_default_connection_limit!` will still work as long as it is called before any requests are made; calling it _after_ requests have been made will be a no-op. The `connection_limit` keyword arg is now formally deprecated and will issue a warning if passed. I'm comfortable with a full deprecation here because it turns out it wasn't even really working before anyway (unless it was passed/used on _every_ request and never changed). So instead of "changing" things, we're really just doing a proper implementation that now actually works, has better behavior, and is actually controllable by the user.
* Added a try-finally in `keepalive!` around our global IO lock usage, just for good housekeeping.
* Refactored `try_with_timeout` to use a `Channel` instead of the non-threaded `@async`; it's much simpler and seems cleaner.
* I refactored a few of the stream IO functions so that we always know the number of bytes downloaded, whether in memory or written to an IO, so we can log them and use them in verbose logging to give bit-rate calculations.
* Added a new `logerrors::Bool=false` keyword arg that allows doing `@error` logs on errors that may otherwise be "swallowed" when doing retries; it can be helpful to sometimes be able to at least see what kinds of errors are happening. Also cleaned up our error handling in general so we don't lose backtraces, which fixes #1003.
* Added lots of metrics around time spent in various layers, read vs. write durations, etc. These can be enabled, and stored in the request context, by passing `observelayers=true`. This mostly resolves #1025 and #1019.
* Fixed some missing keyword args that either weren't correct in the inline docs or weren't included in the client.md docs.
* Removed a few client-side layers (BasicAuth, DebugRequest, Canonicalize, etc.) since their implementations were _really_ trivial, and moved their functionality into a single HeadersRequest layer where we now do all the header-related work during the request. This has the effect of _drastically_ reducing the exploding backtraces we now have when doing requests because of our "functional style" client-side layering.
* Fixed #612 by ensuring `keepalive` is included in our `connectionkey` computation, so a connection that specified `keepalive` will only be reused if it's passed the same value on subsequent requests.
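For a concrete picture of the new knobs described above, here's a small sketch; the `Pool` constructor keyword (`max`) and the exact context field names follow the wording above and should be treated as illustrative rather than a final API reference.

```julia
using HTTP

# A dedicated pool for this workload, capping how many connections it holds
# (constructor keyword per the description above; illustrative).
pool = HTTP.Pool(; max=10)

resp = HTTP.get("https://example.com";
    pool=pool,           # check this pool for a reusable connection, return it afterwards
    logerrors=true,      # @error-log errors that retries might otherwise swallow
    observelayers=true)  # record per-layer timing metrics in the request context

# With observelayers=true, timing metrics (e.g. read/write durations)
# are stored in the request context dictionary.
@show resp.request.context
```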
I think this looks good :) But I'd like to review it again once we get CI green and the ConcurrentUtilities.jl dependency updated.
lovely stuff -- left some mostly cosmetic comments
A dict of global connection pools keeping track of active connections, split by their IO type.
Connection pool for managing the reuse of HTTP connections.
should we maybe name this `ConnectionPool` (rather than `Pool`)?
It's a good thought, but this module actually used to be named `ConnectionPool` and I changed it to `Connections`, but then I was worried about possible breakage if people were using the `HTTP.ConnectionPool` submodule, so I put back an alias `const ConnectionPool = Connections`, since a juliahub search showed that there are some packages out there directly referencing it.
oh i meant the type, not the module
Well, that's the problem: the module is already named `ConnectionPool`, so we can't name the type that.
Exciting improvement! Any results to share about the `PrefetchedDownloadStream`?
- `retry = true`, retry idempotent requests in case of error.
- `retries = 4`, number of times to retry.
- `retry_non_idempotent = false`, retry non-idempotent requests too. e.g. POST.
- `retry_delay = ExponentialBackOff(n = retries)`, provide a custom `ExponentialBackOff` object to control the delay between retries.
- `retry_delays = ExponentialBackOff(n = retries)`, provide a custom `ExponentialBackOff` object to control the delay between retries.
Do we want to force `length(retry_delays) == retries`?
Yeah, that sounds like a good idea. Or if you provide `retry_delays`, then that will override the `retries` keyword arg?
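As a hedged illustration of that interplay (assuming `retry_delays` accepts a `Base.ExponentialBackOff` as the docs line above suggests):

```julia
using HTTP

# Four retries with exponentially growing delays between attempts; whether
# `retries` is still needed alongside `retry_delays` is exactly the open
# question above, so both are shown here.
delays = ExponentialBackOff(n=4, first_delay=0.5, max_delay=10.0)
resp = HTTP.get("https://example.com"; retries=4, retry_delays=delays)
```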
req.context[:write_duration_ms] = get(req.context, :write_duration_ms, 0.0) + ((time() - write_start) * 1000)
end
read_start = time()
@async try
I assume the `@async`s here have to do with today's discussion about the OpenSSL thread-pinning requirement. Since we already specialize on the type of IO in the `Stream`, do we want to special-case the IO types that don't need `@async` here and could migrate? Anyway, a comment about the `@async` usage would be helpful.
Well, the `@async`s here are actually harmless and are just meant to allow the writing/reading to happen simultaneously. It's a good question, though, whether these should be `@spawn` or whether that makes a difference if we're already in an `@async` task.
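For context, the pattern being discussed is roughly this (a minimal sketch with hypothetical variable names, not the actual HTTP.jl internals):

```julia
# Overlap writing the request body with reading the response: the write runs
# in its own task while the current task reads, so neither blocks the other.
write_task = @async begin
    write(io, request_body)   # hypothetical `io` and `request_body`
end
response_bytes = read(io)     # proceeds concurrently with the write task
wait(write_task)              # re-raises any error from the write task
```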