
Fix timeouts firing while tarballs are extracted #6130

Merged: 3 commits merged into rust-lang:master from alexcrichton:less-timeouts-u on Nov 6, 2018

Conversation

@alexcrichton (Member)

This commit fixes #6125 by ensuring that while we're extracting tarballs or doing other synchronous work, such as grabbing file locks, we're not letting the timeout timers of each HTTP transfer keep ticking. That is curl's default behavior, which we don't want in this scenario. Instead, the timeout logic is inlined directly and we manually ensure that time spent on synchronous work doesn't count towards the timeout limits.

Closes #6125
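
As a rough illustration of that approach (this is not Cargo's actual code; the type and method names below are hypothetical), the idea is to stop relying on curl's built-in timers and instead track each transfer's deadline manually, crediting back any time spent in synchronous work:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-transfer timeout tracker: rather than letting curl's
/// own timers run, we record when the transfer last made progress and can
/// "pause" the clock while Cargo does synchronous work.
struct TimeoutTracker {
    timeout: Duration,
    last_progress: Instant,
}

impl TimeoutTracker {
    fn new(timeout: Duration) -> TimeoutTracker {
        TimeoutTracker { timeout, last_progress: Instant::now() }
    }

    /// Called from curl's progress callback whenever bytes arrive.
    fn on_progress(&mut self) {
        self.last_progress = Instant::now();
    }

    /// True if the transfer went `timeout` without any progress.
    fn timed_out(&self) -> bool {
        self.last_progress.elapsed() > self.timeout
    }

    /// Run a blocking operation (tarball extraction, acquiring a file lock)
    /// and push the progress timestamp forward by however long it took, so
    /// the synchronous work doesn't count against the timeout.
    fn pause_for<T>(&mut self, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let result = work();
        self.last_progress += start.elapsed();
        result
    }
}
```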

@rust-highfive

r? @matklad

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton (Member, Author)

r? @ehuss

@SimonSapin (Contributor)

Can this synchronous work cause Cargo to pause reading bytes from a TCP socket that is in the middle of fetching another HTTP response? Not timing out unnecessarily in the client is good, but it’d also be good if we don’t give servers a reason to time out.

@alexcrichton (Member, Author)

It can indeed! To fix that we'll need to do tarball extraction in parallel.

@ehuss (Contributor) commented Oct 5, 2018

I've been looking through this. I don't think I fully understand the details, but it seems to work. Would this continue to be necessary if the extraction was done in a separate thread?

@alexcrichton (Member, Author)

It sort of depends on how we'd do extraction, but I think it would still be required, yeah. Right now, before we actually extract a tarball, we reacquire a file lock that was previously dropped after we saw that it didn't exist. This acquisition can take a long time and, if it does, a message is printed to the console saying we're waiting on a lock. We wouldn't want to fork off 5 tarball extractions that all print "Waiting on ..." at once, so I think the main thread will continue to be the one that acquires file locks.

To that end we can still have long pauses on the main thread that we'll need to account for when timing out.

I am slightly worried, though, that this feature is definitely ballooning in complexity quickly, although I'm not sure how that's best handled...

@bors (Collaborator) commented Oct 12, 2018

☔ The latest upstream changes (presumably #6166) made this pull request unmergeable. Please resolve the merge conflicts.

Commit messages from the three commits in this PR:

This is what curl recommends we use, so let's be sure to not mess up any timers by accident!

Previously retries were tracked per-session, which meant that if you were downloading 100 packages and they all failed spuriously for the same reason, you'd immediately abort. Instead, let's keep track of retries per package, so that if they're all coupled on one connection a failure of one will end up retrying all of them. We want to make sure that we actually retry again!

This commit fixes rust-lang#6125 by ensuring that while we're extracting tarballs or doing other synchronous work, such as grabbing file locks, we're not letting the timeout timers of each HTTP transfer keep ticking. That is curl's default behavior, which we don't want in this scenario. Instead, the timeout logic is inlined directly and we manually ensure that time spent on synchronous work doesn't count towards the timeout limits.

Closes rust-lang#6125
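
A minimal sketch of the per-package retry change described in the second commit message above; the names are hypothetical and a plain string stands in for Cargo's package id:

```rust
use std::collections::HashMap;

/// Track spurious-error retries per package rather than with one
/// session-wide counter, so a single flaky connection failing doesn't
/// exhaust the retry budget for every in-flight download at once.
struct Retries {
    remaining: HashMap<String, u32>, // keyed by package id, e.g. "serde 1.0.80"
    limit: u32,
}

impl Retries {
    fn new(limit: u32) -> Retries {
        Retries { remaining: HashMap::new(), limit }
    }

    /// Returns true if this package still has retry budget left after a
    /// spurious error, decrementing its own counter only.
    fn should_retry(&mut self, pkg: &str) -> bool {
        let left = self.remaining.entry(pkg.to_string()).or_insert(self.limit);
        if *left == 0 {
            return false;
        }
        *left -= 1;
        true
    }
}
```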
@alexcrichton (Member, Author)

Rebased!

@alexcrichton (Member, Author)

ping @ehuss, have you had more time to think about this? With the beta cutoff happening soon, I think we'll definitely want to either merge this or back out parallel downloads on the beta branch.

@ehuss (Contributor) commented Oct 17, 2018

My concern is that this seems to add a lot of complexity. If the core issue is that wait blocks too long, would an alternative that avoids blocking be reasonable? I have a prototype that does extraction in the background, and it seems to work well and resolves the original issue. It also has the benefit that it makes things faster for everyone. I can probably polish it off in a day or two, although I had some questions about it.

Also, I'm not sure if I see it being as critical, since parallel downloads is opt-in, and it only seems to have an issue in a pathological case. Unless I'm missing something? I would definitely opt to merge this over backing anything out.

@SimonSapin (Contributor)

> since parallel downloads is opt-in

Is it? When I hit #6125 I’m pretty sure I had upgraded the toolchain without opting into anything new.

@SimonSapin (Contributor)

Also you can call this case pathological but it was with a real crate, not a synthetic stress test.

@alexcrichton (Member, Author)

Oh so I should probably have been more clear, but #6005 actually did enable parallel downloads by default. It just disabled HTTP/2 by default! The parallel downloads part comes about from the restructuring of the code to use curl's Multi, and the parallelism by default comes from multiple connections to a domain rather than multiplexing over one connection with HTTP/2.

This indeed is a pathological case if we have one-at-a-time downloads, but with parallel I think it can actually happen somewhat commonly!

@ehuss this does remind me, though, that I could work on an alternative patch to switch to one-by-one downloads using the same infrastructure we have now, I think that'd be a much smaller patch compared to this. Do you think that'd be best to land for beta?
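
For context on the distinction described above, this is roughly how the two modes are expressed on the curl crate's Multi handle; the connection counts are illustrative, not the values Cargo actually uses:

```rust
use curl::multi::Multi;

// A hedged sketch, assuming the Multi handle's `pipelining` and
// `set_max_host_connections` setters. With HTTP/2, concurrency comes from
// multiplexing many transfers over one connection; without it, from
// opening several connections to the same host.
fn configure_multi(http2: bool) -> Result<Multi, curl::MultiError> {
    let mut multi = Multi::new();
    // The second argument enables HTTP/2 multiplexing on this handle.
    multi.pipelining(false, http2)?;
    // Cap the number of parallel connections per host.
    multi.set_max_host_connections(if http2 { 1 } else { 8 })?;
    Ok(multi)
}
```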

@ehuss (Contributor) commented Oct 17, 2018

Ah, for some reason I was thinking it only mattered that the http/2 connection was being held open, but I had missed the other scenario with unfinished transfers still in flight.

I think the background extraction would be the best approach. It helps everyone, and doesn't look too risky. Do you not agree?

(Also, I think being able to configure the concurrency would be good regardless of what else we do.)

@alexcrichton (Member, Author)

Yeah, background extraction is definitely the best way to go here long term, I think; I was just brainstorming other short-term solutions while that's being worked on. If you've got it almost ready to go, though, then we can do that instead!

@ehuss (Contributor) commented Oct 17, 2018

I can probably post it today or tomorrow, and we can see how risky it looks. I just had some questions, which I guess I'll just ask here:

  1. How generic should Sources be? Right now I implemented it as sending a closure to a thread, which allows the Source to decide exactly what to do. However, only one impl really matters (RegistrySource), and using closures makes it a little more complex. An alternative is to have the Source respond with (cache_path, src_path, filename) and the background thread would just be hard-coded to save and extract in those directories. That would be simpler and more straightforward, but maybe less flexible. Does either of these approaches sound better?

  2. How detailed should the status bar be? Right now, after all the downloads are finished, there might be some background extractions still in progress. Currently it just says "Extracting ..." without any more detail. The vast majority of the time that status will only display for a fraction of a second. However, in some cases (a very slow hard drive, very large crates), it might be displayed for a few seconds. Is it worth printing more information? If so, I would need to keep track of in-flight extractions (either just a count, or the names).

@alexcrichton (Member, Author)

While Source started out as uber-generic it's become much less so over time, so I'd lean towards the simpler solution of just passing a tuple/struct around instead of using generic closures. For the status bar I think just saying "extracting ..." is fine! We can always add more detailed information later if necessary
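
A minimal sketch of that simpler, non-generic shape, assuming a hypothetical ExtractJob struct sent over a channel to a single background thread (the actual tar extraction is elided):

```rust
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// Hypothetical message for the background extractor: the Source hands back
/// plain paths and data, and the thread hard-codes "save, then extract".
struct ExtractJob {
    cache_path: PathBuf, // where the downloaded .crate file is stored
    src_path: PathBuf,   // where the sources get unpacked
    filename: String,
    data: Vec<u8>,
}

fn spawn_extractor() -> mpsc::Sender<ExtractJob> {
    let (tx, rx) = mpsc::channel::<ExtractJob>();
    thread::spawn(move || {
        for job in rx {
            // Save the tarball into the cache directory...
            let tarball = job.cache_path.join(&job.filename);
            std::fs::write(&tarball, &job.data).expect("write tarball");
            // ...then unpacking `tarball` into `job.src_path` would go here.
            let _ = &job.src_path;
        }
    });
    tx
}
```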

@ehuss (Contributor) commented Oct 18, 2018

I spent some time cleaning up my background branch, and decided it is probably too risky. I'm unhappy with the way it turned out. Some problems I ran into:

  • Acquiring a file lock requires holding a Config (for shell output to say "Blocking"), and I don't know how to send that to a thread. So the file locking is still synchronous.
  • The progress bar was a little trickier than I anticipated. It now says "Downloading (extracting...)", which is unsatisfying. This could be improved, but it would complicate things quite a bit.
  • Downloading during builds still blocks for extraction in between each batch through the unit_dependencies loop. It's better than before, and it shouldn't cause timeout issues (because there shouldn't be any in-flight transfers), but it's a little weird.
  • The code still looks quite ugly to me.

On the bright side, it does fix the original issue, and it does make things slightly faster (~20% faster for that version of Servo on a slow hard drive). Here's the code if you want to look: ehuss@872bcb6. Let me know if you'd still be interested in a PR.

So I think I'll r+ this PR if you are OK with it. Or if you want to do something different, let me know and I'll try to help. I'm not sure whether by the one-by-one comment you meant just changing max_host_connections or something else. Sorry for the delay and flip-flopping.

@alexcrichton (Member, Author)

Ok, thanks for the info! I'm ok going with this version for now, and we can always figure out how to get async extraction later. I suspect that we'll want to go "full async" here eventually and basically have an async event loop (like curl already is) where work for tarball extraction and/or acquiring file locks is all farmed out to that async loop. That means that something like a download would simply return a Future from the Box<Source> and it'd internally capture everything like file locks, tarball extraction, curl download handling, etc.

That's a bit of a larger-scale refactor, but I think it's probably the best-looking implementation in terms of cleanliness and whatnot?
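
A rough sketch of what that direction could look like as a signature; everything here is hypothetical rather than an existing Cargo API:

```rust
use std::future::Future;
use std::path::PathBuf;
use std::pin::Pin;

// One event loop would drive these futures, which internally cover the file
// lock, the HTTP transfer, and the tarball extraction.
type DownloadFuture = Pin<Box<dyn Future<Output = Result<PathBuf, String>> + Send>>;

trait AsyncSource {
    /// Start downloading `pkg`; resolves to the unpacked source directory.
    fn download(&mut self, pkg: &str) -> DownloadFuture;
}
```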

@alexcrichton (Member, Author)

Oh, and for one-by-one: it's not actually as easy to retrofit as I thought it would be, but I don't think it'd be the end of the world to throw it in at the last second.

@ehuss (Contributor) commented Nov 3, 2018

@alexcrichton Have you decided what you want to do about this?

@alexcrichton (Member, Author)

Oh jeez, I've sort of forgotten the context here... I think I'd probably prefer to land this as-is at this point; I'm finding it hard to keep all the other possibilities in my head :(

@ehuss (Contributor) commented Nov 6, 2018

@bors r+

Yeah, I don't see anything specifically wrong with this other than that it is complicated. Maybe someday in the future it will be easier to extract asynchronously, which might make this less necessary.

@bors (Collaborator) commented Nov 6, 2018

📌 Commit 0b0f089 has been approved by ehuss

@bors (Collaborator) commented Nov 6, 2018

⌛ Testing commit 0b0f089 with merge 022bfa397829fdbcbd283fd11aa2898402c0b56d...

@bors (Collaborator) commented Nov 6, 2018

💔 Test failed - status-appveyor

@alexcrichton (Member, Author)

@bors (Collaborator) commented Nov 6, 2018

⌛ Testing commit 0b0f089 with merge f938c4f...

bors added a commit that referenced this pull request Nov 6, 2018
Fix timeouts firing while tarballs are extracted

@bors (Collaborator) commented Nov 6, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: ehuss
Pushing f938c4f to master...

@bors bors merged commit 0b0f089 into rust-lang:master Nov 6, 2018
@alexcrichton alexcrichton deleted the less-timeouts-u branch November 6, 2018 21:02
bors added a commit that referenced this pull request Nov 6, 2018
[beta]: Fix timeouts firing with tarball extraction

This is a backport of #6130 to the 1.31.0 branch
alexcrichton added a commit to alexcrichton/cargo that referenced this pull request Nov 8, 2018
This commit switches the timeout logic implemented in rust-lang#6130 to time out an entire batch of downloads instead of each download individually. Previously, if *any* pending download didn't receive data in 30s we would time out, or if *any* pending download didn't receive 10 bytes in 30s we would time out. On very slow network connections this is highly likely to happen, as a trickle of incoming bytes may not be spread equally amongst all connections, and not all connections may actually be active at any one point in time.

The fix is to instead apply the timeout logic to an entire batch of downloads: we only time out if no data at all is received within the timeout window. In other words, if any data is received for any download, the batch is considered not to have timed out. Similarly, progress on any download counts towards our speed limit.

Closes rust-lang#6284
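
A minimal sketch of the batch-level timeout described above, with hypothetical names; the key point is that progress on any download feeds a single shared timer:

```rust
use std::time::{Duration, Instant};

/// One timeout shared by all in-flight downloads: a trickle of bytes on any
/// connection keeps the whole batch alive.
struct BatchTimeout {
    window: Duration,        // e.g. 30 seconds
    low_speed_limit: u64,    // bytes that must arrive within `window`
    bytes_since_reset: u64,
    last_reset: Instant,
}

impl BatchTimeout {
    fn new(window: Duration, low_speed_limit: u64) -> BatchTimeout {
        BatchTimeout {
            window,
            low_speed_limit,
            bytes_since_reset: 0,
            last_reset: Instant::now(),
        }
    }

    /// Called from any download's write callback; data from any transfer
    /// counts towards the shared speed limit.
    fn data_received(&mut self, bytes: u64) {
        self.bytes_since_reset += bytes;
        if self.bytes_since_reset >= self.low_speed_limit {
            self.bytes_since_reset = 0;
            self.last_reset = Instant::now();
        }
    }

    /// The batch times out only if no download met the speed limit within
    /// the window.
    fn timed_out(&self) -> bool {
        self.last_reset.elapsed() > self.window
    }
}
```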
bors added a commit that referenced this pull request Nov 9, 2018
Timeout batch downloads, not each download

alexcrichton added a commit to alexcrichton/cargo that referenced this pull request Nov 9, 2018
@ehuss modified the milestones: 1.32.0, 1.31.0 (Feb 6, 2022)

Successfully merging this pull request may close these issues:

Timeout while fetching Servo’s dependencies on Windows