
fix(buck2_common): be less aggressive about HTTP retry warnings #321

Closed
wants to merge 1 commit

Conversation

thoughtpolice
Contributor

Summary: When downloading from sites like crates.io, we can pretty easily get rate-limited when a lot of our crates are uncached. Don't be so aggressive about warning; most small backoffs won't be too noticeable.

As a note, we don't emit the error message we get from the callstack here; the logic is that it's already a retriable error, so it doesn't make much sense to spam an error that we'd probably get again anyway and that we're ultimately going to retry. This should just make the UX a little more polished.

Also emit the domain name we're hitting problems with.

Test Plan: Change the new warning conditional from 30 seconds to zero seconds. Build buck2 with buck2. Observe the new error message, which helpfully points out we're connecting to crates.io. Change the conditional back to 30 seconds. Rebuild buck2 with buck2. Note that no warnings appear, because the exponential backoff only results in a few small 2-second retries.

GitHub Issue: Fixes #316

Signed-off-by: Austin Seipp <aseipp@pobox.com>
Change-Id: Ixoxuusxlkkrspllqqvukzrumnrytmpko
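
A minimal sketch of the behavior the summary describes, assuming a helper that receives the request URL and the next backoff; the names (maybe_warn_on_backoff, WARN_THRESHOLD) are illustrative and not buck2's actual code:

use std::time::Duration;
use url::Url;

// Illustrative only: surface a retry warning only once the backoff is long
// enough to be noticeable, and report just the host rather than the full URL
// or the underlying error.
const WARN_THRESHOLD: Duration = Duration::from_secs(30);

fn maybe_warn_on_backoff(url: &str, backoff: Duration, retries: u32) {
    if backoff < WARN_THRESHOLD {
        // A few small 2-second backoffs (e.g. against crates.io) stay silent.
        return;
    }
    let host = Url::parse(url)
        .ok()
        .and_then(|u| u.host_str().map(str::to_owned))
        .unwrap_or_else(|| "<unknown host>".to_owned());
    tracing::warn!(
        host = %host,
        retries = retries,
        "HTTP request failed; retrying in {} seconds",
        backoff.as_secs(),
    );
}
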
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 1, 2023
@stepancheg stepancheg requested a review from krallin July 1, 2023 22:15
@krallin
Contributor

krallin commented Jul 4, 2023

Not a fan of trying to parse the host out of the URL, and doing so with an unwrap here. Can we just remove that?

Internally, we also wouldn't actually want this as-is, since hiding the URL makes the message pretty useless as a whole.

@krallin krallin left a comment

I'm a bit concerned that we're really far away from what you're looking for here. It's going to be hard to reconcile those two things with small adjustments.

--

I think I'd go a different route here. I'd suggest adding an event (in buck2.proto) for this and then just dispatching that to the log file but not producing any user-visible output. This way we'd retain debug capabilities without actually being verbose.

This is unfortunately a bit tricky to do because this code gets invoked (mostly) in places that don't have an event dispatcher set. There is a bit of prior art to this, which is to just dispatch to all connected commands (which is mostly fine as far as debugging is concerned):

https://github.com/facebook/buck2/blob/main/app/buck2/src/panic.rs#L163-L176

Unfortunately this is still not great because some places don't have that either (mostly the log upload places)

So I think what I'd suggest doing is (roughly sketched below):

  • Create a trait for dispatching this warning
  • Have an implementation that fires off an event as noted above
    • Use it in materialize/http.rs
  • Have an implementation that just logs
    • Use it in manifold.rs
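
A rough sketch of the shape this suggests, using hypothetical names (RetryWarningReporter, EventReporter, LogReporter) rather than anything that exists in buck2 today; the event dispatch call is a stand-in, not a real API:

use std::time::Duration;

// Hypothetical trait abstracting how an HTTP retry warning gets reported.
trait RetryWarningReporter: Send + Sync {
    fn report(&self, host: &str, retries: u32, backoff: Duration);
}

// For materialize/http.rs: build a structured event (e.g. one added to
// buck2.proto) and dispatch it to connected commands' event logs, producing
// no user-visible console output.
struct EventReporter;

impl RetryWarningReporter for EventReporter {
    fn report(&self, host: &str, retries: u32, backoff: Duration) {
        // dispatch_to_connected_commands(HttpRetryWarning { host, retries, backoff });
        let _ = (host, retries, backoff);
    }
}

// For manifold.rs, where no event dispatcher is available: just log.
struct LogReporter;

impl RetryWarningReporter for LogReporter {
    fn report(&self, host: &str, retries: u32, backoff: Duration) {
        tracing::debug!(
            host = %host,
            retries = retries,
            "retrying failed HTTP request in {} seconds",
            backoff.as_secs(),
        );
    }
}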

Comment on the diff hunk around the new let secs = b.as_secs(); line:

b.as_secs(),
http_error
);
let secs = b.as_secs();

Could we maybe check elapsed + this backoff, and then compare that against the 30 second threshold? I'm a bit concerned about not showing anything until we start facing a very large backoff.
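
Something along these lines, assuming the retry loop tracks when it started; the names here are illustrative only:

use std::time::{Duration, Instant};

// Warn based on total time spent retrying so far plus the upcoming backoff,
// rather than on the size of any single backoff.
fn should_warn(started: Instant, next_backoff: Duration, threshold: Duration) -> bool {
    started.elapsed() + next_backoff >= threshold
}

// e.g. should_warn(started, backoff, Duration::from_secs(30))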


I also note that I can't find any places that retry for 30 seconds, so this really just amounts to deleting the log message.

Comment on lines +82 to +86
tracing::warn!(
host = host,
retries = retries,
"Egregious HTTP backoff - retrying slow request in {} seconds",
secs,

  • Can we avoid calling the backoff "egregious"?
  • Also, note that we are not retrying a slow request here; rather, we're retrying a failed one
  • We shouldn't hide the error here (one possible rewording is sketched below)
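
One possible rewording along those lines (illustrative fragment only; host, retries, and secs are the variables from the hunk above, and err stands for whatever retryable error was just hit):

tracing::warn!(
    host = %host,
    retries = retries,
    "HTTP request failed ({}); retrying in {} seconds",
    err,
    secs,
);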

@krallin krallin self-assigned this Jul 4, 2023
@thoughtpolice
Contributor Author

thoughtpolice commented Jul 4, 2023

Not a fan of trying to parse the host out of the URL, and doing so with an unwrap here. Can we just remove that?

Internally, we also wouldn't actually want this as-is, since hiding the URL makes the message pretty useless as a whole.

I'm confused: do you mean the whole URL or just the host? In the open source version (I don't know if error propagation is different inside Meta), there was never any indication of what the URL or the host was when this warning popped up, even before I wrote this patch; the only thing you saw was typically the Debug/Display output of a big http::streaming_error of some kind, which is frankly useless. It would be better just to say "Retrying at URL http://..." even if we do nothing else to improve this.

Actually, the only reason I knew my initial throttling was from crates.io wasn't the message; it was because there weren't any other hosts I was downloading from when bootstrapping buck2, so it was pretty clear there was only one culprit (and it was also the only host I would reasonably be downloading ~150 things from).

I also don't think the URL is actually important, and I deliberately chose to show only the host here. I also chose not to show the HTTP error itself anymore, for the same reasons. What reasons would those be? There are actually a few:

  • The URL can be quite long, and superconsole doesn't seem to handle wide output like this very well. What actually happens is that it overflows at something like 80 columns in Windows Terminal, pushing an extra line onto the screen and causing the whole UI to "violently" repaint the build UX. A really wide message is jarring; part of the motivation here was that the displayed error was hundreds of characters long and would repeat 5 or 6 times, so ~10 extra lines would constantly appear and disappear in the terminal, making the UX significantly worse...
    • Yes, this is strictly a separate issue, but I'd rather not open a recursive stack of 20 extra tasks just to clean up this one thing unless it's absolutely necessary, and it doesn't seem to be...
  • The host is what is throttling you, not necessarily the URL! Yes, there is some overlap between the two (e.g. when an operator sees a hot resource causing hot-key load and puts up a rate limiter or load shedding on that one key). But generally speaking I don't think that's the common case; in 99.99% of the places you download from, throttling is a host-level setting, not a per-item one. But more importantly...
  • It isn't an error! If the backoff works and the service eventually lets you succeed, then there was no error and the URL isn't relevant. If there is a failure because we give up after retrying too many times, then you will get the URL along with a loud explosion when the build fails, which tells you exactly which URL failed after too many attempts. (Really, the URL isn't that relevant even then, because it could be any URL that happened to hit the limit last.) The message by itself isn't an error; it's merely an informative signal that the host has throttled you multiple times. That still doesn't mean anything will fail!
    • I think of this in a kind of "queueing theory" manner. If you have a queue of actions and one of them "drops out", there are generally two outcomes: the participant gets back in the queue, in which case the problem was transient and not worth yelling about (i.e. the attempt is retried), or the message is gone for good, typically because it wasn't important enough to retry at all. In our case the second outcome never really happens, but the "get out of the queue and get back in" one does.
    • Ultimately, if a retry occurs and succeeds two seconds later, as it did in #316 ("Warnings that occur when HTTP2 downloads are throttled are a little bit aggressive"), then what I'm left wondering as a user is: why was I warned about this at all? It's not a big deal!

This is generally how I approach retriable errors: retry silently until the problem is pretty obviously visible, and only fail loudly when you can no longer do that. That said, I know not everyone will agree with that; some people always want the full context even on warnings, and the host alone may not seem like enough information. So if you disagree with this I understand, but it wasn't an entirely arbitrary choice, either.

@thoughtpolice
Contributor Author

thoughtpolice commented Jul 4, 2023

Also, I guess this doesn't cover the case where a host can cut you off before 30 seconds have even elapsed, so yeah, the window is totally arbitrary. I'll take any solution that doesn't spam 10 logs a second into superconsole and make the build UX look awful, though, and that doesn't distract me with relatively benign issues...

@thoughtpolice
Contributor Author

thoughtpolice commented Jul 4, 2023

I think I'd go a different route here. I'd suggest adding an event (in buck2.proto) for this and then just dispatching that to the log file but not producing any user-visible output. This way we'd retain debug capabilities without actually being verbose.

This is unfortunately a bit tricky to do because this code gets invoked (mostly) in places that don't have an event dispatcher set. There is a bit of prior art to this, which is to just dispatch to all connected commands (which is mostly fine as far as debugging is concerned):

https://github.com/facebook/buck2/blob/main/app/buck2/src/panic.rs#L163-L176

Unfortunately this is still not great because some places don't have that either (mostly the log upload places)

So I think what I'd suggest doing is:

  • Create a trait for dispatching this warning
  • Have an implementation that fires off an event as noted above
    • Use it in materialize/http.rs
  • Have an implementation that just logs
    • Use it in manifold.rs

Alright, this does look like a better plan of attack (I somehow missed this comment in favor of the original one, I guess because I clicked through from my email to reply). I'll drop this for now and see if I can get that implementation working, thanks!

@cjhopman
Contributor

cjhopman commented Jul 5, 2023

@thoughtpolice somewhat unrelated but I noticed this comment:

What actually happens is that it overflows at something like 80 columns in Windows Terminal, pushing an extra line onto the screen and causing the whole UI to "violently" repaint the build UX

That doesn't sound very good, and I don't think we see behavior like that on other platforms.

I know the console rendering works quite differently on Windows (AIUI it uses actual console control commands instead of encoding them as control codes in the output stream), so it's not entirely surprising that the behavior differs (though AIUI, Windows terminals may support terminal control codes these days).

Anyway, have you reported this behavior as a separate issue anywhere? If you could do that, that'd be great (and if you could get a screen capture of it, even better).
