
fix(buck2_common): be less aggressive about HTTP retry warnings #321

Closed
wants to merge 1 commit

Conversation

thoughtpolice
Contributor

Summary: When downloading from sites like crates.io, we can pretty easily get rate-limited when a lot of our crates are uncached. Don't be so aggressive about warning; most small backoffs won't be too noticeable.

As a note, we don't emit the error message we get from the callstack here; the logic is that it's already a retriable error, so it doesn't make much sense to spam an error that we'd probably get again anyway and that we're ultimately going to retry. This should just make the UX a little more polished.

Also emit the domain name we're hitting problems with.

Test Plan: Change the new warning conditional from 30 seconds to zero seconds. Build buck2 with buck2. Observe the new error message, which helpfully points out we're connecting to crates.io. Change the conditional back to 30 seconds. Rebuild buck2 with buck2. Note that no warnings appear, because the exponential backoff only results in a few small 2-second retries.

GitHub Issue: Fixes #316

Signed-off-by: Austin Seipp <aseipp@pobox.com>
Change-Id: Ixoxuusxlkkrspllqqvukzrumnrytmpko
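
A minimal sketch of the behavior the summary describes, assuming a helper that receives the request URL and the next backoff; the names (maybe_warn_on_backoff, WARN_THRESHOLD) are illustrative and not buck2's actual code:

use std::time::Duration;
use url::Url;

// Illustrative only: surface a retry warning only once the backoff is long
// enough to be noticeable, and report just the host rather than the full URL
// or the underlying error.
const WARN_THRESHOLD: Duration = Duration::from_secs(30);

fn maybe_warn_on_backoff(url: &str, backoff: Duration, retries: u32) {
    if backoff < WARN_THRESHOLD {
        // A few small 2-second backoffs (e.g. against crates.io) stay silent.
        return;
    }
    let host = Url::parse(url)
        .ok()
        .and_then(|u| u.host_str().map(str::to_owned))
        .unwrap_or_else(|| "<unknown host>".to_owned());
    tracing::warn!(
        host = %host,
        retries = retries,
        "HTTP request failed; retrying in {} seconds",
        backoff.as_secs(),
    );
}
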
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 1, 2023
@stepancheg stepancheg requested a review from krallin July 1, 2023 22:15
@krallin
Contributor

krallin commented Jul 4, 2023

Not a fan of trying to parse the host out of the URL, and doing so with an unwrap here. Can we just remove that?

Internally, we also wouldn't actually want this as-is, since hiding the URL makes the message pretty useless as a whole.

@krallin krallin left a comment

I'm a bit concerned that we're really far away from what you're looking for here. It's going to be hard to reconcile those two things with small adjustments.

--

I think I'd go a different route here. I'd suggest adding an event (in buck2.proto) for this and then just dispatching that to the log file but not producing any user-visible output. This way we'd retain debug capabilities without actually being verbose.

This is unfortunately a bit tricky to do because this code gets invoked (mostly) in places that don't have an event dispatcher set. There is a bit of prior art to this, which is to just dispatch to all connected commands (which is mostly fine as far as debugging is concerned):

https://github.com/facebook/buck2/blob/main/app/buck2/src/panic.rs#L163-L176

Unfortunately this is still not great because some places don't have that either (mostly the log upload places)

So I think what I'd suggest doing is (roughly sketched below):

  • Create a trait for dispatching this warning
  • Have an implementation that fires off an event as noted above
    • Use it in materialize/http.rs
  • Have an implementation that just logs
    • Use it in manifold.rs
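
A rough sketch of the shape this suggests, using hypothetical names (RetryWarningReporter, EventReporter, LogReporter) rather than anything that exists in buck2 today; the event dispatch call is a stand-in, not a real API:

use std::time::Duration;

// Hypothetical trait abstracting how an HTTP retry warning gets reported.
trait RetryWarningReporter: Send + Sync {
    fn report(&self, host: &str, retries: u32, backoff: Duration);
}

// For materialize/http.rs: build a structured event (e.g. one added to
// buck2.proto) and dispatch it to connected commands' event logs, producing
// no user-visible console output.
struct EventReporter;

impl RetryWarningReporter for EventReporter {
    fn report(&self, host: &str, retries: u32, backoff: Duration) {
        // dispatch_to_connected_commands(HttpRetryWarning { host, retries, backoff });
        let _ = (host, retries, backoff);
    }
}

// For manifold.rs, where no event dispatcher is available: just log.
struct LogReporter;

impl RetryWarningReporter for LogReporter {
    fn report(&self, host: &str, retries: u32, backoff: Duration) {
        tracing::debug!(
            host = %host,
            retries = retries,
            "retrying failed HTTP request in {} seconds",
            backoff.as_secs(),
        );
    }
}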

Comment on the diff hunk around the new let secs = b.as_secs(); line:

b.as_secs(),
http_error
);
let secs = b.as_secs();

Could we maybe check elapsed + this backoff, and then compare that against the 30 second threshold? I'm a bit concerned about not showing anything until we start facing a very large backoff.
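
Something along these lines, assuming the retry loop tracks when it started; the names here are illustrative only:

use std::time::{Duration, Instant};

// Warn based on total time spent retrying so far plus the upcoming backoff,
// rather than on the size of any single backoff.
fn should_warn(started: Instant, next_backoff: Duration, threshold: Duration) -> bool {
    started.elapsed() + next_backoff >= threshold
}

// e.g. should_warn(started, backoff, Duration::from_secs(30))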


I also note that I can't find any places that retry for 30 seconds, so this really just amounts to deleting the log message.

Comment on lines +82 to +86
tracing::warn!(
host = host,
retries = retries,
"Egregious HTTP backoff - retrying slow request in {} seconds",
secs,

  • Can we avoid calling the backoff "egregious"?
  • Also, note that we are not retrying a slow request here; rather, we're retrying a failed one
  • We shouldn't hide the error here (one possible rewording is sketched below)
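
One possible rewording along those lines (illustrative fragment only; host, retries, and secs are the variables from the hunk above, and err stands for whatever retryable error was just hit):

tracing::warn!(
    host = %host,
    retries = retries,
    "HTTP request failed ({}); retrying in {} seconds",
    err,
    secs,
);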

@krallin krallin self-assigned this Jul 4, 2023
@thoughtpolice
Contributor Author

thoughtpolice commented Jul 4, 2023

Not a fan of trying to parse the host out of the URL, and doing so with an unwrap here. Can we just remove that?

Internally, we also wouldn't actually want this as-is, since hiding the URL makes the message pretty useless as a whole.

I'm confused: do you mean the whole URL or just the host? In the open source version (I don't know if error propagation is different inside Meta), there was never any indication of what the URL or the host was when this warning popped up, even before I wrote this patch; the only thing you saw was typically the Debug/Display output of a big http::streaming_error of some kind, which is frankly useless. It would be better just to say "Retrying at URL http://..." even if we do nothing else to improve this.

Actually, the only reason I knew my initial throttling was from crates.io wasn't the message; it was because there weren't any other hosts I was downloading from when bootstrapping buck2, so it was pretty clear there was only one culprit (and it was also the only host I would reasonably be downloading ~150 things from).

I also don't think the URL is actually important, and I deliberately chose to show only the host here. I also chose not to show the HTTP error itself anymore, for the same reasons. What reasons would those be? There are actually a few:

  • The URL can be quite long, and superconsole doesn't seem to handle wide output like this very well. What actually happens is that it overflows at something like 80 columns in Windows Terminal, pushing an extra line onto the screen and causing the whole UI to "violently" repaint the build UX. A really wide message is jarring; part of the motivation here was that the displayed error was hundreds of characters long and would repeat 5 or 6 times, so ~10 extra lines would constantly appear and disappear in the terminal, making the UX significantly worse...
    • Yes, this is strictly a separate issue, but I'd rather not open a recursive stack of 20 extra tasks just to clean up this one thing unless it's absolutely necessary, and it doesn't seem to be...
  • The host is what is throttling you, not necessarily the URL! Yes, there is some overlap between the two (e.g. when an operator sees a hot resource causing hot-key load and puts up a rate limiter or load shedding on that one key). But generally speaking I don't think that's the common case; in 99.99% of the places you download from, throttling is a host-level setting, not a per-item one. But more importantly...
  • It isn't an error! If the backoff works and the service eventually lets you succeed, then there was no error and the URL isn't relevant. If there is a failure because we give up after retrying too many times, then you will get the URL along with a loud explosion when the build fails, which tells you exactly which URL failed after too many attempts. (Really, the URL isn't that relevant even then, because it could be any URL that happened to hit the limit last.) The message by itself isn't an error; it's merely an informative signal that the host has throttled you multiple times. That still doesn't mean anything will fail!
    • I think of this in a kind of "queueing theory" manner. If you have a queue of actions and one of them "drops out", there are generally two outcomes: the participant gets back in the queue, in which case the problem was transient and not worth yelling about (i.e. the attempt is retried), or the message is gone for good, typically because it wasn't important enough to retry at all. In our case the second outcome never really happens, but the "get out of the queue and get back in" one does.
    • Ultimately, if a retry occurs and succeeds two seconds later, as it did in #316 ("Warnings that occur when HTTP2 downloads are throttled are a little bit aggressive"), then what I'm left wondering as a user is: why was I warned about this at all? It's not a big deal!

This is generally how I approach retriable errors: retry silently until the problem is pretty obviously visible, and only fail loudly when you can no longer do that. That said, I know not everyone will agree with that; some people always want the full context even on warnings, and the host alone may not seem like enough information. So if you disagree with this I understand, but it wasn't an entirely arbitrary choice, either.

@thoughtpolice
Contributor Author

thoughtpolice commented Jul 4, 2023

Also, I guess this doesn't cover the case where a host can cut you off before 30 seconds have even elapsed, so yeah, the window is totally arbitrary. I'll take any solution that doesn't spam 10 logs a second into superconsole and make the build UX look awful, though, and that doesn't distract me with relatively benign issues...

@thoughtpolice
Contributor Author

thoughtpolice commented Jul 4, 2023

I think I'd go a different route here. I'd suggest adding an event (in buck2.proto) for this and then just dispatching that to the log file but not producing any user-visible output. This way we'd retain debug capabilities without actually being verbose.

This is unfortunately a bit tricky to do because this code gets invoked (mostly) in places that don't have an event dispatcher set. There is a bit of prior art to this, which is to just dispatch to all connected commands (which is mostly fine as far as debugging is concerned):

https://github.com/facebook/buck2/blob/main/app/buck2/src/panic.rs#L163-L176

Unfortunately this is still not great because some places don't have that either (mostly the log upload places)

So I think what I'd suggest doing is:

  • Create a trait for dispatching this warning
  • Have an implementation that fires off an event as noted above
    • Use it in materialize/http.rs
  • Have an implementation that just logs
    • Use it in manifold.rs

Alright, this does look like a better plan of attack (I somehow missed this comment in favor of the original one, I guess because I clicked through from my email to reply). I'll drop this for now and see if I can get that implementation working, thanks!

@cjhopman
Contributor

cjhopman commented Jul 5, 2023

@thoughtpolice somewhat unrelated but I noticed this comment:

What actually happens is that it overflows at something like 80 columns in Windows Terminal, pushing an extra line onto the screen and causing the whole UI to "violently" repaint the build UX

That doesn't sound very good, and I don't think we see behavior like that on other platforms.

I know the console rendering works quite differently on Windows (AIUI it uses actual console control commands instead of encoding them as control codes in the output stream), so it's not entirely surprising that the behavior differs (though AIUI, Windows terminals may support terminal control codes these days).

Anyway, have you reported this behavior as a separate issue anywhere? If you could do that, that'd be great (and if you could get a screen capture of it, even better).
