-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(buck2_common): be less aggressive about HTTP retry warnings #321
Conversation
Summary: When downloading from sites like crates.io, we can pretty easily get rate-limited when a lot of our crates are uncached. Don't be so aggressive about warning; most small backoffs won't be too noticeable. As a note, we _don't_ emit the error message we get from the callstack here in the message; the logic is that it's already a retriable error, so it doesn't make much sense to spam an error that we'd probably get again anyway, and that we ultimately are going to re-attempt. This should just make the UX a little more polished. Also emit the domain name we're hitting problems with. Test Plan: Change the new warning conditional from 30 seconds to zero seconds. Build buck2 with buck2. Observe the new error message, which helpfully points out we're connecting to crates.io. Change the conditional back to 30 seconds. Rebuild buck2 with buck2. Note that no warnings appear, because the exponential backoff only results in a few small 2-second retries. GitHub Issue: Fixes facebook#316 Signed-off-by: Austin Seipp <aseipp@pobox.com> Change-Id: Ixoxuusxlkkrspllqqvukzrumnrytmpko
Not a fan of trying to parse the host of out of the url and doing so with an unwrap here. Can we just remove that? Internally, we also wouldn't actually want this as-is, since hiding the URL makes the message pretty useless as a whole. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned that we're really far away from what you're looking for here. Its gonna be hard to reconcile those two things with small adjustements.
--
I think I'd go a different route here. I'd suggest adding an event (in buck2.proto) for this and then just dispatching that to the log file but not producing any user-visible output. This way we'd retain debug capabilities without actually being verbose.
This is unfortunately a bit tricky to do because this code gets invoked (mostly) in places that don't have an event dispatcher set. There is a bit of prior art to this, which is to just dispatch to all connected commands (which is mostly fine as far as debugging is concerned):
https://github.com/facebook/buck2/blob/main/app/buck2/src/panic.rs#L163-L176
Unfortunately this still not great because some places don't have that either (the log upload places, mostly)
So I think what I'd suggest doing is:
- Create a trait for dispatching this warnng
- Have an implementation that fires off an event like noted above,
- Use it in materialize/http.rs
- Have an implementation that just logs
- Use it in manifold.rs
b.as_secs(), | ||
http_error | ||
); | ||
let secs = b.as_secs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we maybe check elapsed + this backoff, and then use that and the 30 second threshold? I'm a bit concerned about not showing anything until we start facing a very large backoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also note that I can' find any places that retry for 30 seconds so this really just amounts to just deleting the log message.
tracing::warn!( | ||
host = host, | ||
retries = retries, | ||
"Egregious HTTP backoff - retrying slow request in {} seconds", | ||
secs, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can we avoid calling the backoff "egregious"?
- Also, do note we are not retrying a slow request here we're rather retrying a failed one
- We shouldn't hide the error here
I'm confused, do you mean the whole URL or just the host? Because in the open source version (I don't know if error propagation is different inside Meta), there was NEVER any indication of what the URL or the host was when this warning popped up, even before I wrote this patch; the only thing you saw was typically just the Debug/display usage on a big Actually, the only reason I knew that my initial throttling was from crates.io wasn't because of the message; it was because there weren't any other hosts I was downloading from when bootstrapping a buck2, so it was pretty clear there was only one culprit (and also the only host where I would reasonably be downloading like, 150 things) I also don't think the URL actually is important, and I deliberately chose to only show the host here. I also chose not to show the http error itself anymore, for the same reason. What reason would that be? Well, there are actually few:
This is generally how I approach retriable errors: you should retry silently until it is pretty obviously visible, and only fail loudly when you can no longer do that. That said, I know not everyone will agree with that point; some people always want the full context even on warnings and the host may not seem like enough information. So if you disagree with this I understand, but it wasn't entirely random, either. |
Also, I guess this doesn't cover the case where a host can shitcan you before 30 seconds even occurs, so yeah, the window is totally arbitrary. I'll take any solution that doesn't spam 10 logs a second into superconsole and make the build UX look awful, though, and that doesn't distract me or draw my attention away from relatively benign issues... |
Alright, this does look like a better plan of attack (I somehow missed this comment in favor of the OG one, I guess since I clicked through on my email to reply.) I'll drop this for now and see if I can get this implementation working, thanks! |
@thoughtpolice somewhat unrelated but I noticed this comment:
That doesn't sound very good, and I don't think we see behavior like that on other platforms. I know the console rendering works quite different on windows (aiui it uses some actual like control commands instead of encoding them in control codes in the output stream), so it's not entirely surprising that it's different (though aiui, windows terminals may support terminal control codes these days). Anyway, have you reported this behavior as a separate issues anywhere? If you could do that that'd be great (and if you could get like a screen capture of it, even better). |
Summary: When downloading from sites like crates.io, we can pretty easily get rate-limited when a lot of our crates are uncached. Don't be so aggressive about warning; most small backoffs won't be too noticeable.
As a note, we don't emit the error message we get from the callstack here in the message; the logic is that it's already a retriable error, so it doesn't make much sense to spam an error that we'd probably get again anyway, and that we ultimately are going to re-attempt. This should just make the UX a little more polished.
Also emit the domain name we're hitting problems with.
Test Plan: Change the new warning conditional from 30 seconds to zero seconds. Build buck2 with buck2. Observe the new error message, which helpfully points out we're connecting to crates.io. Change the conditional back to 30 seconds. Rebuild buck2 with buck2. Note that no warnings appear, because the exponential backoff only results in a few small 2-second retries.
GitHub Issue: Fixes #316