proxy.golang.org: Unusual traffic to git hosting service from Go #44577
CC @hyangah @heschik
Hi @ddevault, thanks for filing the issue.
There could be several sources of this traffic, or a combination of sources, so we will work to narrow this down. It sounds like this issue is focused on lowering the traffic, or at a minimum documenting it so that its purpose is clearer. We'll leave the User-Agent discussion in #44468.

I have a few questions which can help us narrow down the cause:

- You mentioned that you've seen this behavior change "in the past few weeks". Was there a huge spike in traffic that started on a particular day that you know of, or was it gradual to the point where things are now? For example, did it start right when Go 1.16 came out on Feb 16, or were you seeing it earlier than that? This can help us narrow down the root cause.
- Are you seeing this across all of the code on git.sr.ht, or is it focused on a specific module or set of modules?
- Do you have any data about the total volume of requests over a typical hour, or patterns you're seeing? You noted that the traffic you are seeing now is not reasonable; it would be helpful to hear what an acceptable volume of requests would look like to you.

Some thoughts about where this traffic could be coming from:
I also want to note that proxy.golang.org isn't trying to do any crawling. It fetches and pre-caches code that we can be confident users will want, refreshing data that users have already asked for.
I'll give approximate answers, but knowing that these are the questions you want answered, I will be able to collect more specific answers when the behavior is next observable.
This occurred on February 10th and 14th, then at least once per day from the 18th onwards. It seems likely that this coincides with the Go 1.16 release, which notably made some major changes to how modules are used.
If there's any discernible pattern to the repositories affected, it's hard for me to tell. It would help if you could characterize the IP address I gave (74.125.182.164) and share a range of IPs which are also likely to be implicated in the same behavior, so I can extract just the relevant part of the logs.
Hm, in my experience, it happens for a few tens of minutes at a time, at a rate of between 2 and 10 requests per second. Googlebot on its default settings, for example, only makes one request every 5 seconds or so. In any case, if it's an automated process, it should be fetching robots.txt and letting me, the sysadmin of the remote host, configure its crawling parameters. The precise rate is not especially important, but I would like to tune it to a predictable value and then update our network monitoring assumptions so we aren't getting a bunch of false alarms from Go crawling our servers.
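As a concrete illustration of the knob being asked for, a robots.txt entry might look like this (the GoModuleMirror user-agent token and the Crawl-delay directive are assumptions here; Crawl-delay is a widely honored extension, not part of the original robots.txt specification, and the proxy does not currently fetch robots.txt at all):

```
# Hypothetical: ask the Go module mirror to wait 5 seconds between requests
User-agent: GoModuleMirror
Crawl-delay: 5

# All other agents: default access
User-agent: *
Disallow:
```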
This list can be narrowed down: every request has appeared to come from a Google-owned IP address, so we can eliminate behavior from end users (and also hold Google accountable for whatever their servers are up to). I know that godoc.org
Note: it would be helpful if Go's release notes were dated.

This page has dates for every release, if you need them: https://golang.org/doc/devel/release.html

Thanks!
@ddevault Thanks for your fast response. To make it easier to figure out exactly which traffic is coming from us, we're going to go ahead and set the User-Agent for proxy.golang.org requests, per #44468. Then you'll be able to more easily discern which requests are coming from us (rather than filtering by IP), and the logs will hopefully be easier to collect. It's a good thing to do either way, so we'll prioritize it. I'll follow back up here once those changes are in production.

Sounds good, thanks!
With the new User-Agent in place, I can characterize the behavior more concretely now. Over the past hour, I've received 1,912 requests from proxy.golang.org, from IP blocks 74.125.0.0/16 and 173.194.0.0/16. The full list of requests is available here, with columns for the request IP, date, and hash of the module URL.

Redundant requests per IP address are somewhat reasonable: https://paste.sr.ht/~sircmpwn/986d4c2e3f5909385b19adf6fa15bc789bff8708

But redundant requests across all IPs are less so: https://paste.sr.ht/~sircmpwn/b46ad0b13e864923df80cb8e8285bf1661e6f872

There is some room for improvement here.
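The per-IP versus cross-IP redundancy in those pastes can be tallied with a short sketch like the following. The `request` struct and its fields are assumptions for illustration, not the actual sr.ht log format:

```go
package main

import "fmt"

// request is one parsed log line: requesting IP and the hash of the
// module URL. The field layout is an assumption; real logs may differ.
type request struct {
	ip, moduleHash string
}

// redundancy counts how many requests beyond the first were made for
// each module, first per (IP, module) pair and then across all IPs.
func redundancy(reqs []request) (perIP, global int) {
	byIPMod := map[request]int{}
	byMod := map[string]int{}
	for _, r := range reqs {
		byIPMod[r]++
		byMod[r.moduleHash]++
	}
	for _, n := range byIPMod {
		perIP += n - 1
	}
	for _, n := range byMod {
		global += n - 1
	}
	return perIP, global
}

func main() {
	reqs := []request{
		{"74.125.0.1", "mod-a"}, {"74.125.0.1", "mod-a"}, // same IP repeats
		{"173.194.0.9", "mod-a"},                         // another IP, same module
	}
	perIP, global := redundancy(reqs)
	fmt.Println(perIP, global) // 1 redundant request per-IP, 2 globally
}
```

The gap between the two numbers is what the second paste shows: different proxy instances re-requesting the same module independently.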
Some specific recommendations:
I don't mind the volume if it's legitimate traffic, but an effort should be made to reduce redundant load, and to give the sysadmins some knobs to control the load.

Another question: do you make a fresh clone every time, or do you keep the repo around and fetch only the difference? You should maintain an LRU cache of clones and freshen them up on subsequent requests.
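The suggested LRU cache of clones could be sketched as below. This is a hypothetical illustration, not how the proxy works: the git operations are stubbed out as comments so only the eviction logic remains, and a real implementation would run `git clone --bare` on a miss and `git fetch` on a hit:

```go
package main

import (
	"container/list"
	"fmt"
)

// cloneCache is a sketch of an LRU cache of repository clones.
type cloneCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // module path -> list node
}

func newCloneCache(capacity int) *cloneCache {
	return &cloneCache{capacity: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// get returns "fetch" when a cached clone was freshened, or "clone" when
// a new clone had to be made (possibly evicting the least recently used).
func (c *cloneCache) get(module string) string {
	if el, ok := c.items[module]; ok {
		c.order.MoveToFront(el) // would run `git fetch` in the cached repo
		return "fetch"
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		delete(c.items, oldest.Value.(string)) // would remove the on-disk clone
		c.order.Remove(oldest)
	}
	c.items[module] = c.order.PushFront(module) // would run `git clone --bare`
	return "clone"
}

func main() {
	c := newCloneCache(2)
	fmt.Println(c.get("git.sr.ht/~user/a")) // clone
	fmt.Println(c.get("git.sr.ht/~user/a")) // fetch: only the delta transfers
	fmt.Println(c.get("git.sr.ht/~user/b")) // clone
	fmt.Println(c.get("git.sr.ht/~user/c")) // clone, evicts a
}
```

A repeat fetch against a warm clone transfers only new objects, which is the bandwidth saving being argued for.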
@ddevault

Thanks. Would also appreciate answers to my more specific questions when you have the opportunity.
@ddevault
The short answer is that yes, we make a fresh clone every time. I want to provide more information below to clarify this: it's not the case that a

However, we suspect that the majority of the traffic you are experiencing is not due to module resolution of new modules, but instead from our refresh jobs. These already know the resolved path, so the go command doesn't need to do this resolution. For example,

From proxy.golang.org's perspective, we shell out work to the go command, and it's up to the go command to decide how best to retrieve the information and pass it back to us. So the idea of keeping a cache of clones around isn't practical, nor would it help the go command. However, something we can do right now is improve our refresh jobs to help with load, so we're going to look into that.
It doesn't seem reasonable to change the go command, but I would argue that proxy.golang.org is in a unique position among users of the go command, and as such it seems reasonable to suggest an improved implementation which better suits its unique setting. If not, Go is just shoving the complexity burden onto software hosting services, in the form of wasteful and redundant requests.
Sounds like a good start.

Hi @ddevault. I wanted to give an update: we've gone ahead and improved our refresh jobs, which should lead to less duplication of request traffic to origin servers. This could yield a 2-3x drop in requests.

Thanks!

I can confirm that the load is reduced, but it is still a bit heavy, all things considered.

Actually, I went to quantify my impression of a reduced load and found that it has not changed much at all; it has gotten worse in some respects. Fresh data: in the past hour, we received about 2,500 requests from a GoModuleMirror User-Agent. The new breakdowns by IP and module are here:
https://paste.sr.ht/~sircmpwn/8636039e4bff971f8b9028d22ad05984f4e7a24c
https://paste.sr.ht/~sircmpwn/4f4636fed5f672aa3cccca527b95476fddef3ca5
Gah, I jumped the gun here. Sorry. This log spans an hour, not a minute.

Update: it looks like the traffic was a huge burst from an unrelated source, and the Go proxy had the bad luck to submit a large batch of requests on the tail end of those logs. I then collected all of the recent requests from that IP range and misread the timestamps. Sorry for the noise.
Yesterday, GoModuleMirror downloaded 4 gigabytes of data from my server, requesting a single module over 500 times (log attached). As far as I know, I am the only person in the world using this Go module. I would greatly appreciate some caching, or at least rate limiting. For reference, here's a bandwidth graph. It seems to have started about a week ago.
Thanks for letting us know @BenLubar. We're continuing to look into this, and appreciate the extra details.

Please re-prioritize this: if any organization with more accountability than Google had been DDoSing hosting providers since February, it would be front-page news and their ISP would have cut them off.
@ddevault We are taking this seriously, and have been taking strides to improve this since it was reported to us. For transparency: we spent a while discussing options for how we can approach this internally on Friday of last week. The fix for this isn't straightforward, and impacts all users of the proxy, not just origin servers.

We have a job which runs at regular intervals and fetches fresh data from origin servers, to make sure that end users have a good experience when they try to download Go source code. If we change this, users will have to wait longer while we fetch things on the fly, or receive stale data, which can greatly slow down developers' builds. Currently, a single request for a module may cause refetch traffic for several days after; that may be what you are experiencing.

One idea we've been discussing is to make our job only issue refresh requests if the module is deemed "popular" enough (e.g. the module has been requested 100 times in the last week). However, this is going to require some re-architecting and database changes, so it is taking some time to work through.

In the meantime, if you would prefer, we can turn off all refresh traffic for your domain while we continue to improve this on our end. That would mean that the only traffic you would receive from us would be the result of a request directly from a user. This may impact the freshness of your domain's data which users receive from our servers, since we need some caching on our end to prevent fetching too frequently. We can do the same thing for @BenLubar's domain if preferred.
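The "popular enough" policy described above can be reduced to a one-line predicate over a request-count table. In this sketch the 100-request threshold comes straight from the example in the comment; the function name, the precomputed counts, and everything else are hypothetical:

```go
package main

import "fmt"

// shouldRefresh reports whether a module is popular enough to keep in
// the background refresh job, given per-module request counts over the
// last week. Modules below the threshold are only fetched on demand.
func shouldRefresh(requestsLastWeek map[string]int, module string, threshold int) bool {
	return requestsLastWeek[module] >= threshold
}

func main() {
	counts := map[string]int{
		"popular.example/mod": 250,
		"niche.example/mod":   3,
	}
	fmt.Println(shouldRefresh(counts, "popular.example/mod", 100)) // true: keep refreshing
	fmt.Println(shouldRefresh(counts, "niche.example/mod", 100))   // false: serve on demand only
}
```

The trade-off the comment describes is visible here: a niche module (like the one @BenLubar hosts) drops out of the refresh job entirely, at the cost of staler cached data for its few users.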
Greater transparency and communication on the issue would be appreciated, and would go far towards improving the optics of this problem.

Have you considered the robots.txt approach, which would simply allow the sysadmin to tune the rate at which you scrape their service? The best option puts the controls in the hands of the sysadmins you're affecting; this is what the rest of the internet does.

Also, this probably isn't what you want to hear, but maybe the proxy is a bad idea in the first place. For my part, I use
That would be helpful, thanks.

@BenLubar - no, it hasn't been applied yet. It required some code changes on our end, which took a few days. You should be able to expect it to be applied by the end of this week (likely sooner). I'll let you know if that's not the case.

The changes are now in prod to no longer send refresh traffic to the
What's the state here? I read that sr.ht is still experiencing high traffic caused by Go's "proxy" feature. Just thinking out loud: couldn't the crawlers use a Redis DB or similar to store the URL and datetime of the last clone, with entries cleaned after 24 hours? For a big project like Go, this fix should be trivial and would also save resources on Go's side.
Anyone who's receiving too much traffic from proxy.golang.org can request that they be excluded from the refresh traffic, as we did above.

We did consider caching clones, but it has security implications and adds complexity, so we decided not to. It is certainly not trivial to do, and not something we are likely to do based on this issue.

Since there hasn't been activity on this issue in nearly a year, I'm going to close it. Anyone who wants to be excluded from refresh traffic can file a new issue.
[Without any Google hat, since I left the company earlier this month.] I believe the operator of
This seems to be a reasonable mechanism to provide some control of the refresh rate, rather than a binary choice. |
@tomberek Using robots.txt would indeed scale better than a Go-specific, manual exclusion process (i.e. filing a bug to ask the devs to manually update an exclusion list). It's ultimately up to Google and the Go devs to pick a process, but one that scales well would be in the mutual interest of module hosts and the Go devs.

Also, a heads-up: this issue was just featured on YCombinator/HackerNews: https://news.ycombinator.com/item?id=31508000
I see. I wasn't aware of the blog post or the HN item, thank you. That explains the sudden attention. The proxy performs a mix of user-initiated traffic and refresh traffic.

Subsequent to this bug, there has been exactly one more request to suppress traffic to a host, #51284. It doesn't make sense to me to scale or otherwise improve a process with that little usage. That opinion could of course change if we get more requests.
Standardization is worth extra work. Putting that aside, however, this could be trivially hacked on by periodically checking the robots.txt of domains in the refresh queue, then conditionally adding them to the list.

It being a Go-specific process could contribute to the lack of requests, which would significantly affect this metric.
Following up from #44468: I run a git hosting service which has, in the past few weeks, received elevated levels of traffic from Google-owned IP addresses performing git clones. The shape of the requests is something like this:
What is the purpose of this traffic?
If it's crawling, it should set an appropriate user-agent and respect robots.txt. The traffic is coming in at a rate which I would not consider reasonable for a crawler, up to several times per second - and git clones are more expensive than other HTTP requests.