runtime: cgocall is low performance for Conn on Windows #58336

Comments
There's not really a whole lot of information here: you said you'd previously looked at it in pprof. What did you run with pprof, and can you link the output? I'd also consider changing the package in the title.
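For reference, a minimal way to capture such a CPU profile (a sketch; the output filename and the placeholder workload are arbitrary choices, not from this thread):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile to cpu.pprof covering the whole run;
	// inspect it afterwards with: go tool pprof cpu.pprof
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runWorkload() // placeholder: the proxy / download loop being measured
}

func runWorkload() {}
```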
Does this reproduce when downloading large files from http?
I have constructed a short simple proxy that can help reproduce the issue.
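The linked proxy itself isn't reproduced here; below is a minimal sketch of that kind of forwarder (the listen and upstream addresses are placeholders), which spends nearly all of its time in Conn.Read/Conn.Write and so drives the WSARecv/WSASend path under load:

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Accept local connections and forward them to an upstream address.
	ln, err := net.Listen("tcp", "127.0.0.1:8080") // placeholder listen address
	if err != nil {
		log.Fatal(err)
	}
	for {
		c, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(src net.Conn) {
			defer src.Close()
			dst, err := net.Dial("tcp", "example.com:80") // placeholder upstream
			if err != nil {
				log.Print(err)
				return
			}
			defer dst.Close()
			// Each direction is a Read/Write loop, which on Windows ends up
			// in WSARecv/WSASend via runtime.cgocall.
			go io.Copy(dst, src)
			io.Copy(src, dst)
		}(c)
	}
}
```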
When the profile tells you that a lot of time is spent in runtime.cgocall, that generally means the time is being spent in non-Go code, here the Windows socket calls. So it's not clear that this is a bug as such. Of course it would be nice to get more complete information. And it is possible that your program is really not spending much time in non-Go code, in which case there may be a bug somewhere.
The problem here is at the OS level, compared to similar programs developed in other languages.
This is not the answer here, I think. Programs in all languages call the same OS network APIs. My description may not cover your concern; please have a try yourself.
It seems like it's possible that WSARecv is making blocking calls, which puts the syscall over the 1us timeout to rejoin the P without penalty, and that will create a fair amount of context switching when called repeatedly. I believe Linux & macOS make non-blocking calls for socket read/write and get results via netpoll? I don't know how I would check that this is the case, though. Obviously net.Conn.Read is blocking, but that doesn't say much about syscall.WSARecv.
Hi @lifenjoiner, do you have CPU usage numbers reported by the OS that you could share for your sample program vs. shadowsocks-rust or one of the other programs?
I will say that if WSARecv were non-blocking, I would expect the vast majority of this program's time to be spent in netpoll rather than in syscall, right? The program is network-bound; the vast majority of its runtime should be spent waiting rather than acting, so if it's all cgo, then it must be waiting in cgo.
Measurements
Base info: both buffer sizes are (should be) 8 KB.
Results: see the screenshots and data in the attachment.
Do you have a tool to measure context switching in there?
Does waiting consume lots of CPU resources? If you go to line 157 in 02d8ebd:
Currently, no.
Context switching rate should be high, per the reference. All data are included in the attachment above.
No; rather, when Go goes into the syscall state, it takes the OS thread with it. My understanding is that it has 1us to return before Go spins up another OS thread to handle the goroutines that were previously being handled by the thread taken for the syscall. The thing is, if there's not enough work to sustain an additional OS thread beyond what it was previously using (which is most cases, since Go is generally using the correct number of threads), Go will generally keep spinning up and shutting down OS threads, causing an obscene number of context switches, which seems to be what's happening here. I would say that Go's handling of goroutines returning from syscall is generally poor when it can't put everything back exactly where it found it. Getting to the point where this code maintained two OS threads due to the usage patterns in play would probably be ideal, even though making these calls non-blocking would be a quick fix.
By the way, there's another case like this (poor decisions about when to spin down OS threads causing insane context switches) that I'm aware of and that isn't cgo related: #21827. In that case, parking an OS-locked thread with sleep or a blocking channel read causes the OS thread to be spun down since it isn't doing any work, the goroutine is moved to a thread where it can't run, and then the locked thread has to be awoken when the goroutine unparks, causing two context switches.
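To make that pattern concrete, here's a sketch of the kind of code that runs into it (the jobs channel and worker are placeholders, not taken from #21827):

```go
import "runtime"

// startPinnedWorker runs jobs on a goroutine locked to its own OS thread.
// Each time the jobs channel is empty, the goroutine parks on the receive,
// the locked thread is spun down, and it must be woken again for the next
// job, costing extra context switches per handoff.
func startPinnedWorker(jobs <-chan func()) {
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for job := range jobs {
			job()
		}
	}()
}
```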
@lifenjoiner Could you collect a runtime trace (https://pkg.go.dev/runtime/trace#Start) from this example? Just collect for a few seconds, otherwise you'll probably get a huge trace file.
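For example, a small helper along these lines could be dropped into the proxy (a sketch; the output path and duration are arbitrary):

```go
import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

// collectTrace records a runtime trace for the given duration and writes it
// to path; view the result afterwards with: go tool trace <path>
func collectTrace(path string, d time.Duration) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	go func() {
		time.Sleep(d)
		trace.Stop()
		f.Close()
	}()
}
```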
@prattmic Many thanks for diving in! Here is a trace file for 30s while downloading is stable. Method:
Append:
Hey @prattmic, I saw that you were assigned #44343 and it is scheduled for 1.21. If the goal is to deal with tasks that block Windows users from upgrading Go to the latest version, this is a big one. While it raises some interesting high-level questions about the way the scheduler handles things, the task as reported may not be that complicated: network calls on Windows appear to be synchronous, and I believe they need to be made asynchronous in order to stop this behavior. I don't see how it's possible to run a web service on Windows in Go 1.21 without this being fixed.
@lifenjoiner could you run your benchmarks using Go tip? Some weeks ago I submitted an improvement that asyncifies some UDP/TCP operations: CL 463839.
@qmuntal Thanks for your work! Following are my benchmarks.
Base info
Results
Conclusion: it seems there are still more improvements that could be made.
I'm having a hard time interpreting those results. Has there been an improvement on the metric you are using to benchmark this issue? I want to know if CL 463839 goes in the right direction or is something unrelated.
The screenshots are the generated reports. I don't know if CL 463839 is the right direction. My network is not very stable; maybe you can re-measure it yourself.
Prerequisite
You may choose a URL at your convenience.
Compile them yourself. All other files and logs are in the attached zip file.
Data files: CL463839-Vs-NoPatch_10-samples.zip
Analysis
@CannibalVox @lifenjoiner your initial guess seems to be right: there are a lot of context switches due to the large number of cgo calls. I haven't seen any low-hanging fruit in the
I don't understand, I thought sub-1us cgo calls don't cause context switches |
Also #19574 didn't repro when I tried it a few months ago |
@qmuntal Thanks for your efforts! Hoping someone will make more progress. |
Change https://go.dev/cl/562915 mentions this issue: |
@lifenjoiner CL 562915 might fix your issue. It removes the per-syscall lockOSThread call (see the commit message below). To build a Go application using a specific commit, follow these instructions:
Seems better. Great work! @qmuntal |
!!! This seems like a really big deal, @qmuntal, thank you very much!
Windows syscall.SyscallN currently calls lockOSThread for every syscall. This can be expensive and produce unnecessary context switches, especially when the syscall is called frequently under high contention. The lockOSThread was necessary to ensure that cgocall wouldn't reschedule the goroutine to a different M, as the syscall return values are reported back in the M struct. This CL instructs cgocall to copy the syscall return values into the M that will see the caller on return, so the caller no longer needs to call lockOSThread.

Updates #58336.

Cq-Include-Trybots: luci.golang.try:gotip-windows-arm64,gotip-windows-amd64-longtest
Change-Id: If6644fd111dbacab74e7dcee2afa18ca146735da
Reviewed-on: https://go-review.googlesource.com/c/go/+/562915
Reviewed-by: Alex Brainman <alex.brainman@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
If I read your charts correctly, CL 562915 reduced User Time from ~6% to ~2% and Privileged Time from ~50% to ~6%. Given that the Privileged Time is now way less than the high-context-switch threshold (40% according to #58336 (comment)), IMO we can close this issue.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes. It is long standing.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
I think this is a new case of #19574, taking effect at the higher-level Conn and specifically on the Windows platform. It was suggested to file it as a new issue.
This is a long-term user observation.
It is traced into cgocall by pprof (high percentage): on Windows, Conn Read/Write ultimately calls the WSARecv/WSASend APIs through runtime.cgocall.
This is very basic functionality for web servers, proxies, and other network data-transfer tools.
Low performance shows up when there is a large amount of data, with high CPU usage (for example 10%+ in Task Manager, at the OS level), which can be 10+ times that of similar apps developed in Rust/C.
Since the scenario is network data transfer, I don't have a simple/short example that can reproduce the issue. Anyone who can produce one is appreciated.
Proxies are the most typical case (both Read/Write). My example is below in the reply.
Test existing programs if you are interested
Maybe we can get started with some highly rated open-source proxies for now. Here I used the client as a local proxy without any upstream proxies, and wget (single thread) to download through it.
Attention: downloading self-hosted resources on the same machine doesn't help reproduce the difference; loopback MSS = 65495 is adopted.
Go: xray, v2ray, and glider.
Config file: direct.json.zip.
Rust: shadowsocks-rust.
Mitigation
An alternative to calling Read/Write directly is using a big enough buffer. By piling up packet data and then doing Read/Write in one go, it definitely reduces the context-switching cost. But it has disadvantages and does not work 100% of the time:
My experimental local proxy with big buffer.
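The experimental proxy itself isn't reproduced here; as a minimal sketch of the big-buffer idea (the 256 KB size is an arbitrary choice), the copy loop can read and write larger chunks so each WSARecv/WSASend moves more data per call. It only helps when data has actually accumulated in the socket buffer between calls, which matches the caveat above that it does not always work:

```go
import (
	"io"
	"net"
)

// copyBig forwards src to dst through one large intermediate buffer, so each
// Read/Write pair (WSARecv/WSASend on Windows) can move much more data per
// syscall than a small-buffer copy loop would.
func copyBig(dst, src net.Conn) error {
	buf := make([]byte, 256<<10) // 256 KB; the size is an arbitrary choice
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if rerr == io.EOF {
			return nil
		}
		if rerr != nil {
			return rerr
		}
	}
}
```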
I have tried ReadFrom, and observed 4 KB packets being transferred on the loopback interface, triggering high CPU usage.
What did you expect to see?
Lower CPU usage, at the OS level.
What did you see instead?
High CPU usage.