-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: performance problem with many Cgo calls #19574
Comments
I haven't looked at the code. Is this all calls from Go into C? Are there any callbacks from C to Go? You say that the "Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB". Combining that with the number of operations being done per second, would it be accurate to say that the C code itself, after called from Go, takes more than 20 microseconds to run? In Go 1.8 when a goroutine calls into C, it is still holding a GOMAXPROCS slot and blocking other goroutines from running. Only if it is running in C for more than 20 microseconds or so will the system monitor thread decide that it is blocked in C code and activate another goroutine. The fact that your system performs better than you increase GOMAXPROCS makes me suspect that there is something to improve in that area of the code. |
The code is all calls from Go into C. There are no callbacks from C to Go. This is true of both the reproduction code and CockroachDB.
Yes. The actual RocksDB write batch operation involves a disk write, sometimes followed by an fsync.
I've been perusing that Go scheduler code. The 20us time is the minimum time sysmon will sleep for. The maximum delay is 10ms and a small bit of instrumentation (measuring the time around the Changing the max
Unfortunately, this didn't result in any improvement to the |
Let me chime in. ( I came from #21827 (comment) and the sympthoms are similar to what is described here so I decided to continue this thread instead of creating new issue. Please correct me if I am wrong. ) First some context: Similarly to Cockroach who is using CGo wrappers for RocksDB I observed performance anomalies while using CGo wrappers for SQLite where presense of other goroutines combined with Cgo calls on "main" one show big overhead: github.com/mattn/go-sqlite3 uses CGo and performs several CGo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call With Go1.10 avoiding to spawn that interrupt goroutine, if we know the context cannot be canceled, brings ~ 1.5x speedup to Exec (mattn/go-sqlite3#530). However Query was made faster only by 5% because after 2b283ce database/sql itself always spawns a goroutine to close Rows on context or transaction cancel. Avoiding to spawn The theme here is that CGo calls show much higher overhead when performed not under simple only-one gorouitine scenario but in the presence of other goroutines. So let's make some analysis: First, on my machine I can observe that under simple 1-goroutine-only scenario the cost of making 1 trivial CGo call is ~ 60ns, it scales linearly to the number of calls, it is independent of whether GOMAXPROCS is 1 or not, and it is ~ 35x slower compared to making a trivial Go call:
Then we can check what happens in the presence of other goroutines by simulating client-server requests over channel with client and server running in its own goroutines and server sequentially handling requests (so 2 goroutines in total) via making 1 trivial either Go or CGo call:
what can be seen here is that for GOMAXPROCS=1 δ(Go1,CGo1) ≈ t(CGo call), both Go and CGo RTT times are bigger for GOMAXPROCS!=1 with δ(Go1-4,CGo1-4) also growing to ~80ns. It is somewhat understandable that processing time increases with GOMAXPROCS!=1 even for Go-only case because sometimes there now can be inter OS-threads communications, so in the context of this issue there is probably no problem here (though those 2 goroutines run almost not overlapping with each other and thus there should not be a need to put them to different OS threads). However δ(Go1,CGo1) increase over GOMAXPROCS=1 case suggests there might be something else going on. 20ns is however small and might be all noise, so let's recheck what happens when server performs 10 calls in the handler:
Go10, Go10-4 and CGo10 show time increase proportional to ~ 9·t(corresponding call). However CGo10-4 (i.e. CGo case with GOMAXPROCS!=1) is now 2x slower than CGo10 (i.e. the same case with GOMAXPROCS=1) by ~1µs. The difference in between CGo10 and CGo10-4 can be attributed to inter OS-thread communications: the strace for CGo10 case is almost empty and contains only 10ms sleeps from sysmon during active phase:
However CGo10-4 shows different picture under strace - during active phase there is now constant high-rate churn of futex wait and wake in between 2 OS threads with 1-10 millisecond sleeps from sysmon also showing seldomly:
This strace suggests that for CGo10-4 the scheduler somehow decided to keep one goroutine on one OS thread and another goroutine on another OS thread and then channel communications in between them somehow becomes futex ops which bring the overhead. The draft notesleep spinning patch from #21827 (comment) can mitigate the CGo10-4 problem somewhat:
However it is not clear why the fact of making 10 CGo calls in sequence affected the scheduler to make this decision: the individual CGo call time is well under 20µs - the time, if I understood correctly, after which sysmon decides that something is hanging in a CGo or syscall for too much and removes current OS thread from a GOMAXPROCS slot. The fact that sysmon is sleeping for >1ms in strace also confirm that sysmon does not intervene here. I do not know Go scheduler insides, but clearly making several CGo call in sequence shakes it enough to bring some disorder and slowdown. For the reference when I was seeing the problem with SQLite my query time was ~ 5-6µs if I use the go driver directly and with standard go func() {
select {
case <-ctx.Done():
// call some cancel
case <-done:
// work finished ok
}
}()
// perform query
close(done) spawned around query with ctx never being canceled it was increasing the time to ~10µs. In other words it can be not only 1µs of additional penalty but higher. In summary, like @petermattis said it looks like CGo calls sequence trigers some badness in Go scheduler that would be good to understand and hopefully fix. Thanks beforehand, /cc @ianlancetaylor, @dvyukov, @rsc, @aclements, @bcmills, @mattn (cgo.go) package cgotest
// int argadjust(int arg) { return 3 + arg; }
import "C"
// XXX here because cannot use C in tests directly
func cargadjust(arg int) int {
return int(C.argadjust(C.int(arg)))
} (cgo_test.go) package cgotest
import "testing"
//go:noinline
func goargadjust(i int) int {
return i + 3
}
// BenchmarkCallGo1 measures overhead of 1 Go call.
func BenchmarkCallGo1(b *testing.B) {
for i := 0; i < b.N; i++ {
x := goargadjust(i)
if x != i + 3 {
b.Fatalf("goargadjust -> %d ; want %d", x, i + 3)
}
}
}
// BenchmarkCallGo10 measures overhead of 10 Go calls performed sequentially.
func BenchmarkCallGo10(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < 10; j++ {
x := goargadjust(i)
if x != i + 3 {
b.Fatalf("goargadjust -> %d ; want %d", x, i + 3)
}
}
}
}
// BenchmarkCallCGo1 measures overhead of 1 CGo call.
func BenchmarkCallCGo1(b *testing.B) {
for i := 0; i < b.N; i++ {
x := cargadjust(i)
if x != i + 3 {
b.Fatalf("cargadjust -> %d ; want %d", x, i + 3)
}
}
}
// BenchmarkCallCGo10 measures overhead of 10 CGo calls performed sequentially.
func BenchmarkCallCGo10(b *testing.B) {
for i := 0; i < b.N; i++ {
for j := 0; j < 10; j++ {
x := cargadjust(i)
if x != i + 3 {
b.Fatalf("cargadjust -> %d ; want %d", x, i + 3)
}
}
}
}
// ---------
type req struct {
respq chan *resp
arg int
}
type resp struct {
ret int
}
// client performs one request-reply cycle to a server over srvlink.
func client(srvlink chan *req, arg int) int {
rc := make(chan *resp)
srvlink <- &req{
respq: rc,
arg: arg,
}
ret := <-rc
return ret.ret
}
// server receives requests over inq, passes received request to
// handler, and returns result back over response channel.
//
// server runs until it processes all requests from inq.
type server func(inq chan *req, handler func(int) int)
// seqServer sequentially handles requests received over inq.
func seqServer(inq chan *req, handler func(int) int) {
for r := range inq {
r.respq <- &resp{ret: handler(r.arg)}
}
}
// goServer concurently handles requests received over inq.
func goServer(inq chan *req, handler func(int) int) {
for r := range inq {
r := r
go func() {
r.respq <- &resp{ret: handler(r.arg)}
}()
}
}
// benchmarkRTT measures client-server round-trip latency for a particular
// server implementation.
func benchmarkRTT(b *testing.B, srv func(chan *req)) {
inc := make(chan *req)
go srv(inc)
for i := 0; i < b.N; i++ {
client(inc, i)
}
close(inc)
}
// --------
// mkServerGo1 returns function that will runs srv with N Go calls per request.
func mkServerGo1N(srv server, n int) func(chan *req) {
return func(inq chan *req) {
srv(inq, func(arg int) int {
for i := 0; i < n; i++ {
arg = goargadjust(arg)
}
return arg
})
}
}
// mkServerCGoN returns function that will run srv with N CGo calls per request.
func mkServerCGoN(srv server, n int) func(chan *req) {
return func(inq chan *req) {
srv(inq, func(arg int) int {
for i := 0; i < n; i++ {
arg = cargadjust(arg)
}
return arg
})
}
}
func BenchmarkRTTSeqServerGo1(b *testing.B) { benchmarkRTT(b, mkServerGo1N(seqServer, 1)) }
//func BenchmarkRTTGoServerGo1(b *testing.B) { benchmarkRTT(b, mkServerGo1(goServer)) }
func BenchmarkRTTSeqServerCGo1(b *testing.B) { benchmarkRTT(b, mkServerCGoN(seqServer, 1)) }
//func BenchmarkRTTGoServerCGo1(b *testing.B) { benchmarkRTT(b, mkServerCGoN(goServer, 1)) }
func BenchmarkRTTSeqServerGo10(b *testing.B) { benchmarkRTT(b, mkServerGo1N(seqServer, 10)) }
func BenchmarkRTTSeqServerCGo10(b *testing.B) { benchmarkRTT(b, mkServerCGoN(seqServer, 10)) }
//func BenchmarkRTTGoServerCGo10(b *testing.B) { benchmarkRTT(b, mkServerCGoN(goServer, 10)) } |
From run time perspective both RTTSeqServerCGo1 and RTTSeqServerCGo10 do exactly the same -- call a long sequences of cgo calls. From run time perspective the inner loop does not make any difference. |
@dvyukov thanks for feedback.
There is a difference - CGo1 case is like this:
while CGo10 is like:
every CGo call triggers at least runtime.entersyscall / runtime.exitsyscall so there is a difference in runtime call sequences in between CGo1 and CGo10 - channel sends operations are also there and the difference is in how many cgo calls are made in between server receives a request and sends a reply for it back. I was running the tests with The only mistake from me in previous runs was that the turbo was not disabled so now to recheck I disable it with
i.e. the timings are stable, there is no outliers and it is
this covers "scaling/turbo/overheat and "running for longer". Now let's check setting affinity. First: just setting GOMAXPROCS=2 does not make much difference to default GOMAXPROCS=4:
but CGo10-2 overhead goes away if I set test process affinity to only two first CPUs:
probably because setting affinity removes many context switches:
and it was already discussed in the second half of #21827 (comment) how HOWEVER Setting affinity cannot work in practice - if we imagine a server which for every request spawns separate goroutine, and the CGo calls have to be done from under that request-handling goroutine, we now have N(goroutines) > Ncpu. Since we cannot set affinity of a goroutine - at least we cannot do so at present with public API - the only choise is to let the whole process use all availble CPU resources dedicated to service and oops -> context switches could be happenning all the time. |
Ah, sorry. I misread the test. I thought it's just cgo calls. |
np, and thanks for feedback. Looking forward to what might be done about it. |
There were several projects which were being rewritten in C or C++, because of cgo overhead. Can somebody try to work on this? Thank you. |
Is there any update on this? I'm also having this problem too. For my use case, I'm using https://github.com/cisco/openh264, which is written in C++, to encode every frame that I get from my camera through https://en.wikipedia.org/wiki/Video4Linux. |
There are no updates to this issue. If there are updates, they will be reported here. Note that your profile is not showing you that |
Ah I see, I'm sorry for the misunderstanding. I'll try to import github.com/ianlancetaylor/cgosymbolizer. Thanks. |
FYI I ran the test in the OP on 1.17 & cgo performance roughly matches go performance (and may be slightly better?) |
I also ran navytux's test, and the cgo overhead does still roughly double when GOMAXPROCS>1 when there are go operations in flight for his RTT tests, but no evidence of anything with as large of impact as the OP. |
I do think the overhead, such as it is, is likely related to #21827. Goroutines that interact with cgo are likely to end up, via brownian motion, in the same position as an OS-locked thread: a single G on a single M/P. When the thread parks to collect info, its M is parked in turn. We don't end up in this case when GOMAXPROCS=1, because cgo can't subtly move the G onto its own M/P, it has to keep squishing the cgo G back into the single P when it comes back from syscall, so the subsequent G park when it reads off channel doesn't cause problems. I think broadly there's an issue where parking an M is intended to happen when there is clearly not enough work around to sustain an entire M, but there are a few cases (os-locked threads, threads that talk to cgo a lot) where there's good reason for a G to be hanging out on its own. It would probably be better if os-locked threads never parked their M and then we could os-lock threads if we are talking to cgo a lot and have a problem with this behavior. EDIT: Alternatively go could just be way more amenable to cramming G's coming back from syscall into an existing P but the performance impact of that seems hard to predict. |
Here is a new scenario that make me concern the performance a lot: Low performance will pop up when there is a large amount of data, with high CPU usage (10%+, task manager) which could be 10+ times than rust/c developed similar apps. An alternative way to Read/Write directly is using a big enough buffer. By piling up packets data and then Read/Write in one time, it definitely reduces the context switching costs. But, it has disadvantages and does not 100% work:
It is really a pity to see it is unplanned! |
Would probably help if you got a self-contained example like in the OP and opened a new issue. The example in this one no longer produces the results linked in the OP, so I would not expect to see this task addressed, but your new one might be. |
@CannibalVox |
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
go version
)?go version go1.8 darwin/amd64
, though I've also verified this happens on Linux.What operating system and processor architecture are you using (
go env
)?Background
CockroachDB internally uses RocksDB (a fork of LevelDB) for data storage. We access RocksDB via a small Cgo wrapper. All writes to RocksDB are performed using
WriteBatches
which are first written to a write-ahead-log and then inserted into the database. Internally, RocksDB "groups" multiple concurrent write batches together into a single append to the write-ahead-log. While investigating performance, we noticed that the size of the groups was smaller than expected.What did you do?
TL;DR? If write batch grouping is performed in Go performance is good. If it is performed in C++ (either by code that is nearly identical to the Go grouping or by RocksDB) performance is bad.
cockroachdb/cockroach#14138 started as an experiment to replace the grouping of write batches in RocksDB with our own grouping in Go to try and understand the smaller than expected group sizes. In addition to fixing the group sizes, it improved performance on one experiment by 100% while reducing latencies. This was unexpected as the Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB. In order to point the finger at RocksDB, we reimplemented the grouping of write batches in our Cgo RocksDB wrapper. Performance was equivalent to RocksDB performing the grouping.
I've provided a stripped down reproduction scenario at https://github.com/petermattis/batchtest. Running on my Mac laptop I see:
By default,
batchtest
uses 100 concurrent worker threads writing "batches". My suspicion is thatbatchtest
and CockroachDB are tickling some badness in the Go scheduler. If I setGOMAXPROCS
to the number of concurrent workers the cgo performance gets much closer to the Go performance:cockroachdb/cockroach#14138 is an acceptable workaround for committing batches, but it would be great to understand the performance discrepancy here. We're concerned about other Cgo operations in CockroachDB that don't have such easy workarounds.
The text was updated successfully, but these errors were encountered: