Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync/atomic: Pointer 4 times peformance degradation based on layout of structs #67764

Closed
dkropachev opened this issue Jun 1, 2024 · 11 comments
Closed
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance

Comments

@dkropachev
Copy link

Go version

go version go1.22.3 linux/amd64

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/dmitry.kropachev/.cache/go-build'
GOENV='/home/dmitry.kropachev/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dmitry.kropachev/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dmitry.kropachev/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/dmitry.kropachev/go/go1.22.3'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/dmitry.kropachev/go/go1.22.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.3'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/extra/dkropachev/go-auxiliaries/benchs/mutex-in-struct/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1846860344=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Consider following snippet

type State struct {
	l []string
	i int
}

type Mutex[T any] struct {
	mut sync.RWMutex
	val atomic.Pointer[T]
}

func (d *Mutex[T]) Read() T {
	return *d.val.Load()
}

By some wierd reason this code gives 5x performance degradation in parallel Mutex.Read workload comparing to following code:

type State struct {
	l []string
}

type Mutex[T any] struct {
	mut sync.RWMutex
	val atomic.Pointer[T]
}

func (d *Mutex[T]) Read() T {
	return *d.val.Load()
}

Or even to this code:
code:

type State struct {
	l []string
	i int
}

type Mutex[T any] struct {
	mut sync.RWMutex
	val atomic.Pointer[T]
}

func (d *Mutex[T]) Read() T {
	return *d.val.Load()
}

In short, somehow existance of [] string and i int in a State in conjunction with having sync.RWMutex on Mutex gives 5x pefromance degradation in parallel atomic.Pointer.Load workload.
If you to add variable to either State or Mutex or replace sync.RWMutex with sync.Mutex this perfomance degradation vanishes.

This test produced consistent results:

# go test -v ./... -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/go-auxiliaries/perf-geek/mutex-in-struct/test
cpu: 12th Gen Intel(R) Core(TM) i9-12900HK
BenchmarkMutexWith_State
BenchmarkMutexWith_State-12             1000000000               0.4482 ns/op          0 B/op          0 allocs/op
BenchmarkMutexWith_State2
BenchmarkMutexWith_State2-12            1000000000               0.07442 ns/op         0 B/op          0 allocs/op
BenchmarkMutexWithInt_State
BenchmarkMutexWithInt_State-12          1000000000               0.07484 ns/op         0 B/op          0 allocs/op
BenchmarkMutexWithInt_State2
BenchmarkMutexWithInt_State2-12         1000000000               0.07413 ns/op         0 B/op          0 allocs/op
PASS
ok      github.com/go-auxiliaries/perf-geek/mutex-in-struct/test        0.745s

What did you see happen?

Performance of atomic.Pointer.Load() degrades

What did you expect to see?

No performance degradation4

@dkropachev
Copy link
Author

dkropachev commented Jun 1, 2024

More info on it here.
Also want to point out that on asm, as expected, code looks the same.

@randall77
Copy link
Contributor

I get different behavior on my machine.

goos: darwin
goarch: arm64
BenchmarkMutexWith_State-24        	1000000000	         0.1423 ns/op
BenchmarkMutexWith_State2-24       	1000000000	         0.1620 ns/op
BenchmarkMutexWithInt_State-24     	1000000000	         0.1389 ns/op
BenchmarkMutexWithInt_State2-24    	1000000000	         0.04106 ns/op

My guess is this has a lot to do with how allocations land within cache lines and whether any false sharing ensues. It might be instructive to make sure that atomic.Pointers are flanked on either side by a cache line of padding, and see if that changes anything.

The difference in speed goes away with GOMAXPROCS=1.

@Jorropo Jorropo changed the title atomic.Pointer: 4 times pefromance degradation in a weird case atomic: Pointer 4 times peformance degradation based on layout of structs Jun 2, 2024
@Jorropo
Copy link
Member

Jorropo commented Jun 2, 2024

I had a not as bad regression:

goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5825U with Radeon Graphics         
BenchmarkMutexWith_State-16        	1000000000	         0.1842 ns/op
BenchmarkMutexWith_State2-16       	1000000000	         0.07826 ns/op
BenchmarkMutexWithInt_State-16     	1000000000	         0.07868 ns/op
BenchmarkMutexWithInt_State2-16    	1000000000	         0.07899 ns/op
PASS
ok  	command-line-arguments	0.470s

Adding a cache-line-pad fixed it:

 type Mutex[T any] struct {
         mut sync.RWMutex
+        _ [1]byte
         val atomic.Pointer[T]
 }
goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5825U with Radeon Graphics         
BenchmarkMutexWith_State-16        	1000000000	         0.07821 ns/op
BenchmarkMutexWith_State2-16       	1000000000	         0.07985 ns/op
BenchmarkMutexWithInt_State-16     	1000000000	         0.07818 ns/op
BenchmarkMutexWithInt_State2-16    	1000000000	         0.07892 ns/op
PASS
ok  	command-line-arguments	0.356s

You might need a different size of array based on your CPU model or how the winds are blowing.

A micro benchmark like this is extremely narrow and doesn't capture accurately the complete cost of .Read as this will depend a lot on how code around this is used, here given you drop the value this is a "best" case scenario as there are no dependencies on the memory reads.


I don't think we want to fix anything here, adding extra padding is extremely runtime-edge-case-specific, makes structs bigger (slower to alloc, use more memory).

I think it's fine for users to fix alignment issues, see the discussion to use structs package to specify alignment of a field.

@seankhliao seankhliao added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 2, 2024
@seankhliao seankhliao changed the title atomic: Pointer 4 times peformance degradation based on layout of structs sync/atomic: Pointer 4 times peformance degradation based on layout of structs Jun 2, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jun 2, 2024
@dkropachev
Copy link
Author

dkropachev commented Jun 2, 2024

@randall77 , @Jorropo , thanks a lot for looking at it.

@Jorropo, could you please point me to the discussion about using structs package to specify alignment of a field ?

@Jorropo
Copy link
Member

Jorropo commented Jun 2, 2024

I don't have links on top of my head, it was in slack or discord.
Some people are interested in something like this:

 type Mutex[T any] struct {
 mut sync.RWMutex
 _ structs.Align[32]
 val atomic.Pointer[T]
}

altho the question of type Align[C const int] Align[C] what a const int generic constrain / parameter means hasn't been solved yet, and other stuff.

In the meanwhile you can use _ [32]byte, struct.Align would be better because it would dynamically scale the padding in order to meet alignment.
To give an example if your cache size / and memory cache resolution protocols is based on 32 bytes, if you have concurrent writes within a 32 bytes cache line, this would cause false sharing and be slow, _ [32]byte would add a 32 bytes buffer and always work, however it is ineficient, because if you are already using 24bytes in the first cacheline, struct.Align[32] would only add the 8 bytes of padding required for the next field to be 32 bytes aligned.

@randall77
Copy link
Contributor

#19057
Probably modified go together with #66408

@mknyszek
Copy link
Contributor

mknyszek commented Jun 5, 2024

In triage, we don't think there's anything for us to do here on the Go side. Closing optimistically, feel free to comment if you disagree.

@mknyszek mknyszek closed this as not planned Won't fix, can't repro, duplicate, stale Jun 5, 2024
@dkropachev
Copy link
Author

@Jorropo , @mknyszek , @randall77, one last question, just to make sure that issue is understood in the full.
This memory chaching+locking issue plays same role in the fact that if I change State structure, performance issue goes away? say I add i int to it, not to the Mutex, but to the State.

@randall77
Copy link
Contributor

Anything that might change the layout of data in memory can affect the performance. Adding a field or removing a field to any object that is allocated could in theory trigger slowness (or fastness).

That said, I don't think we have proof that it is the cause. It just seems likely. If it is the case, there's nothing that Go can do about it - that's a property of the hardware. Careful alignment to avoid false sharing is probably the only solution. If it isn't the case, we'd need to see some evidence. Probably something showing where the time is going using the chip performance counters (cache misses, evictions, that sort of thing).

@dkropachev
Copy link
Author

@randall77 , thanks for clarification, can I ask you not to close this issue at least for two weeks, I will try to get more data, if it is a false sharing issue this data can contribute to solution.

@randall77
Copy link
Contributor

The issue is closed, but it isn't deleted. Feel free to reopen if you have more data that points to an actionable change Go can make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Development

No branches or pull requests

6 participants