
Go 1.23 timer changes can lead to hangs #5868

Closed
dany74q opened this issue Oct 3, 2024 · 8 comments
Labels
bug Something isn't working
Milestone

Comments

@dany74q
Contributor
dany74q commented Oct 3, 2024

Description

Go 1.23 changed the timer semantics so that timer channels are unbuffered.
One side effect of this is that trying to drain a timer's channel after calling t.Stop() can now block forever.

We've observed occasional hangs coming from the timer channel drain in batch_span_processor.go:

func (bsp *batchSpanProcessor) processQueue() {
	// [...]
	if shouldExport {
		if !bsp.timer.Stop() {
			<-bsp.timer.C // <<- This hangs in Go 1.23 w/ the default asynctimerchan=0
		}
		// [...]
	}
}

Relevant goroutine stack traces:

goroutine 1 [select, 23 minutes]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).Shutdown.func1()
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:170 +0x108
sync.(*Once).doSlow(0x40001e4120?, 0x4013d17000?)
        /usr/local/go/src/sync/once.go:76 +0xf8
sync.(*Once).Do(...)
        /usr/local/go/src/sync/once.go:67
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).Shutdown(0x40001b8058?, {0x1cec97c8?, 0x2e7794c0?})
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:156 +0x60

goroutine 1096971 [semacquire, 23 minutes]:
sync.runtime_Semacquire(0x7365756c6176222c?)
        /usr/local/go/src/runtime/sema.go:71 +0x2c
sync.(*WaitGroup).Wait(0x4005ddd790)
        /usr/local/go/src/sync/waitgroup.go:118 +0x74
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).Shutdown.func1.1()
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:161 +0x50
created by go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).Shutdown.func1 in goroutine 1
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:159 +0xc8

goroutine 151 [chan receive, 38 minutes]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(0x4005ddd720)
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:330 +0x2dc
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1()
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:128 +0x54
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor in goroutine 1
        /tmp/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.24.0/trace/batch_span_processor.go:126 +0x2b8

Environment

  • OS: linux
  • Architecture: arm64
  • Go Version: 1.23
  • opentelemetry-go version: v1.24.0
@dany74q dany74q added the bug Something isn't working label Oct 3, 2024
dany74q added a commit to dany74q/opentelemetry-go that referenced this issue Oct 3, 2024
- Fix timer channel drain to avoid hanging in go1.23 w/ asynctimerchan=0
@dany74q
Contributor Author
dany74q commented Oct 3, 2024

We can fix this by select-ing on the channel:
#5869

@github-project-automation github-project-automation bot moved this to Needs triage in Go: Triage Oct 3, 2024
@pellared pellared moved this from Needs triage to High priority in Go: Triage Oct 3, 2024
@MrAlias MrAlias added this to the v1.31.0 milestone Oct 3, 2024
dmathieu added a commit that referenced this issue Oct 4, 2024
- Fix timer channel drain to avoid hanging in go1.23 w/ asynctimerchan=0
(#5868)

---------

Co-authored-by: Robert Pająk <pellared@hotmail.com>
Co-authored-by: Damien Mathieu <42@dmathieu.com>
@dmathieu
Member
dmathieu commented Oct 4, 2024

Fixed in #5869.

@dmathieu dmathieu closed this as completed Oct 4, 2024
@github-project-automation github-project-automation bot moved this from High priority to Closed in Go: Triage Oct 4, 2024
dashpole added a commit that referenced this issue Oct 11, 2024
### Added

- Add `go.opentelemetry.io/otel/sdk/metric/exemplar` package which
includes `Exemplar`, `Filter`, `TraceBasedFilter`, `AlwaysOnFilter`,
`HistogramReservoir`, `FixedSizeReservoir`, `Reservoir`, `Value` and
`ValueType` types. These will be used for configuring the exemplar
reservoir for the metrics sdk. (#5747, #5862)
- Add `WithExportBufferSize` option to log batch processor. (#5877)

### Changed

- Enable exemplars by default in `go.opentelemetry.io/otel/sdk/metric`.
Exemplars can be disabled by setting
`OTEL_METRICS_EXEMPLAR_FILTER=always_off` (#5778)
- `Logger.Enabled` in `go.opentelemetry.io/otel/log` now accepts a newly
introduced `EnabledParameters` type instead of `Record`. (#5791)
- `FilterProcessor.Enabled` in
`go.opentelemetry.io/otel/sdk/log/internal/x` now accepts
`EnabledParameters` instead of `Record`. (#5791)
- The `Record` type in `go.opentelemetry.io/otel/log` is no longer
comparable. (#5847)
- Performance improvements for the trace SDK `SetAttributes` method in
`Span`. (#5864)
- Reduce memory allocations for the `Event` and `Link` lists in `Span`.
(#5858)
- Performance improvements for the trace SDK `AddEvent`, `AddLink`,
`RecordError` and `End` methods in `Span`. (#5874)

### Deprecated

- Deprecate all examples under `go.opentelemetry.io/otel/example` as
they are moved to [Contrib
repository](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/main/examples).
(#5854)

### Fixed

- The race condition for multiple `FixedSize` exemplar reservoirs
identified in #5814 is resolved. (#5819)
- Fix log records duplication in case of heterogeneous resource
attributes by correctly mapping each log record to its resource and
scope. (#5803)
- Fix timer channel drain to avoid hanging on Go 1.23. (#5868)
- Fix delegation for global meter providers, and panic when calling
otel.SetMeterProvider. (#5827)
- Change the `reflect.TypeOf` to use a nil pointer to not allocate on
the heap unless necessary. (#5827)
@ash2k
Contributor
ash2k commented Oct 11, 2024

I don't think there was a change in semantics - that would have broken the Go 1 compatibility promise. This program works as expected: https://go.dev/play/p/KGLGXCurLA_o.

What you saw was probably this bug. Go 1.23.0 and 1.23.1 are affected; 1.23.2 has the fix.

@pellared
Member
@ash2k

I don't think there was a change in semantics - that would have broken the Go 1 compatibility promise.

See:

This program works as expected: https://go.dev/play/p/KGLGXCurLA_o.

In your snippet t.Stop() should never return false (thanks to the bugfix you mentioned). However, when the timer is used from multiple goroutines, a deadlock can still happen, e.g. https://go.dev/play/p/vEFtZgmzU-s. The other possibility is that the same goroutine calls t.Stop twice, e.g. https://go.dev/play/p/dgIVo7SM3iv.

@ash2k
Contributor
ash2k commented Oct 12, 2024

I think both examples you provided are buggy use of timers. I'd expect them to deadlock in any version. 🤷

@pellared
Member
pellared commented Oct 12, 2024

I think both examples you provided are buggy use of timers. I'd expect them to deadlock in any version. 🤷

@ash2k, you are right. They deadlock in Go 1.22 as well. I only see a difference between Go 1.22 and Go 1.23 in the output for the following example https://go.dev/play/p/iVkFnd5gVc8.

After experimenting and re-reading a few times, it looks like we could indeed revert the bugfix, thanks to the fix for golang/go#69312.

On the other hand, the current code is still correct and more defensive (e.g. it works for people who have not patched Go 1.23). Therefore, maybe it is better to leave it as it is.

Do you have any preference/suggestion on what we should do and why?

Personally, I have no strong preference.

PS. Thanks a lot for taking the time to give us feedback. I truly appreciate your effort.

@ash2k
Contributor
ash2k commented Oct 12, 2024

Do you have any preference/suggestion on what we should do and why?

I don't think it makes much sense to keep the workaround: so much code using timers is broken by the Go bug that working around it in a single place doesn't help. Also, the current timer code looks buggy when used with Go 1.22 and needs to be reworked anyway:

ForceFlush() can be called at any time. It calls exportSpans() in a new goroutine:

func (bsp *batchSpanProcessor) ForceFlush(ctx context.Context) error {
	// Interrupt if context is already canceled.
	if err := ctx.Err(); err != nil {
		return err
	}
	// Do nothing after Shutdown.
	if bsp.stopped.Load() {
		return nil
	}
	var err error
	if bsp.e != nil {
		flushCh := make(chan struct{})
		if bsp.enqueueBlockOnQueueFull(ctx, forceFlushSpan{flushed: flushCh}) {
			select {
			case <-bsp.stopCh:
				// The batchSpanProcessor is Shutdown.
				return nil
			case <-flushCh:
				// Processed any items in queue prior to ForceFlush being called
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		wait := make(chan error)
		go func() {
			wait <- bsp.exportSpans(ctx)

exportSpans() resets the timer:

// exportSpans is a subroutine of processing and draining the queue.
func (bsp *batchSpanProcessor) exportSpans(ctx context.Context) error {
	bsp.timer.Reset(bsp.o.BatchTimeout)

For Go 1.22 time#Timer.Reset says:

For a Timer created with NewTimer, Reset should be invoked only on stopped or expired timers with drained channels.

This is not the case for the code above. exportSpans() cannot assume that the timer has expired.


With Go 1.23, this looks safe to do.

So, if we want to be compatible with Go 1.22 then the code needs to be reworked to be more careful with the timer. There may be other code too, I haven't looked.

@ash2k
Contributor
ash2k commented Oct 12, 2024

I think there is a goroutine leak in ForceFlush():

wait := make(chan error)
go func() {
	wait <- bsp.exportSpans(ctx)
	close(wait)
}()
// Wait until the export is finished or the context is cancelled/timed out
select {
case err = <-wait:
case <-ctx.Done():
	err = ctx.Err()
}

If the context is canceled before exportSpans() returns, nothing will ever receive from the channel and the goroutine will be stuck. The channel needs to be buffered so there is space for the error value.

Also, there is no need to close the channel.
