-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Go 1.23 timer changes can lead to hangs #5868
Comments
- Fix timer channel drain to avoid hanging in go1.23 w/ asynctimerchan=0
We can fix this by select-ing on the channel: |
- Fix timer channel drain to avoid hanging in go1.23 w/ asynctimerchan=0 (#5868) --------- Co-authored-by: Robert Pająk <pellared@hotmail.com> Co-authored-by: Damien Mathieu <42@dmathieu.com>
Fixed in #5869. |
### Added - Add `go.opentelemetry.io/otel/sdk/metric/exemplar` package which includes `Exemplar`, `Filter`, `TraceBasedFilter`, `AlwaysOnFilter`, `HistogramReservoir`, `FixedSizeReservoir`, `Reservoir`, `Value` and `ValueType` types. These will be used for configuring the exemplar reservoir for the metrics sdk. (#5747, #5862) - Add `WithExportBufferSize` option to log batch processor.(#5877) ### Changed - Enable exemplars by default in `go.opentelemetry.io/otel/sdk/metric`. Exemplars can be disabled by setting `OTEL_METRICS_EXEMPLAR_FILTER=always_off` (#5778) - `Logger.Enabled` in `go.opentelemetry.io/otel/log` now accepts a newly introduced `EnabledParameters` type instead of `Record`. (#5791) - `FilterProcessor.Enabled` in `go.opentelemetry.io/otel/sdk/log/internal/x` now accepts `EnabledParameters` instead of `Record`. (#5791) - The `Record` type in `go.opentelemetry.io/otel/log` is no longer comparable. (#5847) - Performance improvements for the trace SDK `SetAttributes` method in `Span`. (#5864) - Reduce memory allocations for the `Event` and `Link` lists in `Span`. (#5858) - Performance improvements for the trace SDK `AddEvent`, `AddLink`, `RecordError` and `End` methods in `Span`. (#5874) ### Deprecated - Deprecate all examples under `go.opentelemetry.io/otel/example` as they are moved to [Contrib repository](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/main/examples). (#5854) ### Fixed - The race condition for multiple `FixedSize` exemplar reservoirs identified in #5814 is resolved. (#5819) - Fix log records duplication in case of heterogeneous resource attributes by correctly mapping each log record to it's resource and scope. (#5803) - Fix timer channel drain to avoid hanging on Go 1.23. (#5868) - Fix delegation for global meter providers, and panic when calling otel.SetMeterProvider. (#5827) - Change the `reflect.TypeOf` to use a nil pointer to not allocate on the heap unless necessary. (#5827)
I don't think there was a change in semantics - that would have broken the Go 1 compatibility promise. This program works as expected: https://go.dev/play/p/KGLGXCurLA_o. What you saw was probably this bug. Go 1.23.0 and 1.23.1 are affected, 1.23.2 has the fix. |
See:
In your snippet |
I think both examples you provided are buggy use of timers. I'd expect them to deadlock in any version. 🤷 |
@ash2k, you are right. They deadlock in Go 1.22 as well. I only see a difference between Go 1.22 and Go 1.23 in the output for the following example https://go.dev/play/p/iVkFnd5gVc8. After experimenting and reading a few times it looks like we could indeed revert the bugfix thanks to the fix for golang/go#69312. On the other side, the current code is still correct and more defensive (e.g. works for people who have not patched Go 1.23). Therefore, maybe it is better to leave it as it is. Do you have any preference/suggestion on what we should do and why? Personally, I have no strong preference. PS. Thanks a lot for taking your time to needed to give us feedback. I truly appreciate your effort. |
I don't think there is a lot of sense to keep the workaround - there is so much code using timers that is broken by the Go bug, that having a workaround in a single place doesn't make sense. Also, looks like the current timer code is buggy if used with Go 1.22 and needs to be reworked anyway:
opentelemetry-go/sdk/trace/batch_span_processor.go Lines 178 to 206 in b3c313f
opentelemetry-go/sdk/trace/batch_span_processor.go Lines 262 to 264 in b3c313f
For Go 1.22
This is not the case for the code above. With Go 1.23 looks like this is safe to do. So, if we want to be compatible with Go 1.22 then the code needs to be reworked to be more careful with the timer. There may be other code too, I haven't looked. |
I think there is a goroutine leak in opentelemetry-go/sdk/trace/batch_span_processor.go Lines 204 to 214 in b3c313f
If context is canceled before Also, no need to close the channel. |
Description
Go 1.23 timer semantic changes the timer channels to be non-buffered;
One side effect of this is that now, trying to drain timer channels after calling t.Stop() will always block.
We've observed occasional hangs coming from the timer channel drain in
batch_span_processor.go
:Environment
The text was updated successfully, but these errors were encountered: