
Fix panic, deadlock and race in BatchSpanProcessor #4353

Merged
9 commits merged into open-telemetry:main on Jul 25, 2023

Conversation

@pellared (Member) commented Jul 24, 2023

Fixes #4249

Fixes #4335

Test result of TestBatchSpanProcessorConcurrentSafe executed on main (without the fixes):

go test -race -run TestBatchSpanProcessorConcurrentSafe -count=100 -timeout 10s
panic: test timed out after 10s
running tests:
        TestBatchSpanProcessorConcurrentSafe (10s)

goroutine 81 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/go/src/testing/testing.go:2241 +0x219
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x48

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000396000, {0x9c7b81, 0x24}, 0x9ee310)
        /usr/local/go/src/testing/testing.go:1630 +0x82e
testing.runTests.func1(0x0?)
        /usr/local/go/src/testing/testing.go:2036 +0x8e
testing.tRunner(0xc000396000, 0xc00065fb48)
        /usr/local/go/src/testing/testing.go:1576 +0x217
testing.runTests(0xc0000b7f40?, {0xcf1600, 0x5f, 0x5f}, {0x1c?, 0x4a8e99?, 0xcf96a0?})
        /usr/local/go/src/testing/testing.go:2034 +0x87d
testing.(*M).Run(0xc0000b7f40)
        /usr/local/go/src/testing/testing.go:1906 +0xb45
main.main()
        _testmain.go:269 +0x2ea

goroutine 73 [semacquire]:
sync.runtime_Semacquire(0xc0001200a8?)
        /usr/local/go/src/runtime/sema.go:62 +0x27
sync.(*WaitGroup).Wait(0xc0001200a0)
        /usr/local/go/src/sync/waitgroup.go:116 +0xa5
go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe(0xc0003961a0)
        /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:637 +0x705
testing.tRunner(0xc0003961a0, 0x9ee310)
        /usr/local/go/src/testing/testing.go:1576 +0x217
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1629 +0x806

goroutine 76 [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).ForceFlush(0xc00010e000, {0xa8c5d0?, 0xc00001a0c8})
        /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:188 +0x179
go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe.func2()
        /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:616 +0xc3
created by go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe
        /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:614 +0x3ca
exit status 2
FAIL    go.opentelemetry.io/otel/sdk/trace      10.027s

As you can see, ForceFlush is not properly synchronized and a deadlock is possible.
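
For context, the goroutine dump indicates the test ends spans and calls ForceFlush (on both the processor and the provider) from separate goroutines. Below is a hypothetical, simplified sketch of that kind of concurrency stress test written against the public SDK API; it is not the actual batch_span_processor_test.go, and the test name, iteration counts, and noopExporter helper are illustrative only.

```go
package stress

import (
	"context"
	"sync"
	"testing"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// noopExporter is a stand-in exporter so the sketch is self-contained.
type noopExporter struct{}

func (noopExporter) ExportSpans(context.Context, []sdktrace.ReadOnlySpan) error { return nil }
func (noopExporter) Shutdown(context.Context) error                             { return nil }

// TestBatchSpanProcessorConcurrentSafeSketch drives the processor from several
// goroutines at once: ending spans, flushing the processor, and flushing the
// provider, then shutting everything down.
func TestBatchSpanProcessorConcurrentSafeSketch(t *testing.T) {
	ctx := context.Background()
	bsp := sdktrace.NewBatchSpanProcessor(noopExporter{})
	tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(bsp))

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(3)
		go func() { // exercises OnEnd / enqueue
			defer wg.Done()
			_, span := tp.Tracer("stress").Start(ctx, "span")
			span.End()
		}()
		go func() { // exercises processor ForceFlush
			defer wg.Done()
			_ = bsp.ForceFlush(ctx)
		}()
		go func() { // exercises provider ForceFlush
			defer wg.Done()
			_ = tp.ForceFlush(ctx)
		}()
	}
	wg.Wait()
	_ = tp.Shutdown(ctx)
}
```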

Here is the test result after the fixes are applied (note that the test count is 100 times higher):

$ go test -race -run TestBatchSpanProcessorConcurrentSafe -count=10000 -timeout 10s
PASS
ok      go.opentelemetry.io/otel/sdk/trace      7.778s

Here is the test result when only 68e62aa is cherry-picked onto main (it contains a hotfix for the deadlock, but the race condition is still in place). This demonstrates that the PR fixes #4249.

$ go test -race -run TestBatchSpanProcessorConcurrentSafe -count=10000 -timeout 10s
==================
WARNING: DATA RACE
Write at 0x00c000490eb0 by goroutine 765:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:357 +0x0
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).drainQueue()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:350 +0x335
  go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:129 +0x7c

Previous read at 0x00c000490eb0 by goroutine 766:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:160 +0x0
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).enqueueDrop()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:416 +0x194
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).enqueue()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:360 +0xa4
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).OnEnd()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:144 +0x52
  go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/span.go:419 +0xc44
  go.opentelemetry.io/otel/sdk/trace_test.generateSpan()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:362 +0x39d
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe.func1()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:610 +0x10c

Goroutine 765 (running) created at:
  go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:126 +0x52e
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:600 +0xcd
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x47

Goroutine 766 (running) created at:
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:608 +0x2a4
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x47
==================
--- FAIL: TestBatchSpanProcessorConcurrentSafe (0.00s)
    testing.go:1446: race detected during execution of test
==================
WARNING: DATA RACE
Write at 0x00c00062e070 by goroutine 4345:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:357 +0x0
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).drainQueue()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:350 +0x335
  go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:129 +0x7c

Previous read at 0x00c00062e070 by goroutine 4347:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:160 +0x0
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).enqueueBlockOnQueueFull()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:393 +0x209
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).ForceFlush()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:187 +0xeb
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe.func2()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:616 +0xc2

Goroutine 4345 (running) created at:
  go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:126 +0x52e
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:600 +0xcd
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x47

Goroutine 4347 (finished) created at:
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:614 +0x3c9
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x47
==================
--- FAIL: TestBatchSpanProcessorConcurrentSafe (0.00s)
    testing.go:1446: race detected during execution of test
==================
WARNING: DATA RACE
Read at 0x00c00063c0d0 by goroutine 46132:
  runtime.chansend()
      /usr/local/go/src/runtime/chan.go:160 +0x0
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).enqueueBlockOnQueueFull()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:393 +0x209
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).ForceFlush()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:187 +0xeb
  go.opentelemetry.io/otel/sdk/trace.(*TracerProvider).ForceFlush()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/provider.go:262 +0x126
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe.func4()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:628 +0xaa

Previous write at 0x00c00063c0d0 by goroutine 46128:
  runtime.closechan()
      /usr/local/go/src/runtime/chan.go:357 +0x0
  go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).drainQueue()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:350 +0x335
  go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func1()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:129 +0x7c

Goroutine 46132 (running) created at:
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:626 +0x5f2
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x47

Goroutine 46128 (running) created at:
  go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor.go:126 +0x52e
  go.opentelemetry.io/otel/sdk/trace_test.TestBatchSpanProcessorConcurrentSafe()
      /home/rpajak/repos/opentelemetry-go/sdk/trace/batch_span_processor_test.go:600 +0xcd
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1576 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1629 +0x47
==================
--- FAIL: TestBatchSpanProcessorConcurrentSafe (0.00s)
    testing.go:1446: race detected during execution of test
FAIL
exit status 1
FAIL    go.opentelemetry.io/otel/sdk/trace      8.537s
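
Each of the reports above pairs runtime.closechan in drainQueue with runtime.chansend in one of the enqueue paths. The underlying hazard class can be reproduced with a tiny standalone program (purely illustrative, not the SDK code): one goroutine closes a channel while another goroutine may still send on it. Run it with `go run -race`; the detector typically reports a chansend/closechan race, and the send can additionally panic with "send on closed channel".

```go
// send_close_race.go — hypothetical sketch, not the SDK implementation.
package main

import (
	"sync"
	"time"
)

func main() {
	queue := make(chan int, 16) // stands in for bsp.queue
	var wg sync.WaitGroup

	// Consumer: drains for a short while, then closes the channel on
	// "shutdown", analogous to drainQueue closing the queue.
	wg.Add(1)
	go func() {
		defer wg.Done()
		stop := time.After(2 * time.Millisecond)
		for {
			select {
			case <-queue:
			case <-stop:
				close(queue) // unsafe: producers may still be sending
				return
			}
		}
	}()

	// Producer: keeps enqueuing, analogous to OnEnd / ForceFlush enqueuing
	// spans without synchronizing against shutdown.
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() { _ = recover() }() // swallow a possible send-on-closed-channel panic
		for deadline := time.Now().Add(5 * time.Millisecond); time.Now().Before(deadline); {
			select {
			case queue <- 1:
			default:
			}
		}
	}()

	wg.Wait()
}
```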

It may be good to revisit the synchronization implementation, but I would rather do that in a separate PR; I think the current implementation is good enough for now. I created #4354 to track this.
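
For illustration only, one common pattern for making this kind of enqueue-versus-shutdown interaction safe is to never close the queue channel while senders may still be active, for example by guarding sends with an RWMutex-protected stopped flag. The sketch below is not the change made in this PR; the safeQueue type and its methods are hypothetical.

```go
package main

import "sync"

type span struct{} // placeholder element type for the sketch

// safeQueue serializes channel close against concurrent sends.
type safeQueue struct {
	mu      sync.RWMutex
	stopped bool
	ch      chan span
}

func newSafeQueue(size int) *safeQueue {
	return &safeQueue{ch: make(chan span, size)}
}

// enqueue reports whether the item was accepted; it refuses new items once
// the queue is stopped and drops items when the buffer is full.
func (q *safeQueue) enqueue(s span) bool {
	q.mu.RLock()
	defer q.mu.RUnlock()
	if q.stopped {
		return false
	}
	select {
	case q.ch <- s:
		return true
	default:
		return false
	}
}

// stop marks the queue as stopped and closes the channel. Taking the write
// lock excludes every in-flight enqueue, so the close can never race with a
// send.
func (q *safeQueue) stop() {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.stopped {
		return
	}
	q.stopped = true
	close(q.ch)
}

func main() {
	q := newSafeQueue(8)
	go func() {
		for i := 0; i < 100; i++ {
			q.enqueue(span{})
		}
	}()
	q.stop() // safe even while the goroutine above is still enqueuing
}
```

The trade-off is that every enqueue takes a read lock; whether that overhead is acceptable on the hot OnEnd path is the kind of question a follow-up like #4354 could weigh.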

codecov bot commented Jul 24, 2023

Codecov Report

Merging #4353 (0e226f7) into main (e26d8bd) will decrease coverage by 0.1%.
The diff coverage is 70.0%.

@@           Coverage Diff           @@
##            main   #4353     +/-   ##
=======================================
- Coverage   83.5%   83.4%   -0.1%     
=======================================
  Files        184     184             
  Lines      14383   14374      -9     
=======================================
- Hits       12011   11997     -14     
- Misses      2145    2149      +4     
- Partials     227     228      +1     
Files Changed                        Coverage Δ
sdk/trace/batch_span_processor.go    79.0% <70.0%> (-3.0%) ⬇️

@pellared changed the title from "Fix races in BatchSpanProcessor" to "Fix deadlock and race in BatchSpanProcessor" on Jul 24, 2023
@pellared (Member, Author) commented Jul 24, 2023

FYI @peolivei2

Review threads on sdk/trace/batch_span_processor.go (resolved).
Co-authored-by: David Ashpole <dashpole@google.com>
@pellared changed the title from "Fix deadlock and race in BatchSpanProcessor" to "Fix panic, deadlock and race in BatchSpanProcessor" on Jul 24, 2023
@pellared merged commit 088ac8e into open-telemetry:main on Jul 25, 2023
20 of 21 checks passed
@pellared deleted the fix-race-bsp branch on Jul 25, 2023 08:13
@MrAlias added this to the v1.17.0 milestone on Aug 3, 2023

Successfully merging this pull request may close these issues.

panic: runtime error: invalid memory address or nil pointer dereference
Data race on BatchSpanProcessor
4 participants