[Bug]: Flaky Test: TestSpanProcessorWithOnDroppedSpanOption #4450

albertteoh · 2023-05-09T11:12:36Z

What happened?

PR (note changes are unrelated to the test failure): #4446

Test failure link

https://github.com/jaegertracing/jaeger/actions/runs/4919763496/jobs/8787773908?pr=4446#step:7:590

Retrying the test job resulted in all tests passing.

Steps to reproduce

Could not reproduce.

The failing test was introduced in this PR: #4387

Expected behavior

Tests to pass.

Relevant log output

=== RUN   TestSpanProcessorWithOnDroppedSpanOption
    span_processor_test.go:682: 
        	Error Trace:	/home/runner/work/jaeger/jaeger/cmd/collector/app/span_processor_test.go:682
        	Error:      	Not equal: 
        	            	expected: []string{"op2", "op3"}
        	            	actual  : []string{"op3"}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,3 +1,2 @@
        	            	-([]string) (len=2) {
        	            	- (string) (len=3) "op2",
        	            	+([]string) (len=1) {
        	            	  (string) (len=3) "op3"
        	Test:       	TestSpanProcessorWithOnDroppedSpanOption
--- FAIL: TestSpanProcessorWithOnDroppedSpanOption (0.00s)
FAIL
	github.com/jaegertracing/jaeger/cmd/collector/app	coverage: 98.5% of statements

Screenshot

No response

Additional context

No response

Jaeger backend version

No response

SDK

No response

Pipeline

No response

Storage backend

No response

Operating system

No response

Deployment model

No response

Deployment configs

No response

ChenX1993 · 2023-05-10T05:45:30Z

I think it is a concurrency problem.
I was trying to use the blocking writer to create a span-drop situation, but the second test span is not guaranteed to be dropped.

jaeger/cmd/collector/app/span_processor_test.go

Lines 657 to 683 in 5c3c20d

    
    func TestSpanProcessorWithOnDroppedSpanOption(t *testing.T) {
    	var droppedOperations []string
    	customOnDroppedSpan := func(span *model.Span) {
    		droppedOperations = append(droppedOperations, span.OperationName)
    	}
    	w := &blockingWriter{}
    	p := NewSpanProcessor(w,
    		nil,
    		Options.NumWorkers(1),
    		Options.QueueSize(1),
    		Options.OnDroppedSpan(customOnDroppedSpan),
    	).(*spanProcessor)
    	defer p.Close()
    	// block the writer so that the first span is read from the queue and blocks the processor, and followings are dropped.
    	w.Lock()
    	defer w.Unlock()
    	_, err := p.ProcessSpans([]*model.Span{
    		{OperationName: "op1"},
    		{OperationName: "op2"},
    		{OperationName: "op3"},
    	}, processor.SpansOptions{SpanFormat: processor.JaegerSpanFormat})
    	assert.NoError(t, err)
    	assert.Equal(t, []string{"op2", "op3"}, droppedOperations)
    }

Below is the consumer side of the bounded_queue used by the collector's span processor.
If the second test span is enqueued slowly enough that the worker has already reached line 83 (`q.size.Sub(1)`) while picking up the first span, the second span is enqueued successfully. The test then fails because it expects the second span to be dropped. The same applies to the third span.

jaeger/pkg/queue/bounded_queue.go

Lines 67 to 97 in 5c3c20d

    
    func (q *BoundedQueue) StartConsumersWithFactory(num int, factory func() Consumer) {
    	q.workers = num
    	q.factory = factory
    	var startWG sync.WaitGroup
    	for i := 0; i < q.workers; i++ {
    		q.stopWG.Add(1)
    		startWG.Add(1)
    		go func() {
    			startWG.Done()
    			defer q.stopWG.Done()
    			consumer := q.factory()
    			queue := *q.items
    			for {
    				select {
    				case item, ok := <-queue:
    					if ok {
    						q.size.Sub(1)
    						consumer.Consume(item)
    					} else {
    						// channel closed, finish worker
    						return
    					}
    				case <-q.stopCh:
    					// the whole queue is closing, finish worker
    					return
    				}
    			}
    		}()
    	}
    	startWG.Wait()
    }

If I change the test as shown below, it fails almost every time:

...
p.ProcessSpans([]*model.Span{{OperationName: "op1"}}, processor.SpansOptions{SpanFormat: processor.JaegerSpanFormat})
time.Sleep(1 * time.Second)
p.ProcessSpans([]*model.Span{{OperationName: "op2"}}, processor.SpansOptions{SpanFormat: processor.JaegerSpanFormat})
p.ProcessSpans([]*model.Span{{OperationName: "op3"}}, processor.SpansOptions{SpanFormat: processor.JaegerSpanFormat})
...

And the error is the same:

Not equal: 
        expected: []string{"op2", "op3"}
        actual  : []string{"op3"}

I will think about how to write the test.


albertteoh added the bug label May 9, 2023

yurishkuro added the good first issue and help wanted labels May 9, 2023

ChenX1993 mentioned this issue May 10, 2023

Fix flaky test - TestSpanProcessorWithOnDroppedSpanOption #4451

Closed

yurishkuro mentioned this issue May 27, 2023

Fix flaky test - TestSpanProcessorWithOnDroppedSpanOption #4489

Merged

yurishkuro closed this as completed in #4489 May 27, 2023

yurishkuro added a commit that referenced this issue May 27, 2023

Fix flaky test - TestSpanProcessorWithOnDroppedSpanOption (#4489)

756bc4e

* Resolves #4450 * Alternative to #4451 Signed-off-by: Yuri Shkuro <github@ysh.us>
