Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/jaeger] Flaky TestJaegerAgentUDP_ThriftCompact #10368

Closed
djaglowski opened this issue May 26, 2022 · 14 comments · Fixed by #10872
Closed

[receiver/jaeger] Flaky TestJaegerAgentUDP_ThriftCompact #10368

djaglowski opened this issue May 26, 2022 · 14 comments · Fixed by #10872
Assignees
Labels
bug Something isn't working flaky test a test is flaky good first issue Good for newcomers help wanted Extra attention is needed never stale Issues marked with this label will be never staled and automatically removed priority:p3 Lowest receiver/jaeger

Comments

@djaglowski
Copy link
Member

=== RUN   TestJaegerAgentUDP_ThriftCompact
    jaeger_agent_test.go:204: 
        	Error Trace:	jaeger_agent_test.go:204
        	            				jaeger_agent_test.go:53
        	Error:      	Received unexpected error:
        	            	listen udp :55256: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
        	Test:       	TestJaegerAgentUDP_ThriftCompact
        	Messages:   	Start failed
--- FAIL: TestJaegerAgentUDP_ThriftCompact (0.23s)

https://github.com/open-telemetry/opentelemetry-collector-contrib/runs/6614143811?check_suite_focus=true#step:7:4448

@djaglowski
Copy link
Member Author

Reopening, as this is still failing intermittently.

Observed here: https://github.com/open-telemetry/opentelemetry-collector-contrib/runs/6863664900?check_suite_focus=true#step:7:4483

@djaglowski djaglowski reopened this Jun 13, 2022
@jpkrohling jpkrohling self-assigned this Jul 5, 2022
@jpkrohling jpkrohling added the comp:jaeger Jaeger related issues label Jul 29, 2022
@jpkrohling
Copy link
Member

@frzifus, would you like to pick this one up?

@frzifus
Copy link
Member

frzifus commented Aug 25, 2022

Sure :)

@frzifus
Copy link
Member

frzifus commented Sep 9, 2022

For me it looks like there is something wrong in this area. Unfortunately ive no emulation or windows machine in place.

@dgoscn would you like to give it a try?

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/ed4cafc263667b8b57436e6413d83e9caf2cb2d7/internal/common/testutil/testutil.go#L42-L57

@dgoscn
Copy link
Contributor

dgoscn commented Sep 9, 2022

Yes @frzifus of course. Thanks

@dgoscn
Copy link
Contributor

dgoscn commented Sep 16, 2022

Hi @djaglowski. How are you?

I had some trouble to replicate this error that you mentioned here. But, I made some tries. I will try to replicate here what I did and please, if what I made makes no sense, just say 🦖 thank you


So, what I did:

In the function GetAvailableLocalNetworkAddress located at /internal/commom/testutil/testutil.go I added the following for to check the values of the port

	portFound := false
	if runtime.GOOS == "windows" {
		exclusions = getExclusionsList(t)
               // The line below were added
		for _, pp := range exclusions {
			fmt.Printf("From %s to %s \n", pp.first, pp.last)
		}
	}

And I added another Println just to get the result of the endpoint declared right below the code above, more precisely in line 60 :

	var endpoint string
	for !portFound {
		endpoint = findAvailableAddress(t, network)
		_, port, err := net.SplitHostPort(endpoint)
		require.NoError(t, err)
		portFound = true
		if runtime.GOOS == "windows" {
			for _, pair := range exclusions {
				if port >= pair.first && port <= pair.last {
					portFound = false
					break
				}
			}
		}
	}
fmt.Println(endpoint) // <------ line added

I executed a go test just to check the output and to see if were throwing some error or showing some blocked port. But I got the following output, for instance:

image

So, let's go back to the receiver/jaegerreceiver/jaeger_agent_test.go where the function mentioned is used: TestJaegerAgentUDP_ThriftCompact

I went to this path, and made a -failtest to the go test -run with count of 3000 for 10minutes and received the following scenario, but I also made the test with the count value 4000 and 5000 receiving the same output.:

go test -run TestJaegerAgentUDP_ThriftCompact -v -count 3000 -failfast and got the:

image

I let the test running for some minutes in all of the tests, it pass! I decreased the number of -count to 1000

image

And after decresead to 100:

image

I noticed that if I run the -count with the value 100 or a number less than 1000, I get success and for a higher than, I could get the timout printed above.

So, when I run for example the --count for a higher number than 3000, sometimes I can get the error that you mentioned on the top of the issue. But it's not always that I got the error. I made some workarounds, killing some process using a port that was in use or something like that, allowing in firewall etc, but I think that is not the right answer.

image

I was wondering to make a check if the port is already in use, if this returns true, we can skip the port or something like that... How we are getting some block binds to UDP ports, I found this piece of code that "maybe" we can "reuse", make some changes and resolve this definitively.

image

The path for the code above:
pkg/stanza/operator/input/udp/udp_test.go line 165

Do you think that makes sense?

Thank you

@djaglowski
Copy link
Member Author

@dgoscn, thanks for looking into this. It's great that you've determined a way to reproduce the issue.

I was wondering to make a check if the port is already in use, if this returns true, we can skip the port or something like that... How we are getting some block binds to UDP ports, I found this piece of code that "maybe" we can "reuse", make some changes and resolve this definitively.

I'm not sure we can solve the problem by validating the port independently of the actual test, because some (tiny) amount of time will pass before we attempt to use it. But perhaps you are suggesting we just recover and retry the test? That may be reasonable if limited to a small number of retries.

@dgoscn
Copy link
Contributor

dgoscn commented Sep 16, 2022

@djaglowski thanks for your answer.

Hmmm. Yeah, I think that make sense what you said. About the reasonable time for the test, yes. I agree! from what I observed, keep a small number it's enough, because we are not trying to make some kind of brute force test. Maybe a limited number of retries, make sense.

Thank you one more time. o/

@evan-bradley evan-bradley added receiver/jaeger and removed comp:jaeger Jaeger related issues labels Sep 30, 2022
@github-actions
Copy link
Contributor

Pinging code owners: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Copy link
Member

I don't remember seeing this issue happening anymore, but I also don't remember having this fixed. I'm leaving it open for some more time.

@dgoscn
Copy link
Contributor

dgoscn commented Nov 30, 2022

great @jpkrohling

@github-actions
Copy link
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 30, 2023
@fatsheep9146 fatsheep9146 added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Jan 30, 2023
@jpkrohling
Copy link
Member

Closing, as there haven't been any new reports about this. Feel free to reopen or comment if this happens again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky test a test is flaky good first issue Good for newcomers help wanted Extra attention is needed never stale Issues marked with this label will be never staled and automatically removed priority:p3 Lowest receiver/jaeger
Projects
None yet
7 participants