Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky unit test TestServerSinglePort #5519

Closed
yurishkuro opened this issue Jun 3, 2024 · 11 comments · Fixed by #5559
Closed

Flaky unit test TestServerSinglePort #5519

yurishkuro opened this issue Jun 3, 2024 · 11 comments · Fixed by #5559
Labels
bug good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@yurishkuro
Copy link
Member

This test failed several times recently:

--- FAIL: TestServerSinglePort (10.01s)
    logger.go:146: 2024-06-03T15:06:55.394Z	INFO	Using UI configuration	{"path": ""}
    logger.go:146: 2024-06-03T15:06:55.394Z	INFO	Query server started	{"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-03T15:06:55.395Z	INFO	Starting GRPC server	{"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-03T15:06:55.395Z	INFO	Starting HTTP server	{"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-03T15:06:55.395Z	INFO	Starting CMUX server	{"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-03T15:07:05.399Z	INFO	GRPC server stopped	{"port": 16686, "addr": ":16686"}
    logger.go:146: 2024-06-03T15:07:05.399Z	INFO	HTTP server stopped	{"port": 16686, "addr": ":16686"}
    server_test.go:622: 
        	Error Trace:	/home/runner/work/jaeger/jaeger/cmd/query/app/server_test.go:622
        	            				/opt/hostedtoolcache/go/1.22.3/x64/src/testing/testing.go:[117](https://github.com/jaegertracing/jaeger/actions/runs/9352539113/job/25740954249#step:8:118)5
        	            				/opt/hostedtoolcache/go/1.22.3/x64/src/testing/testing.go:1353
        	            				/opt/hostedtoolcache/go/1.22.3/x64/src/testing/testing.go:1683
        	Error:      	Received unexpected error:
        	            	close tcp [::]:16686: use of closed network connection
        	Test:       	TestServerSinglePort
FAIL
coverage: 27.8% of statements
FAIL	github.com/jaegertracing/jaeger/cmd/query/app	[121](https://github.com/jaegertracing/jaeger/actions/runs/9352539113/job/25740954249#step:8:122).373s

Looks like we have some kind of race condition in closing the connection more than once. When we use cmux, the listener is not owned by the http/grpc servers, but they probably each try to close it.

@yurishkuro yurishkuro added bug help wanted Features that maintainers are willing to accept but do not have cycles to implement good first issue Good for beginners labels Jun 3, 2024
@babugeet
Copy link

babugeet commented Jun 3, 2024

Can i work on this issue ?
/assign

@varshith257
Copy link
Contributor

varshith257 commented Jun 3, 2024

@babugeet Yes, go ahead. There is no need for issue assignment. Feel free to make a PR

@arujjval
Copy link

arujjval commented Jun 9, 2024

@yurishkuro I tried to replicate it, but I got no such error
image

@yurishkuro
Copy link
Member Author

I know, this is why it's hard to fix. But the issue exists. It's not even clear from the error which specific server is causing it. A couple of things we could do to make it easier to debug: wrap the errors with contextual messages, and perhaps add logging to the Close methods to see in which order the servers are being closed when the error happens.

@vermaaatul07
Copy link
Contributor

@yurishkuro I've been troubleshooting a persistent timeout issue with the TestServerSinglePort test . Here's what I've done so far:

  1. Dynamic Port Assignment: Used ":0" for automatic port selection.
  2. Enhanced Logging: Improved logging in TestServerSinglePort, Start, and Close methods.
  3. Error Handling: Ensured graceful handling and logging of server start/stop errors.

Despite these changes, the test still times out after 5 minutes. It seems the issue might be within the server code itself. Any guidance would be greatly appreciated.

@vermaaatul07
Copy link
Contributor

I tried to replicate it, but I got no such error

The error does not seem to reproduce consistently, which makes it difficult to diagnose the exact cause.

@yurishkuro
Copy link
Member Author

@vermaaatul07 why timeout? The error log in the ticket is an attempt to close an already closed connection.

@yurishkuro
Copy link
Member Author

I've been troubleshooting

@vermaaatul07 the changes you described seem like useful addition even if they don't resolve the issue. Why don't you open a PR for those? Maybe with increased logging we will be able to pinpoint the issue if the test fails again.

@vermaaatul07
Copy link
Contributor

vermaaatul07 commented Jun 9, 2024

why timeout? The error log in the ticket is an attempt to close an already closed connection.

@yurishkuro I think the timeout happens because the test waits indefinitely for the server to start or stop, indicating an issue with managing the server lifecycle. The error about attempting to close an already closed connection points to a synchronization problem. I have added logging to the Close methods and wrapped errors for better context, but the issue persists.

Edir : now its been resolved

@vermaaatul07
Copy link
Contributor

the changes you described seem like useful addition even if they don't resolve the issue. Why don't you open a PR for those? Maybe with increased logging we will be able to pinpoint the issue if the test fails again.

Thanks for the feedback, i will open a PR for those.

@vermaaatul07
Copy link
Contributor

I know, this is why it's hard to fix. But the issue exists.

@yurishkuro I ran the test 100 times, and it passed every single time. But since the issue was flaky, I'm wondering how we can be sure it's really fixed. Any tips on what else I can do to confirm the fix? Could you take a look and let me know if there's anything else I should check?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
5 participants