IpVersions/EchoIntegrationTest.AddRemoveListener/IPv6 is flaky #3997

Closed
lizan opened this issue Jul 31, 2018 · 3 comments · Fixed by #4304

@lizan
Member

lizan commented Jul 31, 2018

Description:
On master (028387a), running:
bazel test --runs_per_test=100 //test/integration:echo_integration_test

results in 2 of 100 runs ending in TIMEOUT; with "-l trace", 8 of 100 runs time out.

@mattklein123
Member

I've seen this also on my own machine.

@zuercher
Member

zuercher commented Aug 14, 2018

I poked around at this a bit yesterday evening, and it seems to be a race in the AddRemoveListenerTest between the RawConnectionDriver making a connect attempt and the actual listener socket being closed.

It seems like sometimes the listener socket is closed concurrently with the RawConnectionDriver's connect attempt. The RawConnectionDriver's ConnectionImpl sees a successful connection (write event triggers onWriteReady and getsockopt returns no error) and then writes the initial data. No further events occur and the RawConnectionDriver waits in Dispatcher::run until the test times out.

When the test passes, the connect happens either before or after the socket close, which leads to either an immediate connect failure or a deferred one. In both cases the test terminates successfully.
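For readers unfamiliar with how a non-blocking connect reports its result, here is a minimal Python sketch of the write-ready plus `getsockopt(SO_ERROR)` check described above. It is not Envoy's `ConnectionImpl` code; `nonblocking_connect` is a hypothetical helper, and the sketch assumes a POSIX-like platform and IPv4 loopback. It does not reproduce the race itself. It only shows the two deterministic outcomes: the listener is still open, or the listener is already closed.

```python
import errno
import select
import socket


def nonblocking_connect(addr, timeout=2.0):
    """Start a non-blocking connect and return the resulting errno (0 on success).

    Hypothetical helper for illustration only.
    """
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.setblocking(False)
    try:
        rc = client.connect_ex(addr)
        # 0 means the connect completed immediately; EINPROGRESS/EWOULDBLOCK
        # mean it is pending. Anything else is an immediate failure.
        if rc not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            return rc
        # Wait for the socket to become writable (the "write event").
        _, writable, _ = select.select([], [client], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # SO_ERROR carries the deferred connect result: 0 means connected.
        return client.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        client.close()


# Stand-in for the listener under test, bound to an ephemeral loopback port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
addr = listener.getsockname()

open_result = nonblocking_connect(addr)    # listener still open
listener.close()
closed_result = nonblocking_connect(addr)  # listener already closed

print("listener open:  ", open_result)
print("listener closed:", closed_result)
```

With the listener open, the connect result is 0 even though nothing ever calls `accept`, because the kernel completes the handshake on the listener's behalf. That is consistent with the client seeing "no error" and then waiting for events that never arrive if the listener goes away at the wrong moment.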

@alyssawilk
Contributor

This was failing often enough today that I'd back disabling first and debugging later, if anyone is willing to own the debugging.

alyssawilk self-assigned this Aug 30, 2018
htuch pushed a commit that referenced this issue Aug 30, 2018
At least one failure mode is that when the listener was released, some other test would yoink the released port, and the "make sure we can not connect to a removed listener" check would unexpectedly result in a connection. Running the test as exclusive should fix that particular failure mode, and allow us to see if others exist.

I believe the test was flaking more often when run in parallel with -l trace because the test ran more slowly: the lag between the listener releasing the port and the raw connection driver's connect attempt increased, so the likelihood that another test would snag the port also increased.

Risk Level: Low (test only)
Testing: 1000 runs with "exclusive"
Docs Changes: n/a
Release Notes: n/a

Fixes #3997

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
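
The port-reuse failure mode described in the commit message can be illustrated with a short Python sketch. This is not the Envoy test; `listen_on` and `can_connect` are hypothetical helpers, and the sketch assumes IPv4 loopback and a port that no unrelated process grabs during the run. A second listener stands in for "some other test" that picks up the released port.

```python
import socket


def listen_on(port=0):
    """Open a loopback listener; port 0 asks the OS for an ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(1)
    return sock


def can_connect(port, timeout=1.0):
    """Return True if a TCP connect to the loopback port succeeds."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False


# The listener under test is added, then removed, releasing its port.
original = listen_on()
port = original.getsockname()[1]
original.close()

# Expected case: nothing listens on the port, so the connect fails.
refused = not can_connect(port)

# Failure mode: a stand-in for another test binds the released port, so the
# "cannot connect to a removed listener" check unexpectedly connects.
other_test = listen_on(port)
reused = can_connect(port)
other_test.close()

print("connect refused after removal:", refused)
print("connect succeeds after reuse: ", reused)
```

Running the test exclusively removes the concurrent tests that could bind the port, which is why that change addresses this particular failure mode without ruling out others.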