Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tools: await-connectivity to fix CI flakes #4168

Closed
wants to merge 1 commit into from

Conversation

matzf
Copy link
Contributor

@matzf matzf commented Mar 16, 2022

It has previously been observed that the integration tests are flaky
in the supervisord-based setup unless we keep the specific startup
sequence that starts dispatcher and routers before all other services
(see commit b94c063, or also #4164).
If the integration tests start before the control services have
established segments for full connectivity, i.e. offer a least one path
to every destination, the tests will fail.
When control services start before routers, it takes longer to reach
full connectivity. The first beacon origination can happen up to 5
seconds later, as the first attempt may result in a failed service
resolution attempt that needs to time out.
The CI pipeline would sleep for a fixed duration of 10 seconds which
is just enough to reach connectivity with the ideal startup sequence,
but if the processes randomly start in a bad order, it can fail.

In the docker-based topology setup this issue does not appear, simply
because the process startup (in particular for the servers of the e2e
tests) is so much slower that there is ample time to reach connectivity.

This change replaces the fixed sleep duration with a script, based on
the control service's segments API, that explicitly waits for full
connectivity, with a timeout.
Remove the hand-optimized startup sequence for supervisor tasks from
scion.sh -- a randomly bad startup sequence can and should be tolerated.

Also, more cleanup for scion.sh: don't set up ipv6 local addresses when
running docker setup. Always perform the corresponding cleanup.


This change is Reviewable

It has previously been observed that the integration tests are flaky
in the supervisord-based setup unless we keep the specific startup
sequence that starts dispatcher and routers before all other services
(see commit b94c063, or also scionproto#4164).

If the integration tests start before the control services have
established segments for full connectivity, i.e. offer a least one path
to every destination, the tests will fail.
When control services start before routers, it takes longer to reach
full connectivity. The first beacon origination can happen up to 5
seconds later, as the first attempt may result in a failed service
resolution attempt that needs to time out.
The CI pipeline would sleep for a fixed duration of 10 seconds which
is just enough to reach connectivity with the ideal startup sequence,
but if the processes randomly start in a bad order, it can fail.

In the docker-based topology setup this issue does not appear, simply
because the process startup (in particular for the servers of the e2e
tests) is so much slower that there is ample time to reach connectivity.

This change replaces the fixed sleep duration with a script, based on
the control service's segments API, that explicitly waits for full
connectivity, with a timeout.
Remove the hand-optimized startup sequence for supervisor tasks from
scion.sh -- a randomly bad startup sequence can and should be tolerated.

Also, more cleanup for scion.sh: don't set up ipv6 local addresses when
running docker setup. Always perform the corresponding cleanup.
Copy link
Contributor

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @matzf)

a discussion (no related file):
Cool stuff 💯


Copy link
Contributor

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @matzf)

@oncilla oncilla closed this in fc662ed Mar 18, 2022
@matzf matzf deleted the cleanup-topo-run branch June 16, 2022 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants