tools: await-connectivity to fix CI flakes #4168
Closed
It has previously been observed that the integration tests are flaky
in the supervisord-based setup unless we keep the specific startup
sequence that starts dispatcher and routers before all other services
(see commit b94c063, or also #4164).
If the integration tests start before the control services have
established segments for full connectivity, i.e. offer at least one path
to every destination, the tests will fail.
When control services start before routers, it takes longer to reach
full connectivity. The first beacon origination can happen up to 5
seconds later, as the first attempt may result in a failed service
resolution attempt that needs to time out.
The CI pipeline used to sleep for a fixed duration of 10 seconds, which
is just enough to reach connectivity with the ideal startup sequence,
but if the processes happen to start in a bad order, the tests can fail.
In the docker-based topology setup this issue does not appear, simply
because the process startup (in particular for the servers of the e2e
tests) is so much slower that there is ample time to reach connectivity.
This change replaces the fixed sleep duration with a script, based on
the control service's segments API, that explicitly waits for full
connectivity, with a timeout.
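
The idea of waiting with a timeout can be sketched as follows. This is only an illustration of the polling pattern, not the script added by this change: the names `await_connectivity`, `fully_connected` and `fetch_reachable` are hypothetical, and `fetch_reachable` stands in for whatever query against the control service's segments API returns the set of destinations a given AS currently has a path to.

```python
import time
from typing import Callable, Sequence, Set


def missing_destinations(ases: Sequence[str], reachable: Set[str], local: str) -> Set[str]:
    """Return every AS other than `local` that is not yet reachable."""
    return {ia for ia in ases if ia != local} - reachable


def fully_connected(fetch_reachable: Callable[[str], Set[str]], ases: Sequence[str]) -> bool:
    """True if every AS has at least one path to every other AS."""
    for local in ases:
        try:
            # Hypothetical query; assumed to raise OSError while the
            # control service is not yet accepting connections.
            reachable = fetch_reachable(local)
        except OSError:
            return False
        if missing_destinations(ases, reachable, local):
            return False
    return True


def await_connectivity(
    fetch_reachable: Callable[[str], Set[str]],
    ases: Sequence[str],
    timeout: float = 20.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until full connectivity is reached or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if fully_connected(fetch_reachable, ases):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Unlike a fixed sleep, this returns as soon as connectivity is established and gives a clear failure when it is not, regardless of the order in which the processes started. The default timeout and interval above are placeholders, not values taken from this change.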
Remove the hand-optimized startup sequence for supervisor tasks from
scion.sh -- a randomly bad startup sequence can and should be tolerated.
Also, more cleanup for scion.sh: don't set up IPv6 local addresses when
running the docker setup. Always perform the corresponding cleanup.