tools: await-connectivity to fix CI flakes #4168

matzf · 2022-03-16T19:40:56Z

It has previously been observed that the integration tests are flaky
in the supervisord-based setup unless we keep the specific startup
sequence that starts dispatcher and routers before all other services
(see commit b94c063, or also #4164).
If the integration tests start before the control services have
established segments for full connectivity, i.e. offer at least one path
to every destination, the tests will fail.
When control services start before routers, it takes longer to reach
full connectivity. The first beacon origination can happen up to 5
seconds later, as the first attempt may result in a failed service
resolution attempt that needs to time out.
The CI pipeline would sleep for a fixed duration of 10 seconds, which
is just enough to reach connectivity with the ideal startup sequence,
but if the processes randomly start in a bad order, the tests can fail.

In the docker-based topology setup this issue does not appear, simply
because the process startup (in particular for the servers of the e2e
tests) is so much slower that there is ample time to reach connectivity.

This change replaces the fixed sleep duration with a script, based on
the control service's segments API, that explicitly waits for full
connectivity, with a timeout.
Remove the hand-optimized startup sequence for supervisor tasks from
scion.sh -- a randomly bad startup sequence can and should be tolerated.
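
For illustration, the wait-with-timeout logic described above can be sketched roughly as follows. This is not the script added in this PR: the function name, the default timeout and interval, and the `fetch_reachable` callback (standing in for a query against the control service's segments API) are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable


def await_connectivity(
    fetch_reachable: Callable[[], Iterable[str]],
    expected: Iterable[str],
    timeout_s: float = 20.0,   # assumed value, not taken from the PR
    interval_s: float = 1.0,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every expected destination is reachable, or time out.

    fetch_reachable is a hypothetical callback that returns the
    destinations for which at least one path segment is currently known.
    """
    wanted = set(expected)
    deadline = clock() + timeout_s
    while True:
        try:
            missing = wanted - set(fetch_reachable())
        except OSError:
            # The service may not be listening yet; treat as "nothing reachable".
            missing = wanted
        if not missing:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Unlike a fixed sleep, this returns as soon as connectivity is reached and only fails once the timeout has actually elapsed, so a slow startup order costs time rather than a test failure.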

Also, more cleanup for scion.sh: don't set up IPv6 local addresses when
running the docker setup. Always perform the corresponding cleanup.


oncilla

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @matzf)

a discussion (no related file):
Cool stuff 💯

oncilla

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @matzf)

oncilla approved these changes Mar 17, 2022

oncilla closed this in fc662ed Mar 18, 2022

matzf mentioned this pull request Jun 10, 2022

E2E failing links test is flaky #4215

Open

matzf deleted the cleanup-topo-run branch June 16, 2022 14:32
