This repository has been archived by the owner on Oct 23, 2022. It is now read-only.

Flaky multi-node tests #134

Closed
koivunej opened this issue Apr 2, 2020 · 5 comments
Labels: bug (Something isn't working), CI

Comments

@koivunej (Collaborator) commented Apr 2, 2020

Running the pubsub tests repeatedly like:

while target/debug/deps/pubsub-c48a8707038c37a4 publish_between_two_nodes; do echo; echo '...'; echo; done

(Your executable name will vary; run cargo test -- publish_between_two_nodes to find it.)

With all logging at trace level, this produces at least two kinds of failures. Errors seen:

  • [2020-04-02T09:41:35Z DEBUG libp2p_tcp] Dropped TCP connection to undeterminate peer when the 10s timeout started
  • [2020-04-02T09:41:08Z DEBUG libp2p_secio] error during secio handshake IoError(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }) even after sending pubsub RPCs back and forth, but before the 10s timeout started

In a highly unscientific "let's run these in multiple shells until they fail, then inspect the reason" experiment, both seem to happen about equally often. I think I'll just reference this comment in a tracking issue; this might already be handled better by the just-released libp2p 0.17. Making the test resilient to disconnects is one possibility, but I am used to running tests on Linux and Windows over loopback without much of an issue. Then again, my systems have probably never been as loaded as any public CI server.
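As a rough illustration of that resiliency option, here is a minimal sketch, not the real test code: it assumes a tokio test runtime and a hypothetical exchange_one_message helper standing in for the actual pubsub round-trip, and simply retries transient failures within the same 10s budget.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Hypothetical stand-in for the pubsub round-trip the real test performs.
async fn exchange_one_message() -> Result<(), Box<dyn std::error::Error>> {
    // ... dial the peer, subscribe, publish, await the message ...
    Ok(())
}

#[tokio::test]
async fn publish_between_two_nodes_with_retry() {
    // Keep the same overall 10s budget, but retry on transient failures
    // (e.g. a connection reset) instead of failing on the first error.
    timeout(Duration::from_secs(10), async {
        loop {
            match exchange_one_message().await {
                Ok(()) => break,
                Err(e) => {
                    eprintln!("exchange failed, retrying: {e}");
                    sleep(Duration::from_millis(250)).await;
                }
            }
        }
    })
    .await
    .expect("exchange did not succeed within 10s");
}
```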

Other tests than publish_between_two_nodes can also fail when run this way, but I didn't recover any information on those: 5 tests were executed in parallel and only one of them had logging turned on, so treat that as a small data point.

Originally posted by @koivunej in #133 (comment)

@aphelionz added the bug (Something isn't working) and CI labels on Apr 7, 2020
@koivunej (Collaborator, author) commented Jun 8, 2020

For some reason the tests have started to behave better. This might be because of libp2p 0.19 or just random luck with the other builds running on GHA.

@ljedrz (Member) commented Jul 23, 2020

Currently the exchange_block test likes to fail; I'll look into it.

@ljedrz (Member) commented Jul 23, 2020

I've found a semi-reliable way of reproducing the issue; I'm suspecting insufficient node cleanup to be the root cause.
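For illustration, a minimal sketch of the kind of explicit per-test cleanup the suspicion points at, using a hypothetical TestNode helper rather than the crate's actual API and assuming a tokio runtime: shut each test node's background task down before the test returns instead of leaving it running.

```rust
use tokio::task::JoinHandle;

// Hypothetical test helper; the real nodes are driven by the crate's own types.
struct TestNode {
    // background task driving the node's event loop
    task: JoinHandle<()>,
}

impl TestNode {
    fn spawn() -> Self {
        let task = tokio::spawn(async {
            // ... drive the swarm/event loop here ...
        });
        TestNode { task }
    }

    async fn shutdown(self) {
        // abort the background task and wait for it to finish, so its
        // resources (e.g. listening sockets) are released before the next test
        self.task.abort();
        let _ = self.task.await;
    }
}

#[tokio::test]
async fn exchange_block_with_cleanup() {
    let a = TestNode::spawn();
    let b = TestNode::spawn();
    // ... perform the block exchange between `a` and `b` ...
    a.shutdown().await;
    b.shutdown().await;
}
```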

@ljedrz (Member) commented Jul 23, 2020

Ok, so it might require a somewhat broader solution: even though my fix seems to do the trick, it sometimes causes some async deadlocking, which is not ideal either 😄. I have some ideas and will explore them shortly.

bors bot added a commit that referenced this issue Jul 30, 2020
271: Adjust subscription locking r=ljedrz a=koivunej

Reduces the number of critical sections in SubscriptionFuture from 2 to 1, helping with the hangs in exchange_block that have been recorded in #134. Also backports the `related_subs` cleanup from #264.

Co-authored-by: Joonas Koivunen <joonas@equilibrium.co>
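For context, a minimal sketch of the locking pattern that change describes, not the actual SubscriptionFuture: poll() takes the shared lock once and, under that single guard, both checks for a result and registers the waker, so no state change can slip in between the two steps.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

#[derive(Default)]
struct Shared {
    result: Option<u64>,
    waker: Option<Waker>,
}

// Simplified stand-in for the real SubscriptionFuture.
struct SubscriptionFuture {
    shared: Arc<Mutex<Shared>>,
}

impl Future for SubscriptionFuture {
    type Output = u64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // single critical section: check for a result and, if there is none
        // yet, register the waker while still holding the same lock
        let mut shared = self.shared.lock().unwrap();
        if let Some(value) = shared.result.take() {
            Poll::Ready(value)
        } else {
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}
```

The completing side would take the same lock, set `result`, and wake the stored waker, so the check and the waker registration can never interleave with a completion.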
@koivunej (Collaborator, author) commented Sep 5, 2020

These have not been flaky at all recently; not sure if that is because of the GHA architecture or #307, for example. Let's reopen if we get more of these.

@koivunej closed this as completed on Sep 5, 2020