some integration tests are broken on macos #9367

wacban · 2023-07-28T10:50:05Z

There is a number of integration tests that don't work on macOS but work just fine on Linux. This is terrible experience for macOS developers (e.g. majority of Pagoda employees) who ever need to debug either of those tests. I was caught off-guard by this when working on resharding where I actually broke the test but then spend ages trying to debug it only to find that the test is still broken after reverting all of my changes. Sure enough there are workarounds such as getting linux laptopts or vm in the cloud but as long as the default development platform is macOS I believe it should be supported.

I believe there to be two categories of root-cause:

runtime which is the most significant differentiator between platforms
network where I have no clue what is different

The ideal outcome for this issue would be fixing all of the failing tests and adding a buildkite / github action to execute them on macOS on all PRs.

The sub-optimal outcome for this issue would be disabling the failing tests so that at a minimum the developers immediately can tell that there is no point debugging on macOS.

here is a the repro command and results:

cargo nextest run -p integration-tests --no-fail-fast
     Summary [ 273.615s] 237 tests run: 227 passed (7 slow), 4 failed, 6 timed out, 88 skipped
     TIMEOUT [ 120.005s] integration-tests tests::client::features::increase_storage_compute_cost::test_non_storage_gas_exceeded
     TIMEOUT [ 120.005s] integration-tests tests::client::features::increase_storage_compute_cost::test_storage_write
     TIMEOUT [ 120.004s] integration-tests tests::client::sharding_upgrade::test_shard_layout_upgrade_cross_contract_calls
        FAIL [   4.441s] integration-tests tests::client::sharding_upgrade::test_shard_layout_upgrade_missing_chunks_high_missing_prob
     TIMEOUT [ 126.010s] integration-tests tests::client::sharding_upgrade::test_shard_layout_upgrade_missing_chunks_low_missing_prob
     TIMEOUT [ 124.002s] integration-tests tests::client::sharding_upgrade::test_shard_layout_upgrade_missing_chunks_mid_missing_prob
        FAIL [   0.428s] integration-tests tests::network::full_network::connect_on_almost_full_network
        FAIL [   0.376s] integration-tests tests::network::full_network::connect_on_full_network
        FAIL [   0.555s] integration-tests tests::network::peer_handshake::peers_connect_all
     TIMEOUT [ 120.005s] integration-tests tests::runtime::test_evil_contracts::test_evil_deep_trie

cc @nagisa for runtime and @saketh-are for network

The text was updated successfully, but these errors were encountered:

nagisa · 2023-07-31T11:20:01Z

So, without investigating too much further, the test timeouts seem hard to make resilient. Of course it would be a good idea to have tests that are nowhere near (heh) the 120 second time limit, so that even with a slower/faster runtime on a slower/faster hardware these tests succeed. It is also just plain annoying to wait on the long tail of the tests that execute for dozens of seconds each.

Though it is also possible that these tests have deadlocked or somesuch.

…t` (#9608) I got absolutely fed up with waiting for prerequisite infrastructure work for #9290 (why does it have to be _that_ hard?) However the PR in question had some other important improvements that do not necessarily *rely* on changes to said infrastructure to work. In particular: * `nextest` now retries failing tests a few times before giving up, to make sure they aren't spuriously failing; * This should help some timing out integration tests in particular, as those often fail because of some deadlockish situation in my experience. * style checks still run with `cargo nextest` – unfortunately that means that in CI this will run these checks multiple times, but that doesn’t sound particularly terrible of a tradeoff (especially if the other changes mean we won't be retrying entire test suites as often anymore;) * This should allow for a greater portion of the test suite to run on Macs – unfortunately not verified by the CI, but people do complain and this should make the situation better. cc #9367

wacban · 2024-03-11T10:05:23Z

I believe this is done now, feel free to reopen if needed.

wacban added A-network Area: Network T-core Team: issues relevant to the core team A-contract-runtime Area: contract compilation and execution, virtual machines, etc T-contract-runtime Team: issues relevant to the contract runtime team labels Jul 28, 2023

nagisa mentioned this issue Aug 11, 2023

[DX] There isn’t a single local command to ensure all CI will pass #9415

Closed

nagisa mentioned this issue Sep 28, 2023

ci: improve test resilience and run style checks within cargo nextest #9608

Merged

wacban closed this as completed Mar 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

some integration tests are broken on macos #9367

some integration tests are broken on macos #9367

wacban commented Jul 28, 2023

nagisa commented Jul 31, 2023

wacban commented Mar 11, 2024

some integration tests are broken on macos #9367

some integration tests are broken on macos #9367

Comments

wacban commented Jul 28, 2023

nagisa commented Jul 31, 2023

wacban commented Mar 11, 2024