feat: adding cross_contract_calls integration test for the V1->V2 resharding #9402

Merged

merged 15 commits into master from waclaw-resharding-test on Aug 17, 2023

Conversation

@wacban (Contributor) commented Aug 9, 2023

The goal of this PR is to refactor the existing cross_contract_calls integration test so that it can be reused for testing the new resharding, and to add that new test. I encountered numerous issues when trying to adapt the test, because enabling the new resharding also enables all of the other protocol features introduced since simple nightshade, and the test wasn't quite ready for that. I managed to fix a few issues but not all of them. I'm leaving it unfinished for now, as I suspect the remaining issue may be due to the lack of flat storage support, so it's better to wait for that. I marked the new test as #[ignore]. I would still like to merge all of the refactoring, fixes and improvements I made so far, and I will pick it up again once flat storage is supported in resharding.

Some of the new features are:

  • I refactored cross_contract_calls so that it can handle either resharding type (see the sketch after this list).
  • Clients no longer need to process every transaction themselves; they can also route transactions to the client that tracks the relevant shard. This was only strictly needed before I set all clients to track all shards, but it is correct, handles both cases, and is more future proof.
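
To make the parametrization concrete, here is a minimal hedged sketch of the idea, not the actual test code from this PR; the names mirror those visible in the review snippets (ReshardingType, get_genesis_protocol_version, test_shard_layout_upgrade_cross_contract_calls_impl), while the bodies and version numbers are placeholders.

```rust
// Hedged sketch, not the code from this PR.
#[derive(Clone, Copy, Debug)]
enum ReshardingType {
    V1, // the original simple nightshade resharding
    V2, // the new simple nightshade v2 resharding
}

fn get_genesis_protocol_version(resharding_type: &ReshardingType) -> u32 {
    // Placeholder values: the real test picks the genesis protocol version
    // relative to when the corresponding resharding feature is enabled.
    match resharding_type {
        ReshardingType::V1 => 100, // placeholder
        ReshardingType::V2 => 200, // placeholder
    }
}

fn test_shard_layout_upgrade_cross_contract_calls_impl(resharding_type: ReshardingType) {
    let genesis_protocol_version = get_genesis_protocol_version(&resharding_type);
    // ... build genesis at genesis_protocol_version, run clients, send
    // cross contract calls across the resharding boundary, assert results ...
    let _ = genesis_protocol_version;
}

fn main() {
    test_shard_layout_upgrade_cross_contract_calls_impl(ReshardingType::V1);
    test_shard_layout_upgrade_cross_contract_calls_impl(ReshardingType::V2);
}
```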

Some of the challenges were:

  • The test was failing with a missing chunk error. This was because the "Partial Encoded Chunk Forward" messages were not handled correctly in the test infra. I fixed this issue in TestEnv::process_partial_encoded_chunks. There are two parts to the fix (see the sketch after this list):
    • handle the forwards
    • keep looping until all messages are processed - this is needed because processing one message can trigger sending another message
  • The protocol version was not getting upgraded, and because of that the resharding was not getting triggered (or at least not deterministically where I wanted it to happen). This was caused by unlucky block producer assignment, where one block producer (out of 4) does not actually get to produce any blocks in the (rather short) first epoch. When that happens, voting for the new protocol version is stuck at 3/4 = 75% of votes, which is lower than the default threshold. I lowered the threshold in the test to work around that.
  • The state sync process was getting triggered and failing because of the lack of some event loop or other async framework setup. I fixed that by setting all nodes to track all shards. (This is not an issue in V0->V1 because when there is 1 shard everyone tracks it, and apparently the nodes keep tracking the new shards as well.)
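
The "keep looping" part of the chunk-forwarding fix boils down to a drain-until-quiet loop. Below is a minimal hedged sketch of that pattern, not the actual TestEnv::process_partial_encoded_chunks implementation; Message and handle_message are hypothetical stand-ins.

```rust
use std::collections::VecDeque;

// Hedged sketch of the idea behind the fix, not the real TestEnv code.
// The point is that handling one network message may enqueue more messages,
// so a single pass over the queue is not enough.
#[derive(Debug)]
enum Message {
    PartialEncodedChunk(u64),
    PartialEncodedChunkForward(u64),
}

// Hypothetical handler: receiving a chunk part may trigger forwarding it to
// other chunk-tracking validators; the forwards themselves must be handled
// too, otherwise the chunk looks missing to some clients.
fn handle_message(msg: Message) -> Vec<Message> {
    match msg {
        Message::PartialEncodedChunk(id) => vec![Message::PartialEncodedChunkForward(id)],
        Message::PartialEncodedChunkForward(_) => vec![],
    }
}

fn process_all_messages(mut queue: VecDeque<Message>) {
    // Keep looping until the queue is empty, including any follow-up
    // messages produced while handling earlier ones.
    while let Some(msg) = queue.pop_front() {
        queue.extend(handle_message(msg));
    }
}

fn main() {
    let queue = VecDeque::from([Message::PartialEncodedChunk(1), Message::PartialEncodedChunk(2)]);
    process_all_messages(queue);
}
```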

@wacban wacban marked this pull request as ready for review August 9, 2023 17:02
@wacban wacban requested a review from a team as a code owner August 9, 2023 17:02
@robin-near (Contributor)

> keep looping until all messages are processed - this is needed because processing one message can trigger sending another message

This is what the TestLoop does. It does of course require some sort of migration to make this test work in that framework.

chain/chunks/src/lib.rs
Comment on lines 155 to 158
debug!(target: "chunks", "Reconstructed and decoded chunk {}, encoded length was {}, num txs: {}, I'm {:?}", chunk_hash.0, encoded_chunk.encoded_length(), shard_chunk.transactions().len(), me);

@wacban (Contributor, Author)

This was removed by accident, I will revert it.

chain/client/src/test_utils.rs
@wacban (Contributor, Author) commented Aug 10, 2023

> keep looping until all messages are processed - this is needed because processing one message can trigger sending another message

> This is what the TestLoop does. It does of course require some sort of migration to make this test work in that framework.

Ah, I see now what you mean, there is an existing framework for doing this in tests. I will have a look but for now I would rather merge this as is and follow up with improvements later.

@@ -61,4 +61,4 @@ rusty-tags.vi
# Estimator generated files
costs-*.txt
names-to-stats.txt
data_dump_*.bin
data_dump_*.bin
Contributor

nit: we should not be changing this file

chain/chain/src/chain.rs
client.epoch_manager.get_epoch_id_from_prev_block(&head.last_block_hash).unwrap();
let block_producer =
client.epoch_manager.get_block_producer(&epoch_id, height).unwrap();
let _span = tracing::debug_span!(target: "test", "", client=?block_producer);
Contributor

nit: can we directly call .entered() here? We could also add some text to the span, or remove it altogether.
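
For readers unfamiliar with the tracing API, here is a hedged sketch of what that suggestion might look like, not the committed code; produce_block_traced is a hypothetical wrapper around the snippet above.

```rust
use tracing::debug_span;

// Hedged sketch of the suggestion (not the committed code): give the span a
// name and enter it immediately via `.entered()`, instead of holding an
// unentered, unnamed span guard.
fn produce_block_traced(block_producer: &str, height: u64) {
    let _span =
        debug_span!(target: "test", "produce_block", client = ?block_producer, height).entered();
    // ... produce the block while the span is entered ...
}
```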

chain/client/src/test_utils.rs
@walnut-the-cat (Contributor) left a comment

nit: looking at the changes, the PR includes a lot of refactoring which is not necessarily directly related to the title of the PR. It may be better to separate the refactoring part out as its own PR for better reviewability.

chain/chunks/src/lib.rs
Comment on lines -242 to -244
let block_producer_client = env.client(&block_producer);
let mut block = block_producer_client.produce_block(height).unwrap().unwrap();
set_block_protocol_version(&mut block, block_producer.clone(), protocol_version);
Contributor

just moved into a {} block to scope the block var
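
For context, a hedged sketch of the scoping pattern being described, reusing the names from the quoted diff (env, block_producer, height, protocol_version and set_block_protocol_version are assumed to exist in the surrounding test); this illustrates the pattern, not the exact committed change.

```rust
// Illustration only: the temporaries used to produce the block live inside the
// braces, and only the finished `block` escapes the scope.
let block = {
    let block_producer_client = env.client(&block_producer);
    let mut block = block_producer_client.produce_block(height).unwrap().unwrap();
    set_block_protocol_version(&mut block, block_producer.clone(), protocol_version);
    block
};
```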

@wacban (Contributor, Author) commented Aug 11, 2023

> nit: looking at the changes, the PR includes a lot of refactoring which is not necessarily directly related to the title of the PR. It may be better to separate the refactoring part out as its own PR for better reviewability.

Generally I agree, but I struggle to make that work in GitHub. I couldn't get PRs to be stacked on top of each other; if anyone knows how to do it right I'm all ears. Some of the refactoring is relevant here, like process_tx replacing some other code in a second location.

@jakmeier (Contributor)

> nit: looking at the changes, the PR includes a lot of refactoring which is not necessarily directly related to the title of the PR. It may be better to separate the refactoring part out as its own PR for better reviewability.
>
> Generally I agree, but I struggle to make that work in GitHub. I couldn't get PRs to be stacked on top of each other; if anyone knows how to do it right I'm all ears. Some of the refactoring is relevant here, like process_tx replacing some other code in a second location.

I usually just submit the refactoring as a separate PR, with a comment that it's prep for another PR. Usually I don't submit the second PR yet because it requires the changes from the first PR.

When I want to submit them both together for better context, I like to submit a dummy PR from the second branch targeting the first PR branch. This way people can review the changes of the second PR in isolation.

In any case, changes made to PR0 during review will mess things up for PR1, and after the squash-merge you will have to rebase + force-push PR1 anyway. So I usually just wait with PR1 until PR0 has landed.

@jakmeier (Contributor) left a comment

Mostly looks ok :)

Tbh, some of the refactored code could probably use more refactoring. But there could always be more refactoring. Your changes look like an overall improvement to me.

// Test cross contract calls
// This test case tests when there are missing chunks in the produced blocks
// This is to test that all the chunk management logic in sharding split is correct
fn test_shard_layout_upgrade_missing_chunks(p_missing: f64) {
init_test_logger();

let resharding_type = ReshardingType::V1;
let genesis_protocol_versino = get_genesis_protocol_version(&resharding_type);
Contributor

Suggested change:
-    let genesis_protocol_versino = get_genesis_protocol_version(&resharding_type);
+    let genesis_protocol_version = get_genesis_protocol_version(&resharding_type);

@wacban (Contributor, Author)

You never know, maybe versino is Spanish for version? ;)
Thanks, good catch, will fix soon.

chain/chunks/Cargo.toml
@@ -18,6 +18,7 @@ rand_hc.workspace = true
serde_json.workspace = true
smart-default.workspace = true
tracing.workspace = true
itertools.workspace = true
Contributor

same here: are you actually using itertools in the code?

@wacban (Contributor, Author)

Yeah I usually use collect_vec when printing debug logs. How would you feel about leaving that in?

Contributor

Generally speaking, I don't really like adding dependencies that aren't used in committed code. It's the kind of thing I personally might create a PR for, just to remove a dead dependency. So at least a comment on why it's in would be useful, to prevent that sort of engineers cancelling out each other's "improvements".

But that said, itertools wouldn't be the worst thing to have pulled in for no reason, so I won't stop you from adding it if you hate writing collect::<Vec<_>> that much :D

@wacban (Contributor, Author)

Ok, thanks, I'll leave that in and add a comment.
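
For illustration, a small self-contained sketch (not code from the PR) of the collect_vec convenience being discussed: it is shorthand for collect::<Vec<_>>(), which reads a bit nicer inside debug-log formatting.

```rust
use itertools::Itertools;

fn main() {
    let shard_ids = [0u64, 1, 2, 3];
    // itertools shorthand vs. the standard-library turbofish spelling.
    let with_itertools = shard_ids.iter().map(|id| id * 2).collect_vec();
    let with_std = shard_ids.iter().map(|id| id * 2).collect::<Vec<_>>();
    assert_eq!(with_itertools, with_std);
    println!("shard ids doubled: {:?}", with_itertools);
}
```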

Comment on lines 994 to 1002
#[cfg(feature = "protocol_feature_simple_nightshade_v2")]
// TODO(resharding) this test is currently broken, potentially due to lack of
// flat storage support. Once flat storage for resharding is fully implemented
// this test should be revisited, fixed and re-enabled.
#[ignore]
#[test]
fn test_shard_layout_upgrade_cross_contract_calls_v2() {
test_shard_layout_upgrade_cross_contract_calls_impl(ReshardingType::V2);
}
Contributor

Is there an issue you could link here where readers can figure out or ask about the progress on fixing this?

@wacban (Contributor, Author)

There are a few different issues tracking the flat storage progress but let me link the resharding tracking issue.

@near-bulldozer near-bulldozer bot merged commit 4f599fd into master Aug 17, 2023
@near-bulldozer near-bulldozer bot deleted the waclaw-resharding-test branch August 17, 2023 13:02
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
nikurt pushed a commit that referenced this pull request Aug 28, 2023