feat: adding cross_contract_calls integration test for the V1->V2 resharding #9402

Merged

merged 15 commits into master from waclaw-resharding-test on Aug 17, 2023

Conversation

@wacban (Contributor) commented Aug 9, 2023

The goal of this PR is to refactor the existing cross_contract_calls integration test so that it can be reused for testing the new resharding, and to add that new test. I encountered numerous issues when trying to adapt the test, because enabling the new resharding also enables all of the other protocol features introduced since simple nightshade, and the test wasn't quite ready for that. I managed to fix a few issues but not all of them. I'm leaving it unfinished for now, as I suspect the remaining issue may be due to the lack of flat storage support, so it's better to wait for that. I marked the new test as #[ignore]. I would still like to merge all of the refactoring, fixes and improvements I made so far, and I will pick it up again once flat storage is supported in resharding.

Some of the new features are:

  • I refactored cross_contract_calls so that it can handle either resharding type (see the sketch after this list).
  • Clients no longer need to process every transaction themselves; they can also route transactions to the client that tracks the relevant shard. This was only strictly needed before I set all clients to track all shards, but it is correct, handles both cases, and is more future proof.
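
To make the parametrization concrete, here is a minimal hedged sketch of the idea, not the actual test code from this PR; the names mirror those visible in the review snippets (ReshardingType, get_genesis_protocol_version, test_shard_layout_upgrade_cross_contract_calls_impl), while the bodies and version numbers are placeholders.

```rust
// Hedged sketch, not the code from this PR.
#[derive(Clone, Copy, Debug)]
enum ReshardingType {
    V1, // the original simple nightshade resharding
    V2, // the new simple nightshade v2 resharding
}

fn get_genesis_protocol_version(resharding_type: &ReshardingType) -> u32 {
    // Placeholder values: the real test picks the genesis protocol version
    // relative to when the corresponding resharding feature is enabled.
    match resharding_type {
        ReshardingType::V1 => 100, // placeholder
        ReshardingType::V2 => 200, // placeholder
    }
}

fn test_shard_layout_upgrade_cross_contract_calls_impl(resharding_type: ReshardingType) {
    let genesis_protocol_version = get_genesis_protocol_version(&resharding_type);
    // ... build genesis at genesis_protocol_version, run clients, send
    // cross contract calls across the resharding boundary, assert results ...
    let _ = genesis_protocol_version;
}

fn main() {
    test_shard_layout_upgrade_cross_contract_calls_impl(ReshardingType::V1);
    test_shard_layout_upgrade_cross_contract_calls_impl(ReshardingType::V2);
}
```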

Some of the challenges were:

  • The test was failing with a missing chunk error. This was because the "Partial Encoded Chunk Forward" messages were not handled correctly in the test infra. I fixed this issue in TestEnv::process_partial_encoded_chunks. There are two parts to the fix (see the sketch after this list):
    • handle the forwards
    • keep looping until all messages are processed - this is needed because processing one message can trigger sending another message
  • The protocol version was not getting upgraded, and because of that the resharding was not getting triggered (or at least not deterministically where I wanted it to happen). This was caused by unlucky block producer assignment, where one block producer (out of 4) does not actually get to produce any blocks in the (rather short) first epoch. When that happens, voting for the new protocol version is stuck at 3/4 = 75% of votes, which is lower than the default threshold. I lowered the threshold in the test to work around that.
  • The state sync process was getting triggered and failing because of the lack of some event loop or other async framework setup. I fixed that by setting all nodes to track all shards. (This is not an issue in V0->V1 because when there is 1 shard everyone tracks it, and apparently the nodes keep tracking the new shards as well.)
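
The "keep looping" part of the chunk-forwarding fix boils down to a drain-until-quiet loop. Below is a minimal hedged sketch of that pattern, not the actual TestEnv::process_partial_encoded_chunks implementation; Message and handle_message are hypothetical stand-ins.

```rust
use std::collections::VecDeque;

// Hedged sketch of the idea behind the fix, not the real TestEnv code.
// The point is that handling one network message may enqueue more messages,
// so a single pass over the queue is not enough.
#[derive(Debug)]
enum Message {
    PartialEncodedChunk(u64),
    PartialEncodedChunkForward(u64),
}

// Hypothetical handler: receiving a chunk part may trigger forwarding it to
// other chunk-tracking validators; the forwards themselves must be handled
// too, otherwise the chunk looks missing to some clients.
fn handle_message(msg: Message) -> Vec<Message> {
    match msg {
        Message::PartialEncodedChunk(id) => vec![Message::PartialEncodedChunkForward(id)],
        Message::PartialEncodedChunkForward(_) => vec![],
    }
}

fn process_all_messages(mut queue: VecDeque<Message>) {
    // Keep looping until the queue is empty, including any follow-up
    // messages produced while handling earlier ones.
    while let Some(msg) = queue.pop_front() {
        queue.extend(handle_message(msg));
    }
}

fn main() {
    let queue = VecDeque::from([Message::PartialEncodedChunk(1), Message::PartialEncodedChunk(2)]);
    process_all_messages(queue);
}
```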

@wacban wacban marked this pull request as ready for review August 9, 2023 17:02
@wacban wacban requested a review from a team as a code owner August 9, 2023 17:02
@robin-near (Contributor)

> keep looping until all messages are processed - this is needed because processing one message can trigger sending another message

This is what the TestLoop does. It does of course require some sort of migration to make this test work in that framework.

chain/chunks/src/lib.rs
Comment on lines 155 to 158
debug!(target: "chunks", "Reconstructed and decoded chunk {}, encoded length was {}, num txs: {}, I'm {:?}", chunk_hash.0, encoded_chunk.encoded_length(), shard_chunk.transactions().len(), me);

@wacban (Contributor, Author)

This was removed by accident, I will revert it.

chain/client/src/test_utils.rs
@wacban (Contributor, Author) commented Aug 10, 2023

> keep looping until all messages are processed - this is needed because processing one message can trigger sending another message

> This is what the TestLoop does. It does of course require some sort of migration to make this test work in that framework.

Ah, I see now what you mean, there is an existing framework for doing this in tests. I will have a look but for now I would rather merge this as is and follow up with improvements later.

@@ -61,4 +61,4 @@ rusty-tags.vi
# Estimator generated files
costs-*.txt
names-to-stats.txt
data_dump_*.bin
data_dump_*.bin
Contributor

nit: we should not be changing this file

chain/chain/src/chain.rs
client.epoch_manager.get_epoch_id_from_prev_block(&head.last_block_hash).unwrap();
let block_producer =
client.epoch_manager.get_block_producer(&epoch_id, height).unwrap();
let _span = tracing::debug_span!(target: "test", "", client=?block_producer);
Contributor

nit: can we directly call .entered() here? We could also add some text to the span, or remove it altogether.
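
For readers unfamiliar with the tracing API, here is a hedged sketch of what that suggestion might look like, not the committed code; produce_block_traced is a hypothetical wrapper around the snippet above.

```rust
use tracing::debug_span;

// Hedged sketch of the suggestion (not the committed code): give the span a
// name and enter it immediately via `.entered()`, instead of holding an
// unentered, unnamed span guard.
fn produce_block_traced(block_producer: &str, height: u64) {
    let _span =
        debug_span!(target: "test", "produce_block", client = ?block_producer, height).entered();
    // ... produce the block while the span is entered ...
}
```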

chain/client/src/test_utils.rs
@walnut-the-cat (Contributor) left a comment

nit: looking at the changes, the PR includes a lot of refactoring which is not necessarily directly related to the title of the PR. It may be better to separate the refactoring part out as its own PR for better reviewability.

chain/chunks/src/lib.rs
Comment on lines -242 to -244
let block_producer_client = env.client(&block_producer);
let mut block = block_producer_client.produce_block(height).unwrap().unwrap();
set_block_protocol_version(&mut block, block_producer.clone(), protocol_version);
Contributor

just moved into a {} block to scope the block var
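
For context, a hedged sketch of the scoping pattern being described, reusing the names from the quoted diff (env, block_producer, height, protocol_version and set_block_protocol_version are assumed to exist in the surrounding test); this illustrates the pattern, not the exact committed change.

```rust
// Illustration only: the temporaries used to produce the block live inside the
// braces, and only the finished `block` escapes the scope.
let block = {
    let block_producer_client = env.client(&block_producer);
    let mut block = block_producer_client.produce_block(height).unwrap().unwrap();
    set_block_protocol_version(&mut block, block_producer.clone(), protocol_version);
    block
};
```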

@wacban (Contributor, Author) commented Aug 11, 2023

> nit: looking at the changes, the PR includes a lot of refactoring which is not necessarily directly related to the title of the PR. It may be better to separate the refactoring part out as its own PR for better reviewability.

Generally I agree, but I struggle to make that work in GitHub. I couldn't get PRs to be stacked on top of each other; if anyone knows how to do it right I'm all ears. Some of the refactoring is relevant here, like process_tx replacing some other code in a second location.

@jakmeier (Contributor)

> nit: looking at the changes, the PR includes a lot of refactoring which is not necessarily directly related to the title of the PR. It may be better to separate the refactoring part out as its own PR for better reviewability.
>
> Generally I agree, but I struggle to make that work in GitHub. I couldn't get PRs to be stacked on top of each other; if anyone knows how to do it right I'm all ears. Some of the refactoring is relevant here, like process_tx replacing some other code in a second location.

I usually just submit the refactoring as a separate PR, with a comment that it's prep for another PR. Usually I don't submit the second PR yet because it requires the changes from the first PR.

When I want to submit them both together for better context, I like to submit a dummy PR from the second branch targeting the first PR branch. This way people can review the changes of the second PR in isolation.

In any case, changes made to PR0 during review will mess things up for PR1, and after the squash-merge you will have to rebase + force-push PR1 anyway. So I usually just wait with PR1 until PR0 has landed.

@jakmeier (Contributor) left a comment

Mostly looks ok :)

Tbh, some of the refactored code could probably use more refactoring. But there could always be more refactoring. Your changes look like an overall improvement to me.

// Test cross contract calls
// This test case tests when there are missing chunks in the produced blocks
// This is to test that all the chunk management logic in sharding split is correct
fn test_shard_layout_upgrade_missing_chunks(p_missing: f64) {
init_test_logger();

let resharding_type = ReshardingType::V1;
let genesis_protocol_versino = get_genesis_protocol_version(&resharding_type);
Contributor

Suggested change:
-    let genesis_protocol_versino = get_genesis_protocol_version(&resharding_type);
+    let genesis_protocol_version = get_genesis_protocol_version(&resharding_type);

@wacban (Contributor, Author)

You never know, maybe versino is Spanish for version? ;)
Thanks, good catch, will fix soon.

chain/chunks/Cargo.toml
@@ -18,6 +18,7 @@ rand_hc.workspace = true
serde_json.workspace = true
smart-default.workspace = true
tracing.workspace = true
itertools.workspace = true
Contributor

same here: are you actually using itertools in the code?

@wacban (Contributor, Author)

Yeah I usually use collect_vec when printing debug logs. How would you feel about leaving that in?

Contributor

Generally speaking, I don't really like adding dependencies that aren't used in committed code. It's the kind of thing I personally might create a PR for, just to remove a dead dependency. So at least a comment on why it's in would be useful, to prevent that sort of engineers cancelling out each other's "improvements".

But that said, itertools wouldn't be the worst thing to have pulled in for no reason, so I won't stop you from adding it if you hate writing collect::<Vec<_>> that much :D

@wacban (Contributor, Author)

Ok, thanks, I'll leave that in and add a comment.
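
For illustration, a small self-contained sketch (not code from the PR) of the collect_vec convenience being discussed: it is shorthand for collect::<Vec<_>>(), which reads a bit nicer inside debug-log formatting.

```rust
use itertools::Itertools;

fn main() {
    let shard_ids = [0u64, 1, 2, 3];
    // itertools shorthand vs. the standard-library turbofish spelling.
    let with_itertools = shard_ids.iter().map(|id| id * 2).collect_vec();
    let with_std = shard_ids.iter().map(|id| id * 2).collect::<Vec<_>>();
    assert_eq!(with_itertools, with_std);
    println!("shard ids doubled: {:?}", with_itertools);
}
```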

Comment on lines 994 to 1002
#[cfg(feature = "protocol_feature_simple_nightshade_v2")]
// TODO(resharding) this test is currently broken, potentially due to lack of
// flat storage support. Once flat storage for resharding is fully implemented
// this test should be revisited, fixed and re-enabled.
#[ignore]
#[test]
fn test_shard_layout_upgrade_cross_contract_calls_v2() {
test_shard_layout_upgrade_cross_contract_calls_impl(ReshardingType::V2);
}
Contributor

Is there an issue you could link here where readers can figure out or ask about the progress on fixing this?

@wacban (Contributor, Author)

There are a few different issues tracking the flat storage progress but let me link the resharding tracking issue.

@near-bulldozer near-bulldozer bot merged commit 4f599fd into master Aug 17, 2023
@near-bulldozer near-bulldozer bot deleted the waclaw-resharding-test branch August 17, 2023 13:02
nikurt pushed a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
nikurt pushed a commit that referenced this pull request Aug 28, 2023