Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1757] feat(server): Add block number check on getting shuffle result #1758

Merged
merged 9 commits into from
Jun 3, 2024

Conversation

zuston
Copy link
Member

@zuston zuston commented May 30, 2024

What changes were proposed in this pull request?

Add block number check on getting shuffle result

Why are the changes needed?

Data validation, ensure data stable and correct

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Copy link

github-actions bot commented May 30, 2024

Test Results

 2 433 files  +8   2 433 suites  +8   5h 0m 9s ⏱️ + 3m 32s
   934 tests ±0     933 ✅ +1   1 💤 ±0  0 ❌ ±0 
10 828 runs  +8  10 814 ✅ +9  14 💤 ±0  0 ❌ ±0 

Results for commit 1c7908e. ± Comparison against base commit d182a03.

♻️ This comment has been updated with latest results.

@@ -198,6 +201,25 @@ public Set<Integer> getShuffleIds() {
return partitionDataSizes.keySet();
}

public void incrBlockNumber(int shuffleId, int partitionId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incrBlockNumber -> incBlockNumber

This should be a better naming.

Copy link
Contributor

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which scenario do you expect the block number to disalign? What are you trying to guard against here? One side we add block ids and on the other side we retrieve block ids. You suspect incorrect usage of the API by the client or a bug server side? Sounds like that should be guarded against by testing so we are confident, not in production runtime.

@zuston
Copy link
Member Author

zuston commented May 30, 2024

In which scenario do you expect the block number to disalign? What are you trying to guard against here? One side we add block ids and on the other side we retrieve block ids. You suspect incorrect usage of the API by the client or a bug server side? Sounds like that should be guarded against by testing so we are confident, not in production runtime.

We suspect the race condition may exist on the process of reporting shuffle result, which may cause the blockId miss. But this is not determinzed, and so I want to introduce the extra blockId counter check for every partition. Anyway, fast fail is better than the silcent incorrect succeed.

@jerqi
Copy link
Contributor

jerqi commented May 30, 2024

Maybe addFinishedBlockIds can return int. We can add some logs if we receive some duplicated blocks.

@zuston
Copy link
Member Author

zuston commented May 30, 2024

Maybe addFinishedBlockIds can return int. We can add some logs if we receive some duplicated blocks.

Yes. Let me optimize this part tomorrow.

@@ -506,14 +506,24 @@ public void reportShuffleResult(
"appId[" + appId + "], shuffleId[" + shuffleId + "], taskAttemptId[" + taskAttemptId + "]";

try {
int expected =
partitionToBlockIds.values().stream().map(x -> x.length).reduce(0, (a, b) -> a + b);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more readable:
int expected = partitionToBlockIds.values().stream().mapToInt(x -> x.length).sum();

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -506,14 +506,24 @@ public void reportShuffleResult(
"appId[" + appId + "], shuffleId[" + shuffleId + "], taskAttemptId[" + taskAttemptId + "]";

try {
int expected =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe expectedNum and updatedNum?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needn't. The type has indicated everything.

shuffleServer
.getShuffleTaskManager()
.addFinishedBlockIds(appId, shuffleId, partitionToBlockIds, bitmapNum);
"Accepted {} blockIds report of {} partitions as shuffle result for the task of {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the log could be more readable:

LOG.info(
          "Accepted blockIds report for {} blocks across {} partitions as shuffle result for task {}",
          expectedNum,
          partitionToBlockIds.size(),
          request);  

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

expected,
partitionToBlockIds.size(),
request);
int updated =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated -> updatedBlockCount

shuffleServer
.getShuffleTaskManager()
.addFinishedBlockIds(appId, shuffleId, partitionToBlockIds, bitmapNum);
if (expected != updated) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected -> expectedBlockCount

bitmap.addLong(blockId);
if (!bitmap.contains(blockId)) {
bitmap.addLong(blockId);
updated++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to @jerqi 's suggestion, you may also rename this:
updatedBlockCount
totalUpdatedBlockCount

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@zuston
Copy link
Member Author

zuston commented Jun 3, 2024

Thanks everyone for your review. Merged.

@zuston zuston merged commit fd64d9d into apache:master Jun 3, 2024
41 checks passed
zuston added a commit to zuston/incubator-uniffle that referenced this pull request Jun 3, 2024
Copy link
Member

@maobaolong maobaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zuston I have couple of question inline, would you please take a look?

shuffleServer
.getShuffleTaskManager()
.addFinishedBlockIds(appId, shuffleId, partitionToBlockIds, bitmapNum);
"Accepted blockIds report for {} blocks across {} partitions as shuffle result for task {}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log is verbose and fill server log, is there a better way to reduce it or do you think DEBUG level can be better?

.addFinishedBlockIds(appId, shuffleId, partitionToBlockIds, bitmapNum);
if (expectedBlockCount != updatedBlockCount) {
LOG.warn(
"Existing {} duplicated blockIds on blockId report for appId: {}, shuffleId: {}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we encountered this warn level log these days, it make us stressed.

Is there indicate something wrong? Can you explain what rare condition scenario can make updatedBlockCount less than expectedBlockCount?

Why there are duplicated blockIds on blockId report?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants