Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix response_mask index #60

Merged
merged 1 commit into from
Dec 22, 2024
Merged

fix response_mask index #60

merged 1 commit into from
Dec 22, 2024

Conversation

huiyeruzhou
Copy link
Contributor

Issue Overview

Hello, and thank you for your hard work on this project! While using verl, I noticed a calculation error in the compute_data_metric function within ray_trainer. Specifically, the function uses a 2D int tensor instead of the expected bool tensor for indexing. This discrepancy, based on PyTorch's advanced indexing mechanism, results in a 3D array rather than the expected 1D array.

Problem Details

The core issue lies in the behavior of PyTorch advanced indexing when using non-boolean tensors:

  1. Incorrect Mask Behavior:

    • Advanced indexing treats the provided 2D tensor as integer-based indices instead of a mask.
    • As a result, indexing operates row-by-row, failing to correctly mask elements in the same row that are not part of the desired response.
  2. Excessive Memory Usage:

    • The resulting data shape unexpectedly inflates from batch_size * sequence_length to batch_size * sequence_length * sequence_length.

    • For large sequence lengths (e.g., 16k), this causes the program to exceed available memory (CPU OOM), leading to worker termination. The system may output the following cryptic error:

      A worker died or was killed while executing a task by an unexpected system error.
      To troubleshoot the problem, check the logs for the dead worker...

Minimal Reproduction Code

Below is a minimal code example that demonstrates the issue:

import torch

# 定义一个二维张量
x = torch.tensor([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])

# 定义一个二维掩码
mask = torch.tensor([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1]
])

print("原始张量:")
print(x)
print("掩码:")
print(mask)
print("布尔索引结果:")
print(x[mask.bool()])
# tensor([10, 30, 50, 70, 80, 90])
print("直接索引结果:")
print(x[mask])
# tensor([[[40, 50, 60],
#          [10, 20, 30],
#          [40, 50, 60]],

#         [[10, 20, 30],
#          [40, 50, 60],
#          [10, 20, 30]],

#         [[40, 50, 60],
#          [40, 50, 60],
#          [40, 50, 60]]])

Proposed Fix

To resolve the issue, simply add .bool() to convert response_mask into mask tensor as expected.

@PeterSH6
Copy link
Collaborator

Hi @huiyeruzhou, thanks for your contribution!

Your fix for the metrics looks good to me.

@PeterSH6
Copy link
Collaborator

It seems that the ci has some problems. I'll merge this branch after this PR #59 is merged. Sorry about that.

@PeterSH6 PeterSH6 merged commit d46eb7d into volcengine:main Dec 22, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants