fix response_mask index #60

huiyeruzhou · 2024-12-20T10:30:29Z

Issue Overview

Hello, and thank you for your hard work on this project! While using verl, I noticed a calculation error in the compute_data_metric function within ray_trainer. Specifically, the function indexes with a 2D int tensor instead of the expected bool tensor. Under PyTorch's advanced indexing rules, an int tensor is interpreted as indices rather than as a mask, so the result is a 3D tensor rather than the expected 1D tensor.

Problem Details

The core issue lies in the behavior of PyTorch advanced indexing when using non-boolean tensors:

Incorrect Mask Behavior:
- Advanced indexing treats the provided 2D int tensor as integer indices instead of a mask.
- Each mask entry (0 or 1) therefore selects an entire row of the indexed tensor (row 0 or row 1), so the elements outside the desired response are never masked out.

Excessive Memory Usage:
- The resulting data shape unexpectedly inflates from batch_size * sequence_length to batch_size * sequence_length * sequence_length.
- For large sequence lengths (e.g., 16k), this causes the program to exceed available memory (CPU OOM), leading to worker termination. The system may output the following cryptic error:
```
A worker died or was killed while executing a task by an unexpected system error.
To troubleshoot the problem, check the logs for the dead worker...
```
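To give a rough sense of the scale involved, the sketch below estimates the size of the indexed result. The batch size of 8 and the 4 bytes per element are assumptions chosen for illustration, not values taken from the trainer, and `masked_index_sizes` is a hypothetical helper.

```python
def masked_index_sizes(batch_size: int, seq_len: int, bytes_per_element: int = 4):
    """Return (source_bytes, inflated_bytes) for a (batch_size, seq_len) tensor.

    source_bytes:   size of the tensor being indexed, which is also the upper
                    bound on the result of a correct boolean mask.
    inflated_bytes: size of the (batch_size, seq_len, seq_len) result produced
                    when a same-shaped int tensor is used as the index.
    """
    source_elements = batch_size * seq_len
    inflated_elements = batch_size * seq_len * seq_len
    return source_elements * bytes_per_element, inflated_elements * bytes_per_element


# Assumed values: batch of 8 sequences, 16k tokens each, 4-byte elements.
source_bytes, inflated_bytes = masked_index_sizes(batch_size=8, seq_len=16 * 1024)
print(f"source tensor:      {source_bytes / 2**20:.1f} MiB")
print(f"int-indexed result: {inflated_bytes / 2**30:.1f} GiB")
```

Under these assumptions a single metric tensor grows from 0.5 MiB to 8 GiB, and the metrics function indexes several tensors this way, which is consistent with the worker being killed for running out of memory.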

Minimal Reproduction Code

Below is a minimal code example that demonstrates the issue:

```python
import torch

# Define a 2D tensor
x = torch.tensor([
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90]
])

# Define a 2D mask (int64 dtype, not bool)
mask = torch.tensor([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1]
])

print("Original tensor:")
print(x)
print("Mask:")
print(mask)
print("Boolean indexing result:")
print(x[mask.bool()])
# tensor([10, 30, 50, 70, 80, 90])
print("Direct (integer) indexing result:")
print(x[mask])
# tensor([[[40, 50, 60],
#          [10, 20, 30],
#          [40, 50, 60]],

#         [[10, 20, 30],
#          [40, 50, 60],
#          [10, 20, 30]],

#         [[40, 50, 60],
#          [40, 50, 60],
#          [40, 50, 60]]])
```

Proposed Fix

To resolve the issue, add .bool() to response_mask before indexing, so that PyTorch treats it as a boolean mask as intended.
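As a minimal sketch of the pattern (not the actual verl code), the snippet below applies the conversion inside a hypothetical helper named `masked_mean`. The tensor values are made up for illustration.

```python
import torch


def masked_mean(values: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Mean of `values` over positions where `response_mask` is nonzero.

    Hypothetical helper for illustration; not part of verl.
    """
    # .bool() makes PyTorch interpret the tensor as a mask, so the selection
    # is a 1D tensor holding only the masked-in elements.
    selected = values[response_mask.bool()]
    return selected.float().mean()


values = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
response_mask = torch.tensor([[1, 1, 0],
                              [0, 1, 1]])  # int64, as in the report above

selected = values[response_mask.bool()]   # 1D: the four masked-in elements
unmasked = values[response_mask]          # 3D: rows gathered by integer index
result = masked_mean(values, response_mask)

print(selected.shape, unmasked.shape, result)
```

Converting at the point of indexing keeps the fix local. An alternative would be to create the mask as a bool tensor upstream, but that could affect other code that uses it arithmetically.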

PeterSH6 · 2024-12-21T03:48:36Z

Hi @huiyeruzhou, thanks for your contribution!

Your fix for the metrics looks good to me.

PeterSH6 · 2024-12-21T05:22:56Z

It seems that the CI has some problems. I'll merge this branch after PR #59 is merged. Sorry about that.

fix response_mask index

05b688a

PeterSH6 merged commit d46eb7d into volcengine:main Dec 22, 2024
1 of 2 checks passed
