
[PyTorch] Add context parallel support for packed dataset in THD format #9540

Closed
wants to merge 7 commits

Conversation

@tomlifu (Contributor) commented Jun 25, 2024

What does this PR do?

This PR adds context parallel (CP) support for packed datasets in THD format in NeMo, in response to this TE PR: NVIDIA/TransformerEngine#641. Currently, the TE PR requires that each individual sequence length be divisible by 2 * context_parallel_size.

Changes

  • Add support for splitting a packed dataset across different CP ranks in a load-balanced way
  • Add the necessary padding during the packing stage so that each individual sequence length is a multiple of 2 * cp_size (a sketch follows below)
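
A minimal sketch of the padding idea in the second bullet, using a hypothetical pad_to_cp_multiple helper (the PR's actual packing code may differ in names and details):

# Hypothetical illustration of the per-sequence padding described above:
# each sequence is padded so its length is a multiple of 2 * cp_size,
# which the TE THD path requires for the context-parallel split.
def pad_to_cp_multiple(input_ids, pad_id, cp_size):
    multiple = 2 * cp_size
    padded_len = (len(input_ids) + multiple - 1) // multiple * multiple
    return input_ids + [pad_id] * (padded_len - len(input_ids))

# Example: with cp_size=2, a length-5 sequence is padded to length 8.
print(pad_to_cp_multiple([1, 2, 3, 4, 5], pad_id=0, cp_size=2))  # [1, 2, 3, 4, 5, 0, 0, 0]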

PR Type:

  • New Feature
  • Bugfix
  • Documentation

@github-actions github-actions bot added the NLP label Jun 25, 2024
@tomlifu tomlifu changed the title [PyTorch] Add context parallel support for packed dataset in THD format [Draft][PyTorch] Add context parallel support for packed dataset in THD format Jun 26, 2024
@xrennvidia xrennvidia self-requested a review June 29, 2024 02:25
@xrennvidia (Collaborator)

Your code indentation is not consistent: some places use 4 spaces and others use 2 spaces.
NeMo code always uses 4-space indentation, so please make sure all of your code follows that.

@tomlifu tomlifu changed the title [Draft][PyTorch] Add context parallel support for packed dataset in THD format [PyTorch] Add context parallel support for packed dataset in THD format Jul 9, 2024
tomlifu and others added 2 commits July 9, 2024 14:28
Signed-off-by: tomlifu <tomlifu@users.noreply.github.com>
@@ -17,6 +17,7 @@
from typing import TYPE_CHECKING, Tuple

import numpy as np
import torch

Check notice (Code scanning / CodeQL): Unused import

Import of 'torch' is not used.
Signed-off-by: tomlifu <tomzhanglf@gmail.com>
@xrennvidia (Collaborator)

Thanks for fixing the comments, it looks much better now.
The total sequence length (the size of t in THD format) is a constant, right? If so, we should have some padded tokens at the end. How are those padded tokens split across the different CP ranks?

# Round each sequence length up to the nearest multiple of pad_seq_length_to_mult,
# then pre-pad every entry to that length.
ceil_to_nearest = lambda n, m: (n + m - 1) // m * m
for data in dataset:
    max_length = min(max_seq_length, ceil_to_nearest(len(data['input_ids']), pad_seq_length_to_mult))
    pre_pad_dataset(data, max_length, pad_id)
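
For context, a rough sketch of what a pre-padding helper like pre_pad_dataset might do; the PR's actual function may handle more fields or behave differently. In particular, zeroing loss_mask on padded positions is an assumption here, not something confirmed in this thread:

# Hypothetical sketch only; the real pre_pad_dataset in this PR may differ.
def pre_pad_dataset_sketch(data, max_length, pad_id):
    pad_len = max_length - len(data['input_ids'])
    if pad_len > 0:
        data['input_ids'] = list(data['input_ids']) + [pad_id] * pad_len
        # Assumption: padded positions should not contribute to the loss.
        if 'loss_mask' in data:
            data['loss_mask'] = list(data['loss_mask']) + [0] * pad_len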
Collaborator:

How is the loss_mask handled for padded tokens?

# Each CP rank holds 1/cp_size of every (padded) sequence, so the packed
# sequence boundaries scale down by the same factor.
cu_seqlens = cu_seqlens // cp_size
forward_args = {
    'input_ids': batch['tokens'],
    'position_ids': batch['position_ids'],
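
As a small worked example of the cu_seqlens // cp_size line above (the numbers are illustrative): because every padded sequence length is a multiple of 2 * cp_size, each CP rank holds exactly 1/cp_size of every sequence, so the cumulative boundaries divide evenly.

import torch

cu_seqlens = torch.tensor([0, 8, 20, 32])  # packed sequence boundaries on the full batch
cp_size = 2
print(cu_seqlens // cp_size)               # tensor([ 0,  4, 10, 16]) -- per-rank boundaries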
Collaborator:

Does position_ids mean the token_id in the packed sequence? How is this argument used in the training fwd and bwd?

Contributor (author):

The position_ids is the position of each token within its sequence (e.g. [0, 1, 2, ..., seq_len-1]). In a packed sequence, we have a list of position_ids, since the packed sequence is composed of many individual sequences. I'm not sure if that's what you mean by token_id. It's used the same way as input_ids in the training fwd and bwd.
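
A small sketch of position_ids for a packed sequence as described above, using a hypothetical helper name: positions restart at 0 for each constituent sequence.

# Positions restart at 0 for every sequence packed into the batch.
def packed_position_ids(seq_lens):
    return [pos for seq_len in seq_lens for pos in range(seq_len)]

# Two sequences of lengths 4 and 3 packed together:
print(packed_position_ids([4, 3]))  # [0, 1, 2, 3, 0, 1, 2]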

github-actions bot (Contributor):

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Jul 25, 2024
@github-actions github-actions bot removed the stale label Jul 26, 2024
github-actions bot (Contributor) commented Aug 9, 2024

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Aug 9, 2024
github-actions bot (Contributor):

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Aug 17, 2024
@youth123 commented Oct 12, 2024

Hello, I have a question about CP requiring the sequence length to be divisible by world_size * 2. I see that you are padding the data in the code, but this will cause the pad token ids to enter the flash attention calculation. I am not sure whether this is correct.
