Fix data preprocessing #58

Draft
wants to merge 1 commit into
base: rocm_dev

nsakkine

Potential bug in tools/preprocess_data.py.

Bookcorpus preprocessing fails under certain conditions when using the rocm/pytorch-training:v25.2 image with Megatron-LM installed via pip install . from the latest ROCm/Megatron-LM rocm_dev branch. With

/workspace/Megatron-LM# ls bookcorpus
bookcorpus_megatron.json  tokenizer.model

and running

/workspace/Megatron-LM# python3 tools/preprocess_data.py --input bookcorpus/bookcorpus_megatron.json  --tokenizer-type GPTSentencePieceTokenizer --tokenizer-model bookcorpus/tokenizer.model --output-prefix bookcorpus/bookcorpus --workers `nproc` --split-sentences --partitions 1

executes only sentence splitting, whereas the same command with --partitions 2 runs both sentence splitting and tokenization. Running the command above with --partitions 1 twice should also work as a workaround, because the sentence-split data is already available on the second run.

The proposed fix is to remove two lines in tools/preprocess_data.py.
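
The reported behavior is consistent with an early exit that is taken only in the single-partition case right after sentence splitting. The snippet below is a toy model of that hypothesis, not code copied from tools/preprocess_data.py: the function `run_preprocess` and its parameters `split_file_exists` and `early_return_bug` are invented for illustration, and the actual cause may differ.

```python
# Toy model of the behavior described in this PR; NOT the real tools/preprocess_data.py.
# All names here (run_preprocess, split_file_exists, early_return_bug) are hypothetical.

def run_preprocess(partitions, split_file_exists, early_return_bug=True):
    """Return the list of stages that would run in one invocation."""
    stages = []
    if not split_file_exists:
        stages.append("split_sentences")
        # Assumed cause: an early exit taken only when there is a single partition.
        if early_return_bug and partitions == 1:
            return stages
    stages.append("tokenize")
    return stages


first = run_preprocess(partitions=1, split_file_exists=False)
second = run_preprocess(partitions=1, split_file_exists=True)
two_parts = run_preprocess(partitions=2, split_file_exists=False)
fixed = run_preprocess(partitions=1, split_file_exists=False, early_return_bug=False)

print(first)      # ['split_sentences']
print(second)     # ['tokenize']
print(two_parts)  # ['split_sentences', 'tokenize']
print(fixed)      # ['split_sentences', 'tokenize']
```

Under this assumption, the first run with one partition stops after splitting, a second run finds the split data and tokenizes, and dropping the early exit makes a single run do both.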

@nsakkine nsakkine added the bug Something isn't working label Feb 11, 2025
@nsakkine nsakkine self-assigned this Feb 11, 2025
@nsakkine
Author

@aivanni @gurpreet-dhami @wenchenvincent could you confirm whether this is a bug?
