Fix data preprocessing #58

nsakkine · 2025-02-11T15:14:46Z

Potential bug in tools/preprocess_data.py.

Bookcorpus preprocessing fails under certain conditions when using the rocm/pytorch-training:v25.2 image and installing Megatron-LM with pip install . from the latest ROCm/Megatron-LM rocm_dev branch. With

/workspace/Megatron-LM# ls bookcorpus
bookcorpus_megatron.json  tokenizer.model

and running

/workspace/Megatron-LM# python3 tools/preprocess_data.py --input bookcorpus/bookcorpus_megatron.json  --tokenizer-type GPTSentencePieceTokenizer --tokenizer-model bookcorpus/tokenizer.model --output-prefix bookcorpus/bookcorpus --workers `nproc` --split-sentences --partitions 1

only executes sentence splitting, whereas the same command with --partitions 2 runs both sentence splitting and tokenization. Running the above command with --partitions 1 twice should also work around the problem, because the sentence-split data is already available on the second run.

The proposed solution is to remove 2 lines in tools/preprocess_data.py.
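The reported behaviour is consistent with an early return that fires right after sentence splitting when only one partition is used. The snippet below is a toy model of that control flow, not the actual contents of tools/preprocess_data.py: the function name `run_preprocessing`, its parameters, and the exact condition guarding the return are all assumptions made for illustration.

```python
from typing import List


def run_preprocessing(partitions: int,
                      split_sentences: bool,
                      split_file_present: bool,
                      early_return: bool) -> List[str]:
    """Toy model of the stage ordering; returns the stages that ran.

    All names here are hypothetical. `early_return=True` models the
    script before the fix, `early_return=False` models it after the
    2 lines are removed.
    """
    stages: List[str] = []

    # Sentence splitting runs only when requested and when no
    # sentence-split file from an earlier run is already present.
    if split_sentences and not split_file_present:
        stages.append("split_sentences")
        # Assumed shape of the 2 removed lines: stop here when
        # there is a single partition.
        if early_return and partitions == 1:
            return stages

    # Tokenization is reached only if the function did not return above.
    stages.append("tokenize")
    return stages


if __name__ == "__main__":
    # First run with --partitions 1: stops after sentence splitting.
    print(run_preprocessing(1, True, False, early_return=True))
    # Same command with --partitions 2: both stages run.
    print(run_preprocessing(2, True, False, early_return=True))
    # Second run with --partitions 1: split data exists, so it tokenizes.
    print(run_preprocessing(1, True, True, early_return=True))
    # With the early return removed, one run does both stages.
    print(run_preprocessing(1, True, False, early_return=False))
```

Under these assumptions the model reproduces all three observations from the description (single partition stops early, two partitions complete, a second single-partition run completes), and removing the early return lets a single run with --partitions 1 reach tokenization.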


nsakkine · 2025-02-11T15:19:45Z

@aivanni @gurpreet-dhami @wenchenvincent, could you please confirm whether this is a bug?

Removed 2 lines which led to 'tools/preprocess_data.py' to return too early

f4cd3fe

nsakkine added the bug Something isn't working label Feb 11, 2025

nsakkine requested review from aivanni, wenchenvincent and gurpreet-dhami February 11, 2025 15:14

nsakkine self-assigned this Feb 11, 2025
