
rocm docker and test scripts #5

Merged: 11 commits merged into rocm_dev from rocm_megatron_lm_upstream_rocm_docker on Nov 8, 2024

Conversation

gurpreet-dhami (Collaborator)

No description provided.

@gurpreet-dhami gurpreet-dhami force-pushed the rocm_megatron_lm_upstream_rocm_docker branch from 26f78e6 to c4f0b5f Compare October 8, 2024 17:10
@gurpreet-dhami gurpreet-dhami force-pushed the rocm_megatron_lm_upstream_rocm_docker branch from c4f0b5f to aeed2cd Compare October 8, 2024 21:04
Dockerfile_amd Outdated
##############################################################################
# Apex
##############################################################################
#RUN git clone https://github.com/ROCm/apex.git ${STAGE_DIR}/apex
Collaborator:

Remove this code if it's not needed.

train_llama.sh Outdated
TE_FP16="${TE_FP16:-1}"


export CUDA_DEVICE_MAX_CONNECTIONS=1
Collaborator:

Do we need this one on ROCm?

train_llama.sh Outdated
wget -O $TOKENIZER_MODEL https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
fi

# Prepare the dataset
Collaborator:

Separate the dataset preparation out into a different script, so that we don't need to download it every time or make changes to this script; we can then use this script as is, without any modifications.
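A minimal sketch of what such a split could look like. The script name `prepare_dataset.sh` and the `DATA_DIR` layout are assumptions for illustration, not from this PR; the tokenizer URL is the one already used in train_llama.sh.

```shell
#!/bin/sh
# prepare_dataset.sh (hypothetical name): one-time dataset/tokenizer setup,
# so the training script can assume the files already exist.
set -eu

DATA_DIR="${DATA_DIR:-./data}"            # assumed layout, not from the PR
TOKENIZER_MODEL="${DATA_DIR}/tokenizer.model"

mkdir -p "$DATA_DIR"

# Download only when the file is missing, so rerunning this script is a no-op.
if [ ! -f "$TOKENIZER_MODEL" ]; then
    # The real script would run:
    # wget -O "$TOKENIZER_MODEL" https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
    : > "$TOKENIZER_MODEL"                # placeholder download for illustration
fi

echo "tokenizer at $TOKENIZER_MODEL"
```

With this split, train_llama.sh only needs to fail fast if the expected files are absent, and the download logic never has to change when the training options do.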

Collaborator:

It seems that once the dataset is downloaded, it will not be downloaded again when we rerun the script.

@gurpreet-dhami gurpreet-dhami force-pushed the rocm_megatron_lm_upstream_rocm_docker branch from bdf4c6b to 6dbde6a Compare October 11, 2024 16:32
Dockerfile_amd Outdated
@@ -0,0 +1,84 @@
ARG BASE_DOCKER=rocm/pytorch:latest
#ARG BASE_DOCKER=rocm/pytorch-private:exec_dashboard_nightly
Collaborator:

Remove the commented-out lines.

Dockerfile_amd Outdated
WORKDIR $WORKSPACE_DIR
RUN git clone https://github.com/ROCm/Megatron-LM.git Megatron-LM &&\
cd Megatron-LM &&\
git checkout rocm_megatron_lm_upstream &&\
Collaborator:

Now we will use rocm_dev as the main branch.

train_llama.sh Outdated
SEQ_PARALLEL="${SEQ_PARALLEL:-1}"
CONTI_PARAMS="${CONTI_PARAMS:-0}"
OPTIMIZER="${OPTIMIZER:-sgd}"
TE_FP16="${TE_FP16:-1}"
Collaborator:

The name TE_FP16 is confusing because I think it is actually using bf16.
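One way to remove that ambiguity would be an explicit precision string instead of a boolean. This is only a sketch: `TE_PRECISION` and `PRECISION_ARGS` are hypothetical names, though `--bf16`/`--fp16` are the standard Megatron-LM precision flags.

```shell
# Replace the boolean TE_FP16 with an explicit precision string, so the
# default can honestly say bf16 rather than implying fp16.
TE_PRECISION="${TE_PRECISION:-bf16}"      # bf16 | fp16 | fp32

case "$TE_PRECISION" in
    bf16) PRECISION_ARGS="--bf16" ;;
    fp16) PRECISION_ARGS="--fp16" ;;
    fp32) PRECISION_ARGS="" ;;
    *) echo "unknown TE_PRECISION: $TE_PRECISION" >&2; exit 1 ;;
esac

echo "precision args: $PRECISION_ARGS"
```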

train_llama.sh Outdated


# Change for multinode config
MASTER_ADDR=localhost
Collaborator:

Do we intend to use this script for single-node runs only, or also for multi-node? If the latter, the multi-node options should also be configurable from the command line.
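A sketch of how the script could support both cases: default to single-node values, but let a launcher override them from the environment. The variable names follow common torchrun conventions and the port number is a placeholder, not taken from this PR.

```shell
# Single-node defaults that a multi-node launcher can override, e.g.
#   MASTER_ADDR=node0 NNODES=2 NODE_RANK=1 bash train_llama.sh
MASTER_ADDR="${MASTER_ADDR:-localhost}"
MASTER_PORT="${MASTER_PORT:-6000}"        # placeholder port
NNODES="${NNODES:-1}"
NODE_RANK="${NODE_RANK:-0}"

echo "rank ${NODE_RANK}/${NNODES} -> ${MASTER_ADDR}:${MASTER_PORT}"
```

Run with no environment overrides, this behaves exactly like the current hard-coded single-node setup.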

@lcskrishna (Collaborator)

@gurpreet-dhami Put these scripts under the examples/llama folder, similar to how other workloads are arranged, create a README.md describing how to create the dataset, and add the script there.

@gurpreet-dhami gurpreet-dhami force-pushed the rocm_megatron_lm_upstream_rocm_docker branch 2 times, most recently from ec8bfa6 to 4ec12e4 Compare November 8, 2024 04:26
@gurpreet-dhami gurpreet-dhami force-pushed the rocm_megatron_lm_upstream_rocm_docker branch from 4ec12e4 to 77113cc Compare November 8, 2024 04:27
--eval-iters -1
"

# --save-interval $TOTAL_ITERS \
Collaborator:

Remove the commented lines.

--no-masked-softmax-fusion \
--overlap-grad-reduce \
"
# --no-masked-softmax-fusion \
Collaborator:

Remove the commented lines.


MEAN_LOG_SCRIPT=examples/llama2/mean_log_value.py
TMP_FILE=${TMP_DIR}/tmp.txt
# echo '============================================================================================================'
Collaborator:

Remove the commented lines.

echo "throughput per GPU (TFLOPs/GPU): ${THROUGHPUT}"
rm $TMP_FILE

# echo '============================================================================================================'
Collaborator:

Remove the commented lines.

@gurpreet-dhami gurpreet-dhami merged commit d9a6c85 into rocm_dev Nov 8, 2024
@gurpreet-dhami gurpreet-dhami deleted the rocm_megatron_lm_upstream_rocm_docker branch November 8, 2024 19:15
3 participants