rocm docker and test scripts #5
Conversation
Force-pushed from 26f78e6 to c4f0b5f
Force-pushed from c4f0b5f to aeed2cd
Dockerfile_amd (Outdated)
##############################################################################
# Apex
##############################################################################
#RUN git clone https://github.com/ROCm/apex.git ${STAGE_DIR}/apex
Remove this code if it's not needed.
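If the Apex build does turn out to be needed, a minimal sketch of the install step might look like this (the ROCm/apex URL comes from the diff; the pip flags are an assumption, not the actual Dockerfile content):

```bash
# Hypothetical Apex install step for the ROCm fork (flags are an assumption).
git clone https://github.com/ROCm/apex.git "${STAGE_DIR}/apex"
cd "${STAGE_DIR}/apex"
pip install -v --no-cache-dir ./
```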
train_llama.sh (Outdated)
TE_FP16="${TE_FP16:-1}"

export CUDA_DEVICE_MAX_CONNECTIONS=1
Do we need this one on ROCm?
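If the setting turns out to be NVIDIA-specific, one option is to make it overridable rather than hard-coded (the variable and default come from the diff; the guard style is a suggestion):

```bash
# Sketch: keep the variable overridable so ROCm runs can change or drop it.
export CUDA_DEVICE_MAX_CONNECTIONS="${CUDA_DEVICE_MAX_CONNECTIONS:-1}"
```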
train_llama.sh (Outdated)
wget -O $TOKENIZER_MODEL https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
fi

# Prepare the dataset
Separate out the dataset preparation into a different script, so that we don't need to download it every time or make changes to this script. We can then use it as is, without any modifications.
It seems that once the dataset is downloaded, it will not be downloaded again when we rerun the script.
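For reference, a minimal sketch of a standalone prepare_dataset.sh, assuming the same tokenizer URL as the diff and a guard that skips the download when the file already exists (script name and default path are assumptions):

```bash
#!/bin/bash
# Hypothetical prepare_dataset.sh: download the tokenizer only once.
TOKENIZER_MODEL="${TOKENIZER_MODEL:-tokenizer.model}"
if [ ! -f "$TOKENIZER_MODEL" ]; then
  wget -O "$TOKENIZER_MODEL" https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
fi
```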
Force-pushed from bdf4c6b to 6dbde6a
Dockerfile_amd (Outdated)
@@ -0,0 +1,84 @@
ARG BASE_DOCKER=rocm/pytorch:latest
#ARG BASE_DOCKER=rocm/pytorch-private:exec_dashboard_nightly
Remove the commented-out lines.
Dockerfile_amd (Outdated)
WORKDIR $WORKSPACE_DIR
RUN git clone https://github.com/ROCm/Megatron-LM.git Megatron-LM &&\
    cd Megatron-LM &&\
    git checkout rocm_megatron_lm_upstream &&\
Now we will use rocm_dev as the main branch.
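The clone step in Dockerfile_amd would then presumably change along these lines (a sketch, assuming only the branch name changes):

```bash
# Assumed update: track the rocm_dev branch instead of rocm_megatron_lm_upstream.
git clone https://github.com/ROCm/Megatron-LM.git Megatron-LM && \
    cd Megatron-LM && \
    git checkout rocm_dev
```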
train_llama.sh (Outdated)
SEQ_PARALLEL="${SEQ_PARALLEL:-1}"
CONTI_PARAMS="${CONTI_PARAMS:-0}"
OPTIMIZER="${OPTIMIZER:-sgd}"
TE_FP16="${TE_FP16:-1}"
The name TE_FP16 is confusing because I think it is actually using bf16.
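A hypothetical rename that keeps the old variable as a fallback so existing invocations don't break (TE_BF16 is a suggested name, not from the script):

```bash
# Hypothetical rename: TE_BF16 reflects the actual dtype; fall back to the old name.
TE_BF16="${TE_BF16:-${TE_FP16:-1}}"
```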
train_llama.sh (Outdated)
# Change for multinode config
MASTER_ADDR=localhost
Do we intend to use this script for single-node only, or do we also want to use it for multi-node? If the latter, we should make the multi-node options configurable from the command line as well.
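If multi-node runs are in scope, a sketch of making the rendezvous settings overridable (only MASTER_ADDR=localhost appears in the diff; the other variable names and defaults are assumptions):

```bash
# Sketch: multi-node options overridable from the environment or command line.
MASTER_ADDR="${MASTER_ADDR:-localhost}"
MASTER_PORT="${MASTER_PORT:-6000}"
NNODES="${NNODES:-1}"
NODE_RANK="${NODE_RANK:-0}"
```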
@gurpreet-dhami Put these scripts under the examples/llama folder, similar to how other workloads are arranged, create a README.md on how to create the dataset, and add the script there.
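A hypothetical layout, mirroring how other workloads are arranged (file names are assumptions):

```
examples/llama/
├── README.md            # how to prepare the dataset and launch training
├── prepare_dataset.sh   # dataset/tokenizer download, run once
└── train_llama.sh       # training launcher
```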
Force-pushed from ec8bfa6 to 4ec12e4
Force-pushed from 4ec12e4 to 77113cc
examples/llama2/train_llama2.sh (Outdated)
--eval-iters -1
"

# --save-interval $TOTAL_ITERS \
Remove the commented lines.
examples/llama2/train_llama2.sh (Outdated)
--no-masked-softmax-fusion \
--overlap-grad-reduce \
"
# --no-masked-softmax-fusion \
Remove the commented lines.
examples/llama2/train_llama2.sh (Outdated)
MEAN_LOG_SCRIPT=examples/llama2/mean_log_value.py
TMP_FILE=${TMP_DIR}/tmp.txt
# echo '============================================================================================================'
Remove the commented lines.
examples/llama2/train_llama2.sh (Outdated)
echo "throughput per GPU (TFLOPs/GPU): ${THROUGHPUT}" | ||
rm $TMP_FILE | ||
|
||
# echo '============================================================================================================' |
Remove the commented lines.
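For context, the averaging that mean_log_value.py performs over TMP_FILE could be sketched inline with awk (a stand-in assuming one numeric value per line; not the actual script):

```bash
# Hypothetical stand-in for mean_log_value.py: average one number per line.
THROUGHPUT=$(awk '{ sum += $1; n++ } END { if (n > 0) printf "%.2f", sum / n }' "$TMP_FILE")
echo "throughput per GPU (TFLOPs/GPU): ${THROUGHPUT}"
```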