Enable libUV for torchtrain #98
Merged
Conversation
facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Feb 28, 2024.
XilunWu approved these changes on Feb 28, 2024:
Unblock the enablement of libuv in torchtrain. cc @gnadathur
wanchaol reviewed on Feb 28, 2024:
```diff
@@ -2,6 +2,7 @@

 set -ex

+export USE_LIBUV=1
```
IMO we should at least add a comment why we are using libuv here
Done
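For context, the resolved script presumably ends up looking something like the sketch below. Only the `export USE_LIBUV=1` line is visible in this diff; the comment wording and surrounding lines are illustrative, not quoted from the merged file:

```bash
set -ex

# Opt in to the libuv-based TCPStore backend for rendezvous.
# (Comment text here is illustrative; only the export line
# appears in the visible diff.)
export USE_LIBUV=1
```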
lessw2020 pushed a commit that referenced this pull request on Apr 18, 2024:
Enable libUV for torchtrain.

Test:

```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717]
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
[rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
[rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
[rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-28 09:12:10,956 - root - INFO - step:  1  loss: 10.9229  iter:  1.9386  data: 0.0368  lr: 0.00026667
[rank0]:2024-02-28 09:12:11,045 - root - INFO - step:  2  loss: 10.8673  iter:  0.0562  data: 0.0316  lr: 0.00053333
[rank0]:2024-02-28 09:12:11,130 - root - INFO - step:  3  loss: 10.7145  iter:  0.0523  data: 0.0322  lr: 0.0008
[rank0]:2024-02-28 09:12:11,219 - root - INFO - step:  4  loss: 10.5038  iter:  0.0559  data: 0.0319  lr: 0.0007
[rank0]:2024-02-28 09:12:11,304 - root - INFO - step:  5  loss: 10.2228  iter:  0.0537  data: 0.031  lr: 0.0006
[rank0]:2024-02-28 09:12:11,391 - root - INFO - step:  6  loss: 9.9677  iter:  0.0562  data: 0.0302  lr: 0.0005
[rank0]:2024-02-28 09:12:11,478 - root - INFO - step:  7  loss: 9.7762  iter:  0.0544  data: 0.0317  lr: 0.0004
[rank0]:2024-02-28 09:12:11,676 - root - INFO - step:  8  loss: 9.4359  iter:  0.0509  data: 0.0322  lr: 0.0003
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-28 09:12:11,813 - root - INFO - step:  9  loss: 9.2326  iter:  0.1007  data: 0.0321  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-28 09:12:12,207 - root - INFO - step: 10  loss: 9.1641  iter:  0.0971  data: 0.031  lr: 0.0001
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
[rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
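For readers who want to rerun the test, the invocation in the log condenses to the sketch below. This is a reconstruction from the logged command, not a script shipped in this PR; the port, paths, and flags are the ones from the run above:

```bash
#!/usr/bin/env bash
# Reconstructed from the test log above (illustrative).
set -ex

# Export before launching torchrun so every spawned process inherits it.
export USE_LIBUV=1

NGPU=4
LOG_RANK=0,1
CONFIG_FILE=./train_configs/debug_model.toml

torchrun --nproc_per_node=${NGPU} \
         --rdzv_endpoint=localhost:5972 \
         --local-ranks-filter ${LOG_RANK} \
         --role rank --tee 3 \
         train.py --job.config_file ${CONFIG_FILE}
```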
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request on Aug 17, 2024, carrying the same "Enable libUV for torchtrain" message and test log as above.