update fsdp config and scheduled trainer
Spico197 committed Aug 16, 2023 · 1 parent 61912db · commit ef4c60a
Showing 4 changed files with 8 additions and 5 deletions.
conf/fsdp_config.json (3 additions, 0 deletions)

@@ -0,0 +1,3 @@
+{
+    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer"
+}
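For reference, this is the JSON file that Hugging Face's Trainer consumes through its fsdp_config argument: naming LlamaDecoderLayer tells FSDP's auto-wrap policy to shard the model one decoder block at a time. A minimal sketch of how the file plugs in; the actual --fsdp options used by the training scripts are not visible in this commit, so "full_shard auto_wrap" is an assumption:

from transformers import TrainingArguments

# Sketch only: "full_shard auto_wrap" is assumed -- the real --fsdp value
# passed by the scripts below is not shown in this diff.
args = TrainingArguments(
    output_dir="outputs/seed_model",
    fsdp="full_shard auto_wrap",          # enable FSDP with auto-wrapping
    fsdp_config="conf/fsdp_config.json",  # wrap at LlamaDecoderLayer boundaries
)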
scripts/train_backward_Myx.sh (2 additions, 2 deletions)

@@ -4,7 +4,7 @@ num_nodes=1
 num_gpu_per_node=8

 bsz=32
-output_dir="outputs/backward"
+output_dir="/dev/shm/tzhu/Humback/outputs/backward_model_on_seed_data_scheduled"
 bsz_per_dev=$(echo "${bsz} / ${num_nodes} / ${num_gpu_per_node}" | bc)

 torchrun \
@@ -28,7 +28,7 @@ torchrun \
     --logging_strategy steps \
     --logging_steps 1 \
     --save_strategy epoch \
-    --save_total_limit 3 \
+    --save_total_limit 1 \
     --output_dir ${output_dir} \
     --overwrite_output_dir \
     --ddp_timeout 30000 \
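With the values above, bsz_per_dev evaluates to 32 / 1 / 8 = 4 sequences per GPU. Note that bc performs integer division at its default scale, so bsz should stay divisible by num_nodes * num_gpu_per_node to avoid silently dropping the remainder.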
scripts/train_seed.sh (2 additions, 2 deletions)

@@ -4,7 +4,7 @@ num_nodes=1
 num_gpu_per_node=8

 bsz=32
-output_dir="outputs/seed_model"
+output_dir="/dev/shm/tzhu/Humback/outputs/forward_model_on_seed_data_scheduled"
 bsz_per_dev=$(echo "${bsz} / ${num_nodes} / ${num_gpu_per_node}" | bc)

 torchrun \
@@ -27,7 +27,7 @@ torchrun \
     --logging_strategy steps \
     --logging_steps 1 \
     --save_strategy epoch \
-    --save_total_limit 3 \
+    --save_total_limit 1 \
     --output_dir ${output_dir} \
     --overwrite_output_dir \
     --ddp_timeout 30000 \
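Both scripts move their outputs under /dev/shm, a RAM-backed tmpfs that makes checkpoint writes fast but volatile. Lowering --save_total_limit from 3 to 1 keeps only the newest checkpoint, which presumably serves to bound how much shared memory the run consumes.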
src/train.py (1 addition, 1 deletion)

@@ -369,7 +369,7 @@ def train():
     data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)

     # Start trainner
-    trainer = Trainer(
+    trainer = ScheduledTrainer(
         model=model, tokenizer=tokenizer, args=training_args, **data_module
     )
     if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
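The definition of ScheduledTrainer is not part of this diff, so the sketch below is only a guess at its shape: a transformers.Trainer subclass that overrides create_scheduler to install a custom learning-rate schedule, consistent with the "scheduled" naming in the commit message and output directories. Everything beyond the class name is an assumption.

import math

from torch.optim.lr_scheduler import LambdaLR
from transformers import Trainer


class ScheduledTrainer(Trainer):
    """Hypothetical sketch: a Trainer with a custom LR schedule.

    Only the class name appears in this commit; the body below is an
    assumption, not the repository's actual implementation.
    """

    def create_scheduler(self, num_training_steps, optimizer=None):
        if self.lr_scheduler is None:
            optimizer = optimizer if optimizer is not None else self.optimizer
            warmup_steps = self.args.get_warmup_steps(num_training_steps)

            def lr_lambda(step):
                # Linear warmup followed by cosine decay to zero.
                if step < warmup_steps:
                    return (step + 1) / max(1, warmup_steps)
                progress = (step - warmup_steps) / max(
                    1, num_training_steps - warmup_steps
                )
                return 0.5 * (1.0 + math.cos(math.pi * progress))

            self.lr_scheduler = LambdaLR(optimizer, lr_lambda)
        return self.lr_scheduler

Because only the class passed at construction time changes, the rest of train(), including the checkpoint-resume detection that follows, works unmodified.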
