Repeatability of Small Model Training Script with fixed seed(s) and same dataset #92

Open
pad9153 opened this issue Jun 6, 2024 · 1 comment
pad9153 commented Jun 6, 2024

We observed noticeable variability when re-running the FSDP model training script for a small 1.xB llama2 model with fixed seed(s) and the same tokens. Below is a snapshot of the evaluation results for three models trained with identical inputs (tokens, training script, seed(s)). Could you please help us investigate the root cause of this variability (data loader, hardware variability, or other additional variables)? Thanks in advance!

[Screenshot: evaluation results for the three runs]
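For context, "fixed seed(s)" above refers to the usual global RNG seeding. As a point of reference, here is a minimal sketch of that seeding plus the extra determinism flags that bit-exact GPU repeatability usually also needs; this is illustrative only, not the actual training script:

```python
import os
import random

import numpy as np
import torch


def set_global_seed(seed: int = 42) -> None:
    """Seed every RNG the training loop may touch (illustrative sketch)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


set_global_seed(42)

# Fixed seeds alone do not guarantee bit-exact repeatability on GPUs;
# non-deterministic kernels must also be disabled (at some throughput cost):
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by deterministic CUDA matmuls
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True, warn_only=True)
```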
@dangxuanhong

Yes, the above results were from 3 runs of the same yaml file (i.e., the same model config, dataset, training params, random seed, etc.), with only the experiment_id changed. The general setting is:

```yaml
tokenizer: /cos_ablation/tokenizers/bigcode_starcoder
max_seq_len: 8192
vocab_size: 49152
seed: 42
save_steps: 5000
max_steps: 35000
do_lmeval: True
learning_rate: 6e-4
max_batch_len: 2
num_nodes: 8
use_profiler: "False"
eos_token: "0"
bos_token: "None"
logical_shards: 640
```
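One suspect worth ruling out is the data-loading path (the logical_shards / distributed loader). For comparison, below is a minimal sketch of how per-worker seeding is usually pinned with a plain torch DataLoader; this is a hypothetical example, not this repo's sharded loader:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    # Derive each worker's seed from torch's initial seed so that a re-run
    # with the same base seed replays the exact same sample order.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)


# Toy dataset standing in for the tokenized training data.
dataset = TensorDataset(torch.arange(1_000))

g = torch.Generator()
g.manual_seed(42)  # same value as `seed:` in the yaml above

loader = DataLoader(
    dataset,
    batch_size=2,          # illustrative; not necessarily max_batch_len
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,
    generator=g,
)
```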
