Fix handling of partially-empty initial batch #11
Merged
Conversation
myleott pushed a commit that referenced this pull request on Jun 26, 2018
myleott pushed a commit that referenced this pull request on Jun 26, 2018
taylanbil added a commit to taylanbil/fairseq that referenced this pull request on Oct 21, 2019
taylanbil added a commit to taylanbil/fairseq that referenced this pull request on Nov 13, 2019
- optimizer fix
- progress bar comment out temporarily
- some changes to train_tpu
- int mask instead of float
- pfpfpfpf
- fix printing device index per loop
- bkpt to investigate resize_ call
- attempting to init buffer size to 2*dim
- bkpt
- better print
- do not drop records when computing loss
- Changes that reduce graph compiles:
  - Loss function replaced with an equivalent logic that doesn't resize tensors (see the sketch after this list).
  - cli args changed to guarantee consistency
  - collate_tokens function in fairseq/data/data_utils.py overwritten to guarantee consistency
- undoing some changes made while debugging
- progress_bar implements len
- some irrelevant changes to train_tpu.py
- new xla changes
- bug fix in enable_torch_version
- removing the last batch that is of a different size from the iterator
- delete optimizer step in fairseq's trainer; added `self.xla` flag that controls whether Trainer includes the optimizer step, plus more explanation of why the optimizer step is skipped
- deleted obsolete file
- add norm clipping count back in (#4)
- remove grad norm clip count (#5)
- Change masked_fill_ input in loss in order to accommodate necessary pytorch changes (#6)
- Adding tpu capabilities to train.py (facebookresearch#8): flush when printing for better user experience; separated cli_main into parse_args, maingpu and maintpu; deleted unused line in datautils.py
- Enumerate the loader in training and validation (facebookresearch#9)
- Add option to assert on training and/or validation loss (facebookresearch#10)
- None loss should be filled to inf (facebookresearch#11)
- Enabling multiprocessing for fairseq training (facebookresearch#12): initial commit for multiprocess api; indentation fixes and import fix; no need to softlink, fix save/load; removed the hacks to only save from the master ordinal since xm.save takes care of that; fix indentation (3 -> 4 spaces); moved xu.eprints after spawn and dropping last batches
- better trainers -> trainer (facebookresearch#13)
- fix bug in assert_on_losses
- Replace usage of unsqueeze with transpose + broadcasting (facebookresearch#15)
- remove attn mask + loss rewrite + save per host + format
- suppress loss report
- allow usage of batch_by_size in translation
- attn_weights masked fill in place
- Clean up the log output, suppressing a bit
- Revert multihead attn's in_proj code changes: the non-rebased tpu branch is about 10% faster on TPUs than the rebased branch, and the regression is inside multihead attn's in_proj mechanism, so the relevant changes are reverted to preserve performance
- Pass correct args to the new get_valid_stats function
- Send meters to device in order not to fail training when resuming from checkpoint
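The entry "Loss function replaced with an equivalent logic that doesn't resize tensors" points at a recurring XLA/TPU concern: any operation whose output shape depends on the data (for example, dropping padding positions before reducing the loss) produces different tensor shapes from step to step and forces a graph recompile. A minimal sketch of that idea, using a hypothetical `nll_loss_fixed_shape` helper rather than fairseq's actual criterion:

```python
import torch

def nll_loss_fixed_shape(lprobs, target, pad_idx):
    # Hypothetical helper (not fairseq's actual criterion): gather the
    # negative log-likelihood of every position, then zero out the padding
    # positions with masked_fill instead of dropping them, so the loss is
    # computed on tensors of a constant shape at every step.
    # lprobs: (num_tokens, vocab_size), target: (num_tokens,)
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    pad_mask = target.eq(pad_idx)
    nll = nll.masked_fill(pad_mask, 0.0)
    return nll.sum()

# Toy usage: six target positions over a vocabulary of five, pad index 0.
lprobs = torch.log_softmax(torch.randn(6, 5), dim=-1)
target = torch.tensor([4, 2, 0, 1, 3, 0])
print(nll_loss_fixed_shape(lprobs, target, pad_idx=0))
```

Padding positions are masked to zero rather than removed, so the tensors keep the same shape on every call and XLA can reuse a single compiled graph.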
yushuiwx pushed a commit to yushuiwx/fairseq that referenced this pull request on Sep 26, 2024
Since PyTorch initializes gradient buffers lazily, it's important that the first batch doesn't contain any empty samples. This PR replaces empty samples by cycling through the given samples instead of using None.
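A minimal sketch of the idea, assuming a hypothetical `replace_empty_samples` helper rather than the PR's actual code: empty slots, previously left as None, are filled by cycling through the non-empty samples, so the first batch exercises every model input and the lazily created gradient buffers are allocated.

```python
import itertools

def replace_empty_samples(samples):
    # Hypothetical sketch, not fairseq's actual collater: treat None entries
    # as the "empty" samples and fill them by cycling through the real ones,
    # so the first batch never contains an empty sample and PyTorch's lazily
    # initialized gradient buffers get created for every parameter.
    non_empty = [s for s in samples if s is not None]
    if not non_empty:
        return samples  # nothing usable to cycle through; leave untouched
    cycler = itertools.cycle(non_empty)
    return [s if s is not None else next(cycler) for s in samples]

# Toy usage with placeholder samples.
print(replace_empty_samples([{"id": 0}, None, {"id": 2}, None]))
# -> [{'id': 0}, {'id': 0}, {'id': 2}, {'id': 2}]
```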