Fix handling of partially-empty initial batch #11

Merged

myleott merged 1 commit into master from fix_empty_batches on Sep 28, 2017

Conversation

@myleott commented Sep 28, 2017

Since PyTorch initializes gradient buffers lazily, it's important that the first batch doesn't contain any empty samples. This PR replaces empty samples by cycling through the given samples instead of using None.
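For illustration, a minimal sketch of the approach (the helper name `replace_empty_samples` is hypothetical, not the code added by this PR): empty slots in a batch are filled by cycling through the non-empty samples, so the lazily-initialized gradient buffers see a fully populated first batch.

```python
import itertools

def replace_empty_samples(samples):
    """Fill empty slots by cycling through the non-empty samples instead
    of leaving None placeholders in the batch. Illustrative sketch only,
    not fairseq's actual implementation."""
    non_empty = [s for s in samples if s]
    if not non_empty:
        return samples  # nothing usable to cycle through
    filler = itertools.cycle(non_empty)
    return [s if s else next(filler) for s in samples]

# Example: the None entries are filled by cycling over "a" and "b".
print(replace_empty_samples(["a", None, "b", None, None]))
# -> ['a', 'a', 'b', 'b', 'a']
```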

@myleott merged commit 4593ebf into master on Sep 28, 2017
@myleott deleted the fix_empty_batches branch on September 28, 2017 16:46
myleott pushed a commit that referenced this pull request Jun 26, 2018
taylanbil added a commit to taylanbil/fairseq that referenced this pull request Oct 21, 2019
taylanbil added a commit to taylanbil/fairseq that referenced this pull request Nov 13, 2019
optimizer fix
comment out progress bar temporarily
some changes to train_tpu
int mask instead of float

fix

printing device index per loop

breakpoint to investigate resize_ call

attempting to init buffer size to 2*dim

breakpoint

better print

do not drop records when computing loss

Changes that reduce graph compiles.

* Loss function replaced with equivalent logic that doesn't resize
  tensors (see the sketch below).
* cli args changed to guarantee consistency
* collate_tokens function in fairseq/data/data_utils.py overwritten to
  guarantee consistency
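For illustration, a minimal sketch of the fixed-shape pattern mentioned above, under assumed tensor shapes (the function name is hypothetical, not the exact loss logic in this commit): rather than indexing out the non-pad positions, which changes tensor shapes and forces XLA to recompile, the per-token loss is computed at full shape and padding is zeroed with a mask.

```python
import torch

def nll_loss_fixed_shape(lprobs, target, pad_idx):
    # lprobs: (batch, seq, vocab) log-probabilities; target: (batch, seq) token ids.
    # Keep every tensor at its full, static shape and mask out padding,
    # instead of selecting the non-pad positions (which changes shapes
    # and forces XLA to recompile the graph).
    nll = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    nll = nll.masked_fill(target.eq(pad_idx), 0.0)
    return nll.sum()
```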

undoing some changes made while debugging

progress_bar implements len

some irrelevant changes to train_tpu.py

new xla changes

bug fix in enable_torch_version

removing the last batch that is of different size from the iterator

delete optimizer step in fairseq's trainer

Added `self.xla` flag that controls if Trainer includes optimizer step

+ Tried to include more explanation of why the optimizer step is skipped this time

deleted obsolete file

add norm clipping count back in (#4)

remove grad norm clip count (#5)

Change masked_fill_ input in loss in order to accommodate necessary pytorch changes (#6)
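The entry doesn't spell out the change, but newer PyTorch versions require `masked_fill_` masks to be `torch.bool` rather than `torch.uint8`; assuming that is the change being referenced, a minimal sketch:

```python
import torch

scores = torch.randn(2, 4)
pad = torch.tensor([[0, 0, 1, 1], [0, 1, 1, 1]], dtype=torch.uint8)

# Newer PyTorch deprecates uint8/byte masks for masked_fill_;
# casting the mask to bool keeps the call valid.
scores.masked_fill_(pad.to(torch.bool), float("-inf"))
```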

Adding tpu capabilities to train.py (facebookresearch#8)

* Adding tpu capabilities to train.py

* flush when printing for better user experience

* separated cli_main into parse_args, maingpu and maintpu
deleted unused line in datautils.py

Enumerate the loader in training and validation (facebookresearch#9)

* Adding tpu capabilities to train.py

* flush when printing for better user experience

* separated cli_main into parse_args, maingpu and maintpu
deleted unused line in datautils.py

* Enumerate the loader

* enumerate the loader

Add option to assert on training and/or validation loss (facebookresearch#10)

* Add option to assert on training and/or validation loss

* applied suggestion

None loss should be filled to inf (facebookresearch#11)

Enabling multiprocessing for fairseq training. (facebookresearch#12)

* initial commit for multiprocess api

* indentation fixes and import fix

* no need to softlink, fix save/load

* Remove the hacks to only save from master ordinal as xm.save takes care of that

* fix indentation; 3 -> 4 spaces

* Moved xu.eprints after spawn and improved the dropping of last batches

trainers->trainer (facebookresearch#13)

fix bug in assert_on_losses

Replace usage of unsqueeze with transpose + broadcasting (facebookresearch#15)

remove attn mask + loss rewrite + save per host +

format
suppress loss report
allow usage of batch_by_size in translation.
attn_weights masked fill in place

Clean up the log output suppressing a bit

Revert multihead attn's in_proj code changes

The non-rebased TPU branch is about 10% faster on TPUs
than the rebased branch. The regression is inside multihead
attn's in_proj mechanism. Reverting the relevant changes to preserve
performance.

Pass correct args to the new get_valid_stats function

Send meters to device in order not to fail training when resuming from checkpoint
yushuiwx pushed a commit to yushuiwx/fairseq that referenced this pull request Sep 26, 2024