grad_multiply function use? #13
[Line in fconv.py](https://github.com/facebookresearch/fairseq-py/blob/master/fairseq/models/fconv.py#L118)
Why do you scale the gradient? Is there any explanation?
In case you were still wondering, it says here in the paper: "For convolutional decoders with multiple attention, we scale the gradients for the encoder layers by the number of attention mechanisms we use; we exclude source word embeddings. We found this to stabilize learning since the encoder received too much gradient otherwise."
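For reference, the linked line applies this scaling to the encoder output. The snippet below is a hedged sketch of that call site, not a verbatim copy of fconv.py; the factor `1.0 / (2.0 * self.num_attention_layers)` and the attribute name are assumptions based on the paper's description.

```python
# Sketch of the call site in the convolutional encoder's forward() (names assumed):
# the forward values of x are untouched; only the gradient flowing back into the
# encoder is scaled down according to the number of attention mechanisms.
x = GradMultiply.apply(x, 1.0 / (2.0 * self.num_attention_layers))
```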
Train on multiple GPUs with computation happening on a separate Python process per model (using multiprocessing) to avoid the GIL. Achieves ~6x speedup using 8 GPUs. Speed is comparable to the Lua version:

| Version  | # GPUs | effective bsz | words/s | # epochs | train loss | valid loss | BLEU  |
|----------|--------|---------------|---------|----------|------------|------------|-------|
| PyTorch  | 1      | 32            | 5.6k    | -        | -          | -          | -     |
| LuaTorch | 1      | 32            | 5.4k    | -        | -          | -          | -     |
| PyTorch  | 4      | 128           | 18.5k   | 24       | 2.56       | 3.17       | 29.85 |
| LuaTorch | 4      | 128           | 18.6k   | 18       | 2.80       | 3.20       | 28.58 |
| PyTorch  | 8      | 256           | 33.0k   | -        | -          | -          | -     |
| LuaTorch | 8      | 256           | 32.0k   | -        | -          | -          | -     |

Above results are on K80s with cuDNNv5 and NCCLv1.
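For illustration, a one-process-per-GPU training loop along these lines might look like the simplified sketch below. This is a hedged example, not fairseq's actual trainer: `train_worker`, the toy model, and the hyperparameters are made up, and the cross-GPU gradient all-reduce is omitted.

```python
import torch
import torch.multiprocessing as mp

def train_worker(rank, world_size):
    # Each worker is a separate Python process bound to one GPU,
    # so model computation is not serialized by the GIL.
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(512, 512).cuda(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):
        x = torch.randn(32, 512, device=f"cuda:{rank}")
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        # In a real setup, gradients would be all-reduced across workers
        # (e.g. via NCCL) before the optimizer step.
        optimizer.step()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```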
```python
import torch

class GradMultiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # Forward is effectively the identity: the output shares storage with the input.
        ctx.scale = scale
        res = x.new(x)
        ctx.mark_shared_storage((x, res))
        return res

    @staticmethod
    def backward(ctx, grad):
        # Backward multiplies the incoming gradient by `scale`; `None` is for `scale` itself.
        return grad * ctx.scale, None
```
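To make the effect concrete, here is a minimal, self-contained sketch (rewritten for current PyTorch, so it uses `x.clone()` rather than the `x.new` / `mark_shared_storage` pair quoted above): the forward output equals the input, but the gradient that reaches `x` is multiplied by `scale`.

```python
import torch

class GradMultiplyDemo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None  # scale the gradient; no gradient for `scale`

x = torch.ones(3, requires_grad=True)
GradMultiplyDemo.apply(x, 0.5).sum().backward()
print(x.grad)  # tensor([0.5000, 0.5000, 0.5000]) -- values unchanged, gradient halved
```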
On reading the paper, I have either completely missed it or not come across the part where the gradient flowing back has to be scaled.
Please point out where I can get clarity on its use and why it has a positive impact on the results.