grad_multiply function use? #13
[Line in fconv.py](https://github.com/facebookresearch/fairseq-py/blob/master/fairseq/models/fconv.py#L118)
Why do you scale the gradient? Is there any explanation?
In case you were still wondering, it says here in the paper: "For convolutional decoders with multiple attention, we scale the gradients for the encoder layers by the number of attention mechanisms we use; we exclude source word embeddings. We found this to stabilize learning since the encoder received too much gradient otherwise."
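For reference, the linked line applies this scaling to the encoder output. The snippet below is a hedged sketch of that call site, not a verbatim copy of fconv.py; the factor `1.0 / (2.0 * self.num_attention_layers)` and the attribute name are assumptions based on the paper's description.

```python
# Sketch of the call site in the convolutional encoder's forward() (names assumed):
# the forward values of x are untouched; only the gradient flowing back into the
# encoder is scaled down according to the number of attention mechanisms.
x = GradMultiply.apply(x, 1.0 / (2.0 * self.num_attention_layers))
```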
Train on multiple GPUs with computation happening on a separate Python process per model (using multiprocessing) to avoid the GIL. Achieves ~6x speedup using 8 GPUs. Speed is comparable to the Lua version:

| Version  | # GPUs | effective bsz | words/s | # epochs | train loss | valid loss | BLEU  |
|----------|--------|---------------|---------|----------|------------|------------|-------|
| PyTorch  | 1      | 32            | 5.6k    | -        | -          | -          | -     |
| LuaTorch | 1      | 32            | 5.4k    | -        | -          | -          | -     |
| PyTorch  | 4      | 128           | 18.5k   | 24       | 2.56       | 3.17       | 29.85 |
| LuaTorch | 4      | 128           | 18.6k   | 18       | 2.80       | 3.20       | 28.58 |
| PyTorch  | 8      | 256           | 33.0k   | -        | -          | -          | -     |
| LuaTorch | 8      | 256           | 32.0k   | -        | -          | -          | -     |

Above results are on K80s with cuDNNv5 and NCCLv1.
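For illustration, a one-process-per-GPU training loop along these lines might look like the simplified sketch below. This is a hedged example, not fairseq's actual trainer: `train_worker`, the toy model, and the hyperparameters are made up, and the cross-GPU gradient all-reduce is omitted.

```python
import torch
import torch.multiprocessing as mp

def train_worker(rank, world_size):
    # Each worker is a separate Python process bound to one GPU,
    # so model computation is not serialized by the GIL.
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(512, 512).cuda(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):
        x = torch.randn(32, 512, device=f"cuda:{rank}")
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        # In a real setup, gradients would be all-reduced across workers
        # (e.g. via NCCL) before the optimizer step.
        optimizer.step()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```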
```python
import torch

class GradMultiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # Forward is effectively the identity: the output shares storage with the input.
        ctx.scale = scale
        res = x.new(x)
        ctx.mark_shared_storage((x, res))
        return res

    @staticmethod
    def backward(ctx, grad):
        # Backward multiplies the incoming gradient by `scale`; `None` is for `scale` itself.
        return grad * ctx.scale, None
```
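To make the effect concrete, here is a minimal, self-contained sketch (rewritten for current PyTorch, so it uses `x.clone()` rather than the `x.new` / `mark_shared_storage` pair quoted above): the forward output equals the input, but the gradient that reaches `x` is multiplied by `scale`.

```python
import torch

class GradMultiplyDemo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None  # scale the gradient; no gradient for `scale`

x = torch.ones(3, requires_grad=True)
GradMultiplyDemo.apply(x, 0.5).sum().backward()
print(x.grad)  # tensor([0.5000, 0.5000, 0.5000]) -- values unchanged, gradient halved
```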
On reading the paper, I have either completely missed it or not come across the part where the gradient flowing back has to be scaled.
Please point out where I can get clarity on its use and why it has a positive impact on the results.