
grad_multiply function use? #13

Closed
piyank22 opened this issue Sep 30, 2017 · 3 comments


piyank22 commented Sep 30, 2017

```python
class GradMultiply(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        # remember the scale factor; the output shares storage with the input
        ctx.scale = scale
        res = x.new(x)
        ctx.mark_shared_storage((x, res))
        return res

    @staticmethod
    def backward(ctx, grad):
        # scale only the gradient flowing back to x (None for the scale arg)
        return grad * ctx.scale, None
```

On reading the paper, I have either completely missed or not come across the part where the gradient flowing back has to be scaled.
Please point me to where I can get clarity on its use and why it has a positive impact on the results.
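
In short, the forward pass is an identity and only the backward gradient is multiplied by `scale`. A quick self-contained check (a simplified re-implementation without the legacy `mark_shared_storage` call, not fairseq's exact code):

```python
import torch

class GradMultiplyDemo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()               # identity in the forward pass

    @staticmethod
    def backward(ctx, grad):
        return grad * ctx.scale, None  # scale only the gradient w.r.t. x

x = torch.randn(3, requires_grad=True)
y = GradMultiplyDemo.apply(x, 0.5)
print(torch.equal(y, x))   # True: forward output equals the input
y.sum().backward()
print(x.grad)              # tensor([0.5000, 0.5000, 0.5000]): gradients halved
```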

@piyank22 (Author)

[Line in fconv.py](https://github.com/facebookresearch/fairseq-py/blob/master/fairseq/models/fconv.py#L118)

piyank22 closed this as completed Oct 1, 2017
@gaopeng-eugene

Why do you scale the gradient? Any explanation?


aa1607 commented Apr 24, 2018

In case you were still wondering, it says here in the paper: "For convolutional decoders with multiple attention, we scale the gradients for the encoder layers by the number of attention mechanisms we use; we exclude source word embeddings. We found this to stabilize learning since the encoder received too much gradient otherwise."

https://arxiv.org/pdf/1705.03122.pdf
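
Concretely, in fconv.py (the line linked above) the scale applied to the encoder output is the inverse of the number of attention mechanisms, roughly like this (an illustrative sketch; `encoder_out` and `num_attention_layers` are stand-in names and the exact constant may differ):

```python
# Illustrative only: shrink the gradient reaching the encoder output by the
# number of attention mechanisms, per the paper. Forward activations are
# unchanged; only the backward gradient is multiplied by the scale.
encoder_out = GradMultiply.apply(encoder_out, 1.0 / num_attention_layers)
```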

myleott pushed a commit that referenced this issue Jun 26, 2018
Train on multiple GPUs with computation happening in a separate Python process
per model (using multiprocessing) to avoid the GIL.

Achieves a ~6x speedup using 8 GPUs. Speed is comparable to the Lua version:

Version  | # GPUs | effective bsz | words/s | # epochs | train loss | valid loss | BLEU
---------|--------|---------------|---------|----------|------------|------------|-----
PyTorch  | 1      | 32            | 5.6k    | -        | -          | -          | -
LuaTorch | 1      | 32            | 5.4k    | -        | -          | -          | -
PyTorch  | 4      | 128           | 18.5k   | 24       | 2.56       | 3.17       | 29.85
LuaTorch | 4      | 128           | 18.6k   | 18       | 2.80       | 3.20       | 28.58
PyTorch  | 8      | 256           | 33.0k   | -        | -          | -          | -
LuaTorch | 8      | 256           | 32.0k   | -        | -          | -          | -

Above results are on K80s with cuDNNv5 and NCCLv1.
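
A minimal sketch of the one-process-per-GPU pattern described above (not fairseq's actual trainer; the worker body is a placeholder):

```python
import torch
import torch.multiprocessing as mp

def train_worker(rank: int, world_size: int) -> None:
    # One Python process per GPU sidesteps the GIL; each process owns its own
    # model replica and would synchronize gradients across processes
    # (e.g. with an NCCL all-reduce). The real training loop is omitted.
    torch.cuda.set_device(rank)
    print(f"worker {rank} of {world_size} running on cuda:{rank}")

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(train_worker, args=(n_gpus,), nprocs=n_gpus)
```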
taylanbil added a commit to taylanbil/fairseq that referenced this issue Oct 28, 2019
taylanbil added a commit to taylanbil/fairseq that referenced this issue Nov 13, 2019
optimizer fix
progress bar comment out temporarily
some changes to train_tpu
int mask instead of float

pfpfpfpf

fix

printing device index per loop

bkpt to investigate resize_ call

attempting to init buffer size to 2*dim

bkpt

better print

do not drop records when computing loss

Changes that reduce graph compiles (a rough sketch of the fixed-shape idea follows the list).

* Loss function replaced with an equivalent logic that doesn't resize
tensors.
* cli args changed to guarantee consistency
* collate_tokens function in fairseq/data/data_utils.py overwritten to
guarantee consistency
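
A rough illustration of the fixed-shape idea behind these changes (a hypothetical helper, not the actual collate_tokens override): pad every batch to one global length so the compiled XLA graph sees a single shape instead of a new shape per batch.

```python
import torch

def collate_to_fixed_size(samples, pad_idx: int, fixed_len: int) -> torch.Tensor:
    # Hypothetical helper: always produce a (batch, fixed_len) tensor so that
    # XLA does not recompile the graph whenever the batch's max length changes.
    batch = torch.full((len(samples), fixed_len), pad_idx, dtype=torch.long)
    for i, s in enumerate(samples):
        n = min(s.numel(), fixed_len)
        batch[i, :n] = s[:n]
    return batch
```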

undoing some changes made while debugging

progress_bar implements len

some irrelevant changes to train_tpu.py

new xla changes

bug fix in enable_torch_version

removing the last batch that is of different size from the iterator

delete optimizer step in fairseq's trainer

Added `self.xla` flag that controls whether Trainer includes the optimizer step

+ Tried to include more explanation of why the optimizer step is skipped this time

deleted obsolete file

add norm clipping count back in (#4)

remove grad norm clip count (#5)

Change masked_fill_ input in loss in order to accommodate necessary pytorch changes (#6)

Adding tpu capabilities to train.py (facebookresearch#8)

* Adding tpu capabilities to train.py

* flush when printing for better user experience

* separated cli_main into parse_args, maingpu and maintpu
deleted unused line in datautils.py

Enumerate the loader in training and validation (facebookresearch#9)

* Adding tpu capabilities to train.py

* flush when printing for better user experience

* separated cli_main into parse_args, maingpu and maintpu
deleted unused line in datautils.py

* Enumerate the loader

* enumerate the loader

Add option to assert on training and/or validation loss (facebookresearch#10)

* Add option to assert on training and/or validation loss

* applied suggestion

None loss should be filled to inf (facebookresearch#11)

Enabling multiprocessing for fairseq training. (facebookresearch#12)

* initial commit for multiprocess api

* indentation fixes and import fix

* no need to softlink, fix save/load

* Remove the hacks to only save from master ordinal as xm.save takes care of that

* fix indentation; 3 -> 4 spaces

* Moved xu.eprints after spawn and dropping last batches better

trainers->trainer (facebookresearch#13)

fix bug in assert_on_losses

Replace usage of unsqueeze with transpose + broadcasting (facebookresearch#15)

remove attn mask + loss rewrite + save per host +

format
suppress loss report
allow usage of batch_by_size in translation.
attn_weights masked fill in place

Clean up the log output suppressing a bit

Revert multihead attn's in_proj code changes

The non-rebased TPU branch is about 10% faster on TPUs than the rebased
branch. The regression is inside multihead attn's in_proj mechanism.
Reverting the relevant changes to preserve performance.

Pass correct args to the new get_valid_stats function

Send meters to device in order not to fail training when resuming from a checkpoint
yfyeung pushed a commit to yfyeung/fairseq that referenced this issue Dec 6, 2023
…acebookresearch#13)

* Add grad_clip and weight-decay, small fix of dataloader and masking

* Add RESULTS.md