
sparse optimizers #1285

Closed
adamlerer opened this issue Apr 18, 2017 · 18 comments
Labels
feature A request for a proper, new feature. module: optimizer Related to torch.optim module: sparse Related to torch.sparse triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@adamlerer
Contributor

Now that sparse tensors are mostly working (#1147), it would be awesome if all the optimizers worked with sparse tensors. This requires some cleverness to do amortized updates to the parameters. For example, for weight decay, you would do something like (pseudocode):

# apply weight decay lazily
# dp is the sparse gradient of p; n[i] is the last step at which row i was decayed
n_t += 1
for i in dp.indices():
    p[i] *= (1 - lr * wd) ** (n_t - n[i])
    n[i] = n_t

Note this isn't exactly equivalent (to be exactly equivalent you'd need to apply the weight decay before the forward pass, not after the backward pass), but it's a good approximation. You can do the same thing for momentum.
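
Fleshing that pseudocode out, here is a minimal self-contained sketch of the amortized scheme (the class name LazyWeightDecay, the per-row step counters, and the assumption of a 2-D embedding-style parameter are illustrative, not anything that exists in torch.optim):

    import torch

    class LazyWeightDecay:
        """Amortized weight decay: a row of `param` is decayed only when it
        shows up in a sparse gradient, by the power of (1 - lr * wd) it has
        accumulated since the last time it was touched."""

        def __init__(self, param, lr, weight_decay):
            self.param = param          # assumed 2-D, e.g. an embedding matrix
            self.lr = lr
            self.wd = weight_decay
            self.step_count = 0
            # step at which each row was last decayed
            self.last_step = torch.zeros(param.size(0), dtype=torch.long)

        def apply(self, sparse_grad):
            self.step_count += 1
            rows = sparse_grad.coalesce().indices()[0]
            # steps of decay each touched row has missed
            missed = (self.step_count - self.last_step[rows]).to(self.param.dtype)
            factor = (1 - self.lr * self.wd) ** missed
            self.param.data[rows] *= factor.unsqueeze(1)
            self.last_step[rows] = self.step_count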

I'm guessing the same thing works for Adam/Adamax as well but I haven't worked through the equations. https://arxiv.org/pdf/1412.6980.pdf

@ezyang expressed interest in working on this.

@ezyang
Contributor

ezyang commented Apr 19, 2017

Here is an example test case for optimization when we have sparse gradients, done by "knocking out" the x or y coordinate of the computed gradient and putting the rest in a sparse gradient (I did it in a kind of janky way; suggestions for improving this code stylistically would be much appreciated). When we get around to testing with weight decay I'll have to replace the assertEqual with a more approximate test, since the strategy we plan on using won't give us exactly the same results as the dense version.

    def _test_rosenbrock_sparse(self, constructor):
        params_t = torch.Tensor([1.5, 1.5])

        params   = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
        params_c = Variable(torch.Tensor([1.5, 1.5]), requires_grad=True)
        optimizer   = constructor([params])
        optimizer_c = constructor([params_c])

        solution = torch.Tensor([1, 1])
        initial_dist = params.data.dist(solution)

        def eval(params, sparse_grad, w):
            optimizer.zero_grad()
            loss = rosenbrock(params)
            loss.backward()
            # params.grad.data now has the computed gradient. Turn
            # it into a sparse tensor
            params.grad.data.copy_(drosenbrock(params.data))
            # This is really goofy
            if w:
                i = torch.LongTensor([[0]])
                v = torch.DoubleTensor([params.grad.data[0]])
            else:
                i = torch.LongTensor([[1]])
                v = torch.DoubleTensor([params.grad.data[1]])
            x = sparse.DoubleTensor(i, v, torch.Size([2]))
            if sparse_grad:
                params.grad.data = x
            else:
                params.grad.data = x.to_dense()
            return loss

        for i in range(2000):
            w = torch.rand(1)[0] > 0.5
            optimizer.step(functools.partial(eval, params, True, w))
            optimizer_c.step(functools.partial(eval, params_c, False, w))
            self.assertEqual(params.data, params_c.data)

        self.assertLessEqual(params.data.dist(solution), initial_dist)

@ezyang
Contributor

ezyang commented Apr 19, 2017

Discussion with @adamlerer: if we amortize sparse updates in this way, we need some sort of "flushing" operation which a user can call to bring the tensor up to its "correct value" (since we may have accumulated updates on parameters that haven't yet been applied to the underlying tensor itself). This is a change to the API for optimizers (a new flush function a user has to call at the end of optimizing), but since hardly anyone is training sparse models with the stock optimizers today, this doesn't seem like a bad thing to ask them to do.
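
One possible shape for that flush(), continuing the hypothetical LazyWeightDecay sketch from earlier in the thread (again an illustration, not a proposed torch.optim API): it simply catches every row up on its deferred decay.

        def flush(self):
            """Apply all deferred decay so `param` holds its true value,
            e.g. before checkpointing or computing weight statistics."""
            missed = (self.step_count - self.last_step).to(self.param.dtype)
            self.param.data *= ((1 - self.lr * self.wd) ** missed).unsqueeze(1)
            self.last_step.fill_(self.step_count)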

@ezyang
Contributor

ezyang commented Apr 20, 2017

Now that I have a much better grasp of what is going on with the optimizers, let me give an expanded "state of play":

  • Adagrad is already converted to use sparse updates: https://github.com/pytorch/pytorch/pull/735/files#diff-533671a4db549248f40d40c6b5473916 This made the code a bit more convoluted, since the naive implementation converts the sparse tensors into their dense representations too early, when it calls addcmul (it would be nice to clean this up, though it is not obvious to me what strategy to take here). However, Adagrad is a "nice" algorithm, in the sense that when the gradient is sparse, the final update is sparse as well.

  • For Adam, however, the updates to the parameters are NOT sparse even when the gradients are sparse, because it accumulates moments: every step implies a full dense update to the moments (to decay the old values) and to the parameters (this was also observed by @martinraison). Indeed, TF has had to grapple with this issue ("RMSProp optimization support for sparse tensors", tensorflow/tensorflow#464); at the time this bug was written, TF's Adam implementation did a dense update, causing performance problems for other users ("Slow Adam sparse updates in distributed TF", tensorflow/tensorflow#6460). Eventually, to solve this problem, they merged a "lazy Adam optimizer" (tensorflow/tensorflow@819a690d). The way this optimizer works is that (1) only the moments that show up in the gradient get updated, and (2) only those portions of the gradient get applied to the parameters (a minimal sketch of this row-wise update appears at the end of this comment).

    We could just implement this algorithm directly, but Adam's original suggestion about how to handle sparse updates with weight decay got me thinking about whether we could play similar tricks in Adam's case, by applying the deferred momentum updates when we actually update the parameter in question (it's just a geometric series). On reflection, that's not a good idea: the point of momentum is to apply an extra force until the gradient tells us to stop. If we're training on sparse data and batch up momentum, we could easily shoot way past the optimal point before we got more information from the gradient to go the other way. So we probably do want to do as TF does.

  • Reading about Nesterov, it seems to me that a similar trick should apply when we have deferred updates pending from weight decay / momentum / etc. The big problem is that, unlike with Nesterov, we don't know which deferred updates we are going to apply until we've computed our gradient once. If there were a way to quickly find out which indices the sparse update would touch (without bothering to compute the entire gradient), we could first apply any pending momentum/weight decay updates, and THEN compute the gradient at the new point (just as with Nesterov). I don't know enough about the autograd mechanism to know how easy or hard this would be to do.

  • A simple matter: lazy weight decay should apply to all optimizers. It will be annoying if we have to reimplement it each time, so a mixin is probably in order.

It will probably be interesting to see what TF is doing for all of the other optimizers we have.
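
As promised above, here is a minimal sketch of that lazy Adam row update. It assumes a 2-D embedding-style parameter with a sparse COO gradient (the kind nn.Embedding(sparse=True) produces); the function name and default hyperparameters are made up, and this is a reading of the TF approach rather than PyTorch's eventual implementation:

    import torch

    def lazy_adam_step(param, exp_avg, exp_avg_sq, step, sparse_grad,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One lazy Adam step: only rows present in `sparse_grad` have their
        moments and parameter values updated; all other rows are untouched."""
        g = sparse_grad.coalesce()
        rows, vals = g.indices()[0], g.values()

        # update first and second moments only for the touched rows
        exp_avg[rows] = beta1 * exp_avg[rows] + (1 - beta1) * vals
        exp_avg_sq[rows] = beta2 * exp_avg_sq[rows] + (1 - beta2) * vals * vals

        # bias correction uses the global step count
        bias_correction1 = 1 - beta1 ** step
        bias_correction2 = 1 - beta2 ** step
        denom = (exp_avg_sq[rows] / bias_correction2).sqrt().add_(eps)

        # apply the update only to the touched rows
        param.data[rows] -= lr * (exp_avg[rows] / bias_correction1) / denom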

ezyang referenced this issue in ezyang/pytorch-unattached Apr 28, 2017
Towards fixing #1285.

This is a WIP commit to get the ball rolling on code review
(I am sure I have done great violence to the various coding
standards of your project.)

Things to be done:

    - Continue adding sparse support for other parameters
      and optimizers

    - Add some more tests, including a unit test ensuring
      that a single step is what we expect

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang referenced this issue in ezyang/pytorch-unattached May 8, 2017
Towards fixing #1285.

@soumith
Member

soumith commented Aug 26, 2017

@jph00

jph00 commented Aug 26, 2017

In the meantime, a documentation update would be helpful: document the sparse parameter to nn.Embedding and mention where it can't be used (weight decay and some optimizers).

@jph00

jph00 commented Aug 31, 2017

I assume weight decay should work the same way @ezyang points out lazy Adam should be done. No reason to decay weights in batches where they're not used, right?

@apaszke
Contributor

apaszke commented Aug 31, 2017

Well, there might be a reason: if you do it lazily, the weight vector is never up to date, so statistics you might compute from it (e.g. the weight distribution) would be incorrect.

@adamlerer
Contributor Author

adamlerer commented Aug 31, 2017

My 2c:

I thought about it a bit, and this naive lazy update kind of makes sense for Adam. If you apply all of the updates (m, v, p) sparsely, then indeed this should behave pretty similarly to (probably even better than) dense Adam. I would argue that you shouldn't be using a momentum method with sparse updates, but whatever...

Weight decay is different. It's a component of the loss, and if you do it sparsely, you are no longer computing weight decay, you're computing something else. There's a very good reason why weight decay should be applied to all parameters: the whole point of L2 regularization is that it should drive down the value of parameters g_i for which dL/dg_i is small. If you have a parameter which is almost always "sparse" (i.e. dL/dg_i is 0 most of the time), then it should be pushed down strongly (i.e. weight decay should be applied at each gradient update, although this can be done in an amortized fashion to save on compute).

@fabianp

fabianp commented Aug 31, 2017

[apologies for hijacking the conversation] Just wanted to point out that we recently developed a technique to deal with sparse updates (here and here) which avoids all the bookkeeping required for the lazy updates. It was originally developed for asynchronous algorithms (where lazy updates are not an option), but it also works great for sequential algorithms. I believe this technique should be applicable to the case of Adam as well, but I haven't looked into it yet. It's more of a research project, but if someone is interested in this, just drop me a line. [/end hijacking the conversation]

@jph00

jph00 commented Sep 1, 2017

Well... I do want to point out that the TensorFlow implementation works extremely well for embedding layers. It's very helpful to have both momentum methods and weight decay in embedding layers, but the current PyTorch sparse approach doesn't work at all in this case. If I remove weight decay and use Adagrad (which works with sparse layers) I don't get good results. With TensorFlow (using Keras) I've been able to build text classification (sentiment analysis) models that are far more accurate, have fewer parameters, and train more quickly and reliably using their approach to sparse updates. To be clear: they're far better regardless of whether I'm using weight decay - their approach to sparse updates seems to be giving much better results (and I've confirmed that other details like initializers are the same).

The goal here shouldn't be to perform only as well as dense adam (which performs very poorly in this case), but as well as tensorflow's sparse adam (and sparse momentum, and sparse weight decay, etc).

Whilst I'm not sure about @adamlerer's theory, I do think that a lazy approach should at least be benchmarked against the simple gradient masking approach to see when/if it's actually better. If it's not better for real examples, then it's probably not a good idea.

Just my $0.02...

@adamlerer
Contributor Author

@jph00 thanks for the anecdotal results, that's really good to know. Are you sure that it's LazyAdam in particular that leads to good embedding layers (vs. some other TF-vs-PT difference)? I.e., have you trained the same model in TF with dense Adam / Adagrad / SGD and observed substantially worse embeddings?

To clarify, you say you use sparse momentum + weight decay, but I don't see any weight decay option in tensorflow/tensorflow@819a690d , so how are you applying weight decay? Is it dense or sparse?

Re: my thoughts, I guess my main request is that if S is a sparse tensor, then it should be the case that F(S) = F(toDense(S)) for all operators F in PyTorch, unless specially marked otherwise. So weight decay shouldn't magically be different for sparse tensors. But if there's some other optimizer that works better for embeddings, by all means let's implement it. (Note: maybe we shouldn't assume that every sparse gradient comes from an embedding...)
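
For what it's worth, that invariant can be expressed as a small property-style check; the helper name and the toy op below are placeholders, not an existing test:

    import torch

    def check_sparse_dense_equivalence(op, sparse_input, atol=1e-6):
        """Check that an op gives the same answer on a sparse tensor as on
        its dense equivalent, i.e. F(S) == F(toDense(S))."""
        sparse_out = op(sparse_input)
        dense_out = op(sparse_input.to_dense())
        if sparse_out.is_sparse:
            sparse_out = sparse_out.to_dense()
        assert torch.allclose(sparse_out, dense_out, atol=atol)

    # example: summing a small sparse COO tensor
    i = torch.tensor([[0, 2]])
    v = torch.tensor([1.0, 3.0])
    s = torch.sparse_coo_tensor(i, v, (4,))
    check_sparse_dense_equivalence(lambda t: t.sum(), s)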

@jph00

jph00 commented Sep 1, 2017

It's definitely not just lazyadam - I'm guessing it's just sparse_apply in general. But I don't know the exact source of the difference; at this point all I can say for sure is that keras, with tf backend, trains simple models with embeddings extremely quickly and extremely accurately compared to pytorch, and allows the use of weight decay and any optimizer. I've only briefly scanned the keras code, but it seems to generally be just passing stuff down to tf with minimal fiddling, so I was guessing that the reason must be in how tf is handling the embeddings. But I don't know that for sure.

I also noticed that setting sparse=True in pytorch resulted in better results for simple models than setting sparse=False. So even in pytorch something about that approach is resulting in improvements.

Which is to say: trying to match dense results doesn't seem like it should be the goal here...

@soumith
Member

soumith commented Sep 1, 2017

Jeremy, is it possible to share your Keras & PyTorch scripts (if they are not proprietary and stuff)? I'm down to take a deeper look, and work on PyTorch features to get to parity.

@jph00

jph00 commented Sep 1, 2017

Of course, I'd love to! I don't have anything proprietary - it'll all be part of the course... Let me spend some time today trying to create something as clear and concise as I can. I'll add a link here when done.

@jph00

jph00 commented Sep 1, 2017

Ugh, I spent the last 4 hours trying to create a clear example, and ended up finding that the key issue is actually in the difference between torchtext and keras padding (keras pre-pads by default, torchtext post-pads). I had no idea this could be such a big issue - but it fully explains the differences I was seeing! So sorry to waste your time with my incorrect finger-pointing at the sparse updates.

I'm not going to fully back-track and say that updating only the embeddings that were used isn't a good idea; I don't have the data either way on this. I hope to have time sometime after the next course to go back over this and do some proper experiments.

I will say however that the fact that keras/tf handles these models so quickly regardless of weight decay, adam, etc, is pretty damn cool, although I'm not really sure how they do it!

@weiyangfb weiyangfb added the module: sparse Related to torch.sparse label Sep 12, 2018
@weiyangfb weiyangfb self-assigned this Nov 28, 2018
@ezyang ezyang added the feature A request for a proper, new feature. label Apr 1, 2019
@ezyang ezyang removed the enhancement label Apr 1, 2019
@Kaixhin
Contributor

Kaixhin commented Jun 7, 2019

Would like to see SparseRMSprop. Talked to Alex Pritzel about reproducing Neural Episodic Control, and he mentioned specifically that they update the Differentiable Neural Dictionary with sparse updates in TensorFlow, so only accessed keys have gradient averages calculated. Given that the DND is used in other work as well this optimiser would be nice to have.
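
For reference, the row-wise update such a SparseRMSprop would perform might look like the following sketch (the signature and defaults are invented); only the "accessed keys", i.e. the rows present in the sparse gradient, have their square-averages updated:

    import torch

    def sparse_rmsprop_step(param, square_avg, sparse_grad,
                            lr=1e-2, alpha=0.99, eps=1e-8):
        """RMSprop step restricted to the rows present in the sparse gradient."""
        g = sparse_grad.coalesce()
        rows, vals = g.indices()[0], g.values()
        square_avg[rows] = alpha * square_avg[rows] + (1 - alpha) * vals * vals
        param.data[rows] -= lr * vals / (square_avg[rows].sqrt() + eps)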

@jeffreyksmithjr jeffreyksmithjr added module: optimizer Related to torch.optim triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed medium priority (this tag is deprecated) labels Jun 10, 2019
@Santosh-Gupta

Santosh-Gupta commented Sep 16, 2019

Would like to see SparseRMSprop. Talked to Alex Pritzel about reproducing Neural Episodic Control, and he mentioned specifically that they update the Differentiable Neural Dictionary with sparse updates in TensorFlow, so only accessed keys have gradient averages calculated. Given that the DND is used in other work as well this optimiser would be nice to have.

@Kaixhin

Sparse RMSprop is feasible with a library I just released

https://github.com/Santosh-Gupta/SpeedTorch

@karlhigley

karlhigley commented Sep 17, 2020

Weight decay is different. It's a component of the loss, and if you do it sparsely, you are no longer computing weight decay, you're computing something else. There's a very good reason why weight decay should be applied to all parameters: the whole point of L2 regularization is that it should drive down the value of parameters g_i for which dL/dg_i is small. If you have a parameter which is almost always "sparse" (i.e. dL/dg_i is 0 most of the time), then it should be pushed down strongly (i.e. weight decay should be applied at each gradient update, although this can be done in an amortized fashion to save on compute).

Computing weight decay this way is actually common when optimizing recommendation models like L2-regularized matrix factorization with SGD. The state of the PyTorch optimizers w.r.t. sparsity makes this a lot harder than it needs to be. Attempting to apply SGD weight decay to a sparse embedding layer currently gives the error:

File "torch/optim/sgd.py", line 99, in step
d_p = d_p.add(p, alpha=weight_decay)
RuntimeError: add(sparse, dense) is not supported. Use add(dense, sparse) instead.

Since there isn't another behavior currently implemented for this case, it seems like allowing sparse weight decay here wouldn't really hurt. Maybe the main concern would be avoiding a performance regression in the dense case?
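
For what it's worth, the sparse analogue of that failing line would fold only the touched rows of the parameter back into the gradient, along these lines (a sketch of the idea, not the actual torch/optim/sgd.py code):

    import torch

    def sgd_step_with_sparse_weight_decay(param, sparse_grad, lr, weight_decay):
        """SGD step where weight decay is applied only to the rows that appear
        in the sparse gradient, i.e. the "sparse" weight decay discussed above."""
        g = sparse_grad.coalesce()
        rows, vals = g.indices()[0], g.values()
        # sparse counterpart of: d_p = d_p.add(p, alpha=weight_decay)
        d_p = vals + weight_decay * param.data[rows]
        param.data[rows] -= lr * d_p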

jjsjann123 added a commit to jjsjann123/pytorch that referenced this issue Dec 5, 2021
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this issue Nov 1, 2022
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this issue Nov 1, 2022
hubertlu-tw added a commit to hubertlu-tw/pytorch that referenced this issue Nov 1, 2022
@ngimel ngimel closed this as completed Apr 28, 2023
petrex pushed a commit to petrex/pytorch that referenced this issue Aug 29, 2024
Co-authored-by: Wang, Yanyao <yanyao.wang@amd.com>
jagadish-amd pushed a commit to jagadish-amd/pytorch that referenced this issue Sep 5, 2024
Co-authored-by: Wang, Yanyao <yanyao.wang@amd.com>