[Bug] AffineInputTransform causes retain_graph errors when previous transforms have parameters #1635

Closed
mrcslws opened this issue Jan 20, 2023 · 9 comments
Labels
bug (Something isn't working) · WIP (Work in Progress)

Comments

@mrcslws

mrcslws commented Jan 20, 2023

🐛 Bug

If you use an AffineInputTransform (like Normalize or InputStandardize) after an input transform with learned parameters, the AffineInputTransform will unintentionally store values that carry autograd compute graphs, because this class was not written to expect X to have gradient information during training.

This has two effects:

  1. During inference, when calling loss.backward() to optimize X, there are retain_graph errors.
  2. It needlessly slows the model down, as PyTorch will compute gradients for AffineInputTransform's buffers during optimize_acqf.

This bug will occur any time X has gradient information during training, so there may be other user scenarios where this happens.

To reproduce

import botorch
import gpytorch
import torch

print(f"botorch: {botorch.__version__}")
print(f"gpytorch: {gpytorch.__version__}")
print(f"torch: {torch.__version__}")

train_X = torch.tensor([[0., 0.],
                        [0.1, 0.1],
                        [0.5, 0.5]])
train_Y = torch.tensor([[0.1],
                        [0.5],
                        [1.0]])
test_X = torch.tensor([[0.3, 0.3]])


# Workaround: change this to True
use_workaround = False
if use_workaround:
    class PatchedNormalize(botorch.models.transforms.Normalize):
        """
        Fix a bug in botorch's Normalize
        """
        def train(self, mode=True):
            if not mode:
                # During training, previous transforms have caused X to have a
                # backprop compute graph, which is then propagated into the
                # coefficient and offset.
                self.coefficient = self.coefficient.detach()
                self.offset = self.offset.detach()

            super().train(mode)

    normalize = PatchedNormalize(2)
else:
    normalize = botorch.models.transforms.Normalize(2)


input_transform = botorch.models.transforms.ChainedInputTransform(
    # Example input transform with learned parameters
    warp=botorch.models.transforms.Warp(indices=[0, 1]),
    normalize=normalize,
)

model = botorch.models.SingleTaskGP(
    train_X, train_Y,
    input_transform=input_transform
)

botorch.fit_gpytorch_mll(
    gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
)

posterior = model.posterior(test_X)

# Silly placeholder loss function; bug occurs with any loss function
loss = -(posterior.mean - posterior.variance).sum()
loss.backward()

Error message:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Expected Behavior

botorch is correct to propagate gradient information into coefficient and offset for use during training, but the saved versions of these values (for use during inference) should not contain gradient information. The code should be refactored so that it continues to use offset and coefficient with gradient information during training, but it should always call detach() before storing these on self.

System information

botorch: 0.7.3.dev22+g208470e7.d20221102
gpytorch: 1.9.1.dev32+g23b068b5
torch: 1.13.1
macOS 13.1, Apple M1

@mrcslws added the bug label Jan 20, 2023
@Balandat
Contributor

Thanks for this most excellent bug report & proposed fix! We can certainly patch this in.

@saitcakmak do you think it makes sense to do this now, or should this just go into your planned transform refactor / upstream?

@saitcakmak
Contributor

Thanks for flagging this!

These coefficients should never have gradients attached, right? This is an affine transform, so as long as the coefficients carry no gradient, the gradient with respect to X is simply subject to the same transform. If they do carry gradients, that only leads to issues, as discovered here.

Adding detach in a few places should handle this. I can put that together in a bit.

saitcakmak added a commit to saitcakmak/botorch that referenced this issue Jan 24, 2023
Summary: See pytorch#1635. These should never retain grad.

Differential Revision: D42700421

fbshipit-source-id: e3057ac5e4ff50700840497b52f1ff807544d4ca
@mrcslws
Author

mrcslws commented Jan 24, 2023

Here's an example that illustrates why I think gradient information should flow through the Normalize coefficients / offsets.

Suppose you have two hyperparameters, num_epochs_high_lr and num_epochs_low_lr, and you want to append a learned feature w1*num_epochs_high_lr + w2*num_epochs_low_lr, with w1 and w2 learned during training. Consider the impact of increasing w1. Increasing w1 has two effects:

  1. It increases the value of this learned feature.
  2. It increases the max value of this learned feature, so Normalize lowers the learned feature's value for many rows of train_X.

Thus the actual effect of increasing w1 is to increase some train_X feature values and decrease other feature values. However, if you've not allowed gradients to flow through the Normalize coefficients / offset, you've removed this information from the gradient, and the optimization will be aware of Effect 1 but not Effect 2.
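
Here's a standalone torch snippet (not BoTorch code) that makes Effect 2 concrete. The feature values are made up, and for a purely multiplicative feature the two effects cancel exactly, which makes the difference easy to see:

import torch

# Toy illustration: w1 scales a learned feature, which is then normalized by
# its max over the training rows, a max that itself depends on w1.
w1 = torch.tensor(2.0, requires_grad=True)
raw_feature = w1 * torch.tensor([1.0, 3.0, 5.0])  # feature value per training row

# Keep the graph through the normalization constant (Effects 1 and 2): scaling
# w1 rescales both the feature and its max, so the normalized values do not
# change and the gradient is exactly zero.
normalized = raw_feature / raw_feature.max()
print(torch.autograd.grad(normalized.sum(), w1, retain_graph=True))  # (tensor(0.),)

# Detach the constant (Effect 2 is dropped): the optimizer now sees a spurious
# gradient even though the normalized features are unchanged.
normalized_detached = raw_feature / raw_feature.max().detach()
print(torch.autograd.grad(normalized_detached.sum(), w1))  # (tensor(0.9000),)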

I agree that it's strange for the cached values to ever contain grad info. My personal preference is to refactor to use the following pattern:

# Pseudocode
if self.training:
    offset, coefficient = compute_these()
    self.offset = offset.detach()
    self.coefficient = coefficient.detach()
else:
    offset = self.offset
    coefficient = self.coefficient

# now use offset and coefficient (not self.offset and self.coefficient)
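
Here's a rough sketch of how that pattern might look in a Normalize-style transform. The class, attribute names, and min/max computation below are illustrative assumptions, not BoTorch's actual implementation:

import torch
from torch import Tensor, nn


class SketchNormalize(nn.Module):
    """Illustrative Normalize-like transform using the pattern above."""

    def __init__(self, d: int) -> None:
        super().__init__()
        self.register_buffer("offset", torch.zeros(1, d))
        self.register_buffer("coefficient", torch.ones(1, d))

    def forward(self, X: Tensor) -> Tensor:
        if self.training:
            # Graph-carrying values, used only for this forward pass.
            offset = X.amin(dim=-2, keepdim=True)
            coefficient = (X.amax(dim=-2, keepdim=True) - offset).clamp_min(1e-8)
            # Cache detached copies so the module's buffers never retain a graph.
            self.offset = offset.detach()
            self.coefficient = coefficient.detach()
        else:
            offset, coefficient = self.offset, self.coefficient
        return (X - offset) / coefficient

The values used in the forward pass still carry the graph during training, so the gradient flow described above is preserved; only the cached copies used at inference time are detached.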

@SebastianAment
Contributor

Thanks @mrcslws for reporting this and @saitcakmak for putting up the PR!

I agree that this makes sense for Normalize and Standardize, but I believe that a generic AffineInputTransform should by default maintain and propagate gradient information. By generic, I mean an affine transform whose coefficients and offsets are computed by a different mechanism than in Normalize and Standardize, and which we might want to optimize. An application I am working on has such a case.

So I suggest we call the _detach_buffers helper in #1642 only for Normalize and Standardize, but not AffineInputTransform.

@saitcakmak
Contributor

@SebastianAment does @mrcslws's suggestion work for your case as well? It'd let you have trainable coefficient and offset but would not have the issue of gradients leaking into future computations.

@SebastianAment
Contributor

That should work! We could also make the buffers private (coefficient -> _coefficient) and define

@property
def coefficient(self) -> Tensor:
    coeff = self._coefficient
    return coeff if self.learn_coefficients and self.training else coeff.detach()

@property
def offset(self) -> Tensor:
    offset = self._offset
    return offset if self.learn_coefficients and self.training else offset.detach()

When someone needs gradient information for the coefficients, this structure would allow them to access it independently of the _update_coefficients call.
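
As a quick illustration of how those properties would behave, here is a toy stand-alone module (not the real AffineInputTransform, which has more machinery):

import torch
from torch import Tensor, nn


class ToyAffine(nn.Module):
    """Minimal stand-in for the property-based suggestion above."""

    def __init__(self, coefficient: Tensor, learn_coefficients: bool = False) -> None:
        super().__init__()
        self.learn_coefficients = learn_coefficients
        self.register_buffer("_coefficient", coefficient)

    @property
    def coefficient(self) -> Tensor:
        coeff = self._coefficient
        return coeff if self.learn_coefficients and self.training else coeff.detach()


# A coefficient that, as in the bug above, carries a compute graph.
w = torch.tensor([2.0], requires_grad=True)
t = ToyAffine(coefficient=3.0 * w, learn_coefficients=True)
print(t.coefficient.requires_grad)  # True: train mode exposes the graph
t.eval()
print(t.coefficient.requires_grad)  # False: eval mode hands out a detached tensor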

@Balandat
Contributor

That should work! We could also make the buffers private (coefficient -> _coefficient)

This seems like a good solution to me! @saitcakmak if this makes sense to you as well could you incorporate that into #1642?

@esantorella
Member

Any thoughts on how we can test that necessary and correct gradient info is propagated and retained? And how we can test that it's not persisted where it's not needed? (The retain_graph error in the original post could provide one test.)
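
Here's a rough sketch of one such regression test, based on the repro in the original post. The test name, structure, and buffer checks are illustrative, not the test that eventually landed:

import botorch
import gpytorch
import torch


def test_no_retained_graph_with_warp_then_normalize():
    train_X = torch.tensor([[0.0, 0.0], [0.1, 0.1], [0.5, 0.5]])
    train_Y = torch.tensor([[0.1], [0.5], [1.0]])
    input_transform = botorch.models.transforms.ChainedInputTransform(
        warp=botorch.models.transforms.Warp(indices=[0, 1]),
        normalize=botorch.models.transforms.Normalize(2),
    )
    model = botorch.models.SingleTaskGP(
        train_X, train_Y, input_transform=input_transform
    )
    botorch.fit_gpytorch_mll(
        gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
    )
    model.eval()

    # The cached affine buffers should not expose an autograd graph outside training.
    normalize = model.input_transform["normalize"]
    assert not normalize.coefficient.requires_grad
    assert not normalize.offset.requires_grad

    # Backpropagating through the posterior (as optimize_acqf would) must not
    # raise the retain_graph error from the original post.
    test_X = torch.tensor([[0.3, 0.3]], requires_grad=True)
    posterior = model.posterior(test_X)
    loss = -(posterior.mean - posterior.variance).sum()
    loss.backward()
    assert test_X.grad is not None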

@esantorella added the WIP label Jan 30, 2023
saitcakmak added a commit to saitcakmak/botorch that referenced this issue Jan 30, 2023
…h#1642)

Summary:
Pull Request resolved: pytorch#1642

See pytorch#1635. These should only retain grad if learning the coefficient and while in train mode.

Differential Revision: D42700421

fbshipit-source-id: 907f72a6f585f55f1496e23f58bc463eac063860
@saitcakmak saitcakmak linked a pull request Jan 30, 2023 that will close this issue
facebook-github-bot pushed a commit that referenced this issue Jan 31, 2023
Summary:
Pull Request resolved: #1642

See #1635. These should only retain grad if learning the coefficient and while in train mode.

Reviewed By: SebastianAment

Differential Revision: D42700421

fbshipit-source-id: 3853071f256b268ef46a239a91b13486199d6cac
@saitcakmak
Contributor

Closing this since #1642 landed
