[Bug] AffineInputTransform causes retain_graph errors when previous transforms have parameters #1635

Closed
mrcslws opened this issue Jan 20, 2023 · 9 comments
Labels
bug (Something isn't working) · WIP (Work in Progress)

Comments

@mrcslws

mrcslws commented Jan 20, 2023

🐛 Bug

If you use an AffineInputTransform (like Normalize or InputStandardize) after an input transform with learned parameters, the AffineInputTransform will unintentionally store values that carry autograd compute graphs, because this class was not written to expect X to have gradient information during training.

This has two effects:

  1. During inference, when calling loss.backward() to optimize X, there are retain_graph errors.
  2. It needlessly slows the model down, as PyTorch will compute gradients for AffineInputTransform's buffers during optimize_acqf.

This bug will occur any time X has gradient information during training, so there may be other user scenarios where this happens.

To reproduce

import botorch
import gpytorch
import torch

print(f"botorch: {botorch.__version__}")
print(f"gpytorch: {gpytorch.__version__}")
print(f"torch: {torch.__version__}")

train_X = torch.tensor([[0., 0.],
                        [0.1, 0.1],
                        [0.5, 0.5]])
train_Y = torch.tensor([[0.1],
                        [0.5],
                        [1.0]])
test_X = torch.tensor([[0.3, 0.3]])


# Workaround: change this to True
use_workaround = False
if use_workaround:
    class PatchedNormalize(botorch.models.transforms.Normalize):
        """
        Fix a bug in botorch's Normalize
        """
        def train(self, mode=True):
            if not mode:
                # During training, previous transforms have caused X to have a
                # backprop compute graph, which is then propagated into the
                # coefficient and offset.
                self.coefficient = self.coefficient.detach()
                self.offset = self.offset.detach()

            super().train(mode)

    normalize = PatchedNormalize(2)
else:
    normalize = botorch.models.transforms.Normalize(2)


input_transform = botorch.models.transforms.ChainedInputTransform(
    # Example input transform with learned parameters
    warp=botorch.models.transforms.Warp(indices=[0, 1]),
    normalize=normalize,
)

model = botorch.models.SingleTaskGP(
    train_X, train_Y,
    input_transform=input_transform
)

botorch.fit_gpytorch_mll(
    gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
)

posterior = model.posterior(test_X)

# Silly placeholder loss function; bug occurs with any loss function
loss = -(posterior.mean - posterior.variance).sum()
loss.backward()

Error message:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Expected Behavior

botorch is correct to propagate gradient information into coefficient and offset for use during training, but the saved versions of these values (for use during inference) should not contain gradient information. The code should be refactored so that it continues to use offset and coefficient with gradient information during training, but it should always call detach() before storing these on self.

System information

botorch: 0.7.3.dev22+g208470e7.d20221102
gpytorch: 1.9.1.dev32+g23b068b5
torch: 1.13.1
macOS 13.1, Apple M1

@mrcslws added the bug label Jan 20, 2023
@Balandat
Contributor

Thanks for this most excellent bug report & proposed fix! We can certainly patch this in.

@saitcakmak do you think it makes sense to do this now, or should this just go into your planned transform refactor / upstream?

@saitcakmak
Contributor

Thanks for flagging this!

These coefficients should never have gradients attached, right? This is an affine transform, so as long as the coefficients carry no gradient, the gradient with respect to X is simply subject to the same transform. If they do carry gradients, that only leads to issues, as discovered here.

Adding detach in a few places should handle this. I can put that together in a bit.

saitcakmak added a commit to saitcakmak/botorch that referenced this issue Jan 24, 2023
Summary: See pytorch#1635. These should never retain grad.

Differential Revision: D42700421

fbshipit-source-id: e3057ac5e4ff50700840497b52f1ff807544d4ca
@mrcslws
Author

mrcslws commented Jan 24, 2023

Here's an example that illustrates why I think gradient information should flow through the Normalize coefficients / offsets.

Suppose you have two hyperparameters, num_epochs_high_lr and num_epochs_low_lr, and you want to append a learned feature w1*num_epochs_high_lr + w2*num_epochs_low_lr, with w1 and w2 learned during training. Consider the impact of increasing w1. Increasing w1 has two effects:

  1. It increases the value of this learned feature.
  2. It increases the max value of this learned feature, so Normalize lowers the learned feature's value for many rows of train_X.

Thus the actual effect of increasing w1 is to increase some train_X feature values and decrease other feature values. However, if you've not allowed gradients to flow through the Normalize coefficients / offset, you've removed this information from the gradient, and the optimization will be aware of Effect 1 but not Effect 2.
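
Here's a standalone torch snippet (not BoTorch code) that makes Effect 2 concrete. The feature values are made up, and for a purely multiplicative feature the two effects cancel exactly, which makes the difference easy to see:

import torch

# Toy illustration: w1 scales a learned feature, which is then normalized by
# its max over the training rows, a max that itself depends on w1.
w1 = torch.tensor(2.0, requires_grad=True)
raw_feature = w1 * torch.tensor([1.0, 3.0, 5.0])  # feature value per training row

# Keep the graph through the normalization constant (Effects 1 and 2): scaling
# w1 rescales both the feature and its max, so the normalized values do not
# change and the gradient is exactly zero.
normalized = raw_feature / raw_feature.max()
print(torch.autograd.grad(normalized.sum(), w1, retain_graph=True))  # (tensor(0.),)

# Detach the constant (Effect 2 is dropped): the optimizer now sees a spurious
# gradient even though the normalized features are unchanged.
normalized_detached = raw_feature / raw_feature.max().detach()
print(torch.autograd.grad(normalized_detached.sum(), w1))  # (tensor(0.9000),)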

I agree that it's strange for the cached values to ever contain grad info. My personal preference is to refactor to use the following pattern:

# Pseudocode
if self.training:
    offset, coefficient = compute_these()
    self.offset = offset.detach()
    self.coefficient = coefficient.detach()
else:
    offset = self.offset
    coefficient = self.coefficient

# now use offset and coefficient (not self.offset and self.coefficient)
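
Here's a rough sketch of how that pattern might look in a Normalize-style transform. The class, attribute names, and min/max computation below are illustrative assumptions, not BoTorch's actual implementation:

import torch
from torch import Tensor, nn


class SketchNormalize(nn.Module):
    """Illustrative Normalize-like transform using the pattern above."""

    def __init__(self, d: int) -> None:
        super().__init__()
        self.register_buffer("offset", torch.zeros(1, d))
        self.register_buffer("coefficient", torch.ones(1, d))

    def forward(self, X: Tensor) -> Tensor:
        if self.training:
            # Graph-carrying values, used only for this forward pass.
            offset = X.amin(dim=-2, keepdim=True)
            coefficient = (X.amax(dim=-2, keepdim=True) - offset).clamp_min(1e-8)
            # Cache detached copies so the module's buffers never retain a graph.
            self.offset = offset.detach()
            self.coefficient = coefficient.detach()
        else:
            offset, coefficient = self.offset, self.coefficient
        return (X - offset) / coefficient

The values used in the forward pass still carry the graph during training, so the gradient flow described above is preserved; only the cached copies used at inference time are detached.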

@SebastianAment
Contributor

Thanks @mrcslws for reporting this and @saitcakmak for putting up the PR!

I agree that this makes sense for Normalize and Standardize, but I believe that a generic AffineInputTransform should by default maintain and propagate gradient information. By generic, I mean an affine transform whose coefficients and offsets are computed by a different mechanism than in Normalize and Standardize, and which we might want to optimize. An application I am working on has such a case.

So I suggest we call the _detach_buffers helper in #1642 only for Normalize and Standardize, but not AffineInputTransform.

@saitcakmak
Contributor

@SebastianAment does @mrcslws's suggestion work for your case as well? It'd let you have trainable coefficient and offset but would not have the issue of gradients leaking into future computations.

@SebastianAment
Contributor

That should work! We could also make the buffers private (coefficient -> _coefficient) and define

@property
def coefficient(self) -> Tensor:
    coeff = self._coefficient
    return coeff if self.learn_coefficients and self.training else coeff.detach()

@property
def offset(self) -> Tensor:
    offset = self._offset
    return offset if self.learn_coefficients and self.training else offset.detach()

When someone needs gradient information for the coefficients, this structure would allow them to access it independently of the _update_coefficients call.
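
As a quick illustration of how those properties would behave, here is a toy stand-alone module (not the real AffineInputTransform, which has more machinery):

import torch
from torch import Tensor, nn


class ToyAffine(nn.Module):
    """Minimal stand-in for the property-based suggestion above."""

    def __init__(self, coefficient: Tensor, learn_coefficients: bool = False) -> None:
        super().__init__()
        self.learn_coefficients = learn_coefficients
        self.register_buffer("_coefficient", coefficient)

    @property
    def coefficient(self) -> Tensor:
        coeff = self._coefficient
        return coeff if self.learn_coefficients and self.training else coeff.detach()


# A coefficient that, as in the bug above, carries a compute graph.
w = torch.tensor([2.0], requires_grad=True)
t = ToyAffine(coefficient=3.0 * w, learn_coefficients=True)
print(t.coefficient.requires_grad)  # True: train mode exposes the graph
t.eval()
print(t.coefficient.requires_grad)  # False: eval mode hands out a detached tensor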

@Balandat
Contributor

That should work! We could also make the buffers private (coefficient -> _coefficient)

This seems like a good solution to me! @saitcakmak if this makes sense to you as well could you incorporate that into #1642?

@esantorella
Member

Any thoughts on how we can test that necessary and correct gradient info is propagated and retained? And how we can test that it's not persisted where it's not needed? (The retain_graph error in the original post could provide one test.)
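
Here's a rough sketch of one such regression test, based on the repro in the original post. The test name, structure, and buffer checks are illustrative, not the test that eventually landed:

import botorch
import gpytorch
import torch


def test_no_retained_graph_with_warp_then_normalize():
    train_X = torch.tensor([[0.0, 0.0], [0.1, 0.1], [0.5, 0.5]])
    train_Y = torch.tensor([[0.1], [0.5], [1.0]])
    input_transform = botorch.models.transforms.ChainedInputTransform(
        warp=botorch.models.transforms.Warp(indices=[0, 1]),
        normalize=botorch.models.transforms.Normalize(2),
    )
    model = botorch.models.SingleTaskGP(
        train_X, train_Y, input_transform=input_transform
    )
    botorch.fit_gpytorch_mll(
        gpytorch.mlls.ExactMarginalLogLikelihood(model.likelihood, model)
    )
    model.eval()

    # The cached affine buffers should not expose an autograd graph outside training.
    normalize = model.input_transform["normalize"]
    assert not normalize.coefficient.requires_grad
    assert not normalize.offset.requires_grad

    # Backpropagating through the posterior (as optimize_acqf would) must not
    # raise the retain_graph error from the original post.
    test_X = torch.tensor([[0.3, 0.3]], requires_grad=True)
    posterior = model.posterior(test_X)
    loss = -(posterior.mean - posterior.variance).sum()
    loss.backward()
    assert test_X.grad is not None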

@esantorella added the WIP label Jan 30, 2023
saitcakmak added a commit to saitcakmak/botorch that referenced this issue Jan 30, 2023
…h#1642)

Summary:
Pull Request resolved: pytorch#1642

See pytorch#1635. These should only retain grad if learning the coefficient and while in train mode.

Differential Revision: D42700421

fbshipit-source-id: 907f72a6f585f55f1496e23f58bc463eac063860
@saitcakmak saitcakmak linked a pull request Jan 30, 2023 that will close this issue
facebook-github-bot pushed a commit that referenced this issue Jan 31, 2023
Summary:
Pull Request resolved: #1642

See #1635. These should only retain grad if learning the coefficient and while in train mode.

Reviewed By: SebastianAment

Differential Revision: D42700421

fbshipit-source-id: 3853071f256b268ef46a239a91b13486199d6cac
@saitcakmak
Contributor

Closing this since #1642 landed
