
Added single-objective JES with additional conditioning, removal of MO-specific code / additional info-theoretic utils #1738

Closed
wants to merge 27 commits

Conversation

@hvarfner (Contributor) commented Mar 9, 2023

Variable names are made consistent with MES/GIBBON. Input constructors for JES & JES-exploit are added, as well as an optimizer of pathwise posterior samples and a utility method for retrieving optima.

Motivation

Refactoring of single-objective JES: a larger codebase, but slightly faster, with better support and an (arguably) more intuitive structure.

There is a substantial amount of multi-objective-specific computation in single-objective JES. At its core, it is relatively similar to MES/GIBBON, and this suggested change tries to reflect that. It runs faster than before (due to the removal of the MO-specific computation) and is (arguably) a bit more intuitive. Moreover, there is the option of running the two different conditioning variants (noisy or noiseless optimal values). The performance difference, however, is minor.

Also added the exploit-variant of JES for added robustness in cases of model misspecification.

Optimization of samples from the posterior

Added batch optimization of pathwise samples and a utility method in acquisition/utils. Hyperparameters are set so as to (anecdotally/empirically) avoid memory or runtime issues on a laptop while staying close to peak performance.
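As a rough usage sketch of the new utility (the function name get_optimal_samples and its arguments are assumptions based on this description, not necessarily the merged API):

import torch
from botorch.acquisition.utils import get_optimal_samples
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(20, 2, dtype=torch.double)
train_Y = (train_X * 6.28).sin().sum(dim=-1, keepdim=True)
model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)

# Draw pathwise posterior samples and optimize them in batch to obtain
# (approximate) optima to condition on in JES.
optimal_inputs, optimal_outputs = get_optimal_samples(
    model, bounds=bounds, num_optima=16
)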

Input constructors for JES

Support for a fully Bayesian variant of JES, which didn't quite work previously; it requires condition_on_observations to work with fully Bayesian models (see the incoming PR).

Have you read the Contributing Guidelines on pull requests?

Yes!

Test Plan

Tested the related notebooks, the Ax interfaces, and the performance of the two different JES methods.

Related PRs

Yes, proposed fix to FBGPs. Coming soon.

@facebook-github-bot (Contributor)

Hi @hvarfner!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot added the CLA Signed label on Mar 9, 2023
@facebook-github-bot (Contributor)

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@saitcakmak (Contributor)

Hi @hvarfner. Thanks for the PR! I'll try to get this reviewed within the next week. In the meantime: i) We require 100% unit test coverage for BoTorch, so you can work on those tests. ii) The .py file under tutorials is not needed; we auto-generate a .py file from the notebook while uploading the website.

@hvarfner (Contributor, Author)

@saitcakmak Thanks! The .py file in tutorials wasn't supposed to be in there. I'll work on the tests in the coming days!

Comment on lines 929 to 936
for iter_ in range(maxiter):
    # Decaying the learning rate - empirically, this has been well-calibrated
    lr = lr_init / (iter_ + 1)
    per_sample_outputs = path_func(X_top_candidates)
    grads = torch.autograd.grad(
        per_sample_outputs,
        X_top_candidates,
        grad_outputs=torch.ones_like(per_sample_outputs),
    )[0]
    X_top_candidates = torch.clamp(
        X_top_candidates + lr * grads, bounds[0], bounds[1]
    )  # TODO fix bounds here
Contributor:

The sample paths here are deterministic (and smooth) functions - rather than using a (manual) first-order gradient scheme, we could probably get much faster convergence by using a (quasi-)second-order method such as L-BFGS here.

@hvarfner (Contributor, Author), Mar 12, 2023:

Sure, will fix!

@hvarfner (Contributor, Author):

@Balandat After giving this some more thought, it seems pretty tricky to implement this in batch with the set of tools available. Would a for-loop solution suffice?

Contributor:

I feel like a for loop across each sample would be quite costly. Another option here would be to just define a function that sums the per_sample_outputs and then call L-BFGS-B on that (with the domain being k copies of the original domain, with k the number of sample paths). This is actually what we do for performing acquisition function optimization across random restarts:

def f_np_wrapper(x: np.ndarray, f: Callable):
    """Given a torch callable, compute value + grad given a numpy array."""
    if np.isnan(x).any():
        raise RuntimeError(
            f"{np.isnan(x).sum()} elements of the {x.size} element array "
            f"`x` are NaN."
        )
    X = (
        torch.from_numpy(x)
        .to(initial_conditions)
        .view(shapeX)
        .contiguous()
        .requires_grad_(True)
    )
    X_fix = fix_features(X, fixed_features=fixed_features)
    loss = f(X_fix).sum()
    # compute gradient w.r.t. the inputs (does not accumulate in leaves)
    gradf = _arrayify(torch.autograd.grad(loss, X)[0].contiguous().view(-1))
    if np.isnan(gradf).any():
        msg = (
            f"{np.isnan(gradf).sum()} elements of the {x.size} element "
            "gradient array `gradf` are NaN. "
            "This often indicates numerical issues."
        )
        if initial_conditions.dtype != torch.double:
            msg += " Consider using `dtype=torch.double`."
        raise RuntimeError(msg)
    fval = loss.item()
    return fval, gradf

@hvarfner (Contributor, Author):

Okay, I did try it some time ago with the torch L-BFGS but I must have done something wrong if it indeed works here. I'll explore it!
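To make the suggestion above concrete, here is a minimal sketch of maximizing the sum of the sample paths with scipy's L-BFGS-B. It assumes path_func maps a k x 1 x d batch (one point per path) to k per-path values, so that summing decouples the paths; it is not the helper that ended up in the PR:

import torch
from scipy.optimize import minimize


def optimize_paths_lbfgsb(path_func, X0, bounds):
    """Maximize k sample paths jointly; X0 is a k x d tensor of starting points."""
    k, d = X0.shape
    # L-BFGS-B bounds: d (lower, upper) pairs, repeated for each of the k copies.
    lbfgsb_bounds = list(zip(bounds[0].tolist(), bounds[1].tolist())) * k

    def f_and_grad(x):
        X = torch.from_numpy(x).to(X0).view(k, 1, d).requires_grad_(True)
        # Negate since scipy minimizes; the gradient of the sum w.r.t. each point
        # only involves that point's own path.
        loss = -path_func(X).sum()
        grad = torch.autograd.grad(loss, X)[0]
        return loss.item(), grad.reshape(-1).cpu().double().numpy()

    res = minimize(
        f_and_grad,
        X0.detach().reshape(-1).cpu().double().numpy(),
        jac=True,
        method="L-BFGS-B",
        bounds=lbfgsb_bounds,
    )
    return torch.from_numpy(res.x).to(X0).view(k, d)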

@Balandat Balandat requested a review from saitcakmak March 10, 2023 15:00
@Balandat (Contributor) left a comment:

Not a super thorough review, my main comment is about using a new acquisition function class to implement something generic like epsilon-greedy.

Can you also make sure to run black on this?


class qExploitLowerBoundJointEntropySearch(qLowerBoundJointEntropySearch):
    r"""The acquisition function for the Joint Entropy Search, where the batches
    `q > 1` are supported through the lower bound formulation.

Contributor:

I don't really like the pattern of defining a whole new acquisition function class for the sole purpose of doing some epsilon-greedy optimization. It seems like this pattern would immediately double the number of acquisition functions if we wanted to do this more generally. I think this sort of notion of how to deal with model misspecification should live at a higher level and the acquisition functions themselves should not be aware of that. For that reason I would prefer we don't add this qExploitLowerBoundJointEntropySearch acquisition function.

Is the main reason you're defining this here so that it's easily usable in Ax via the acquisition function constructor? On the BoTorch side, this kind of epsilon-greedy strategy can easily be implemented by slightly modifying the optimization loop. If this is challenging to do on the Ax side, we should make changes there to make that easier (which we're already thinking about - cc @saitcakmak).
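For context, a minimal sketch of that loop-level epsilon-greedy strategy (the wrapper function, epsilon, and the use of PosteriorMean here are illustrative assumptions, not the PR's code):

import torch
from botorch.acquisition import PosteriorMean
from botorch.optim import optimize_acqf


def epsilon_greedy_candidate(model, acqf, bounds, epsilon=0.1):
    # With probability epsilon, exploit by maximizing the posterior mean;
    # otherwise optimize the (JES) acquisition function as usual.
    if torch.rand(1).item() < epsilon:
        acqf = PosteriorMean(model)
    candidate, _ = optimize_acqf(
        acq_function=acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=512
    )
    return candidate

This keeps the handling of model misspecification outside the acquisition function class, as suggested.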

@hvarfner (Contributor, Author):

Yes, it's for Ax, and I agree - happy to remove it. A middle ground would be to have it as an input constructor plus a generic acquisition function wrapper, GreedyAcquisitionFunction(ActualAcquisitionFunction).

I guess that's a separate PR nonetheless.


class qLowerBoundJointEntropySearch(qLowerBoundMultiObjectiveJointEntropySearch):

class qLowerBoundJointEntropySearch(AcquisitionFunction, MCSamplerMixin):
Contributor:

nice, glad we're not making the single-objective acqf a subclass of the multi-objective one any more.

Contributor:

Though it seems a bit odd that qLowerBoundJointEntropySearch is really only the right name if the estimation_type is indeed LB and not MC?

@hvarfner (Contributor, Author):

Agreed, changing.

Comment on lines 72 to 73
condition_noiseless:
posterior_transform:
Contributor:

TODO

Comment on lines 90 to 94
# inputs come as num_optima_per_model x num_models x d
# but we want it four-dimensional in the Fully bayesian case,
# and three-dimensional otherwise (optionally remove, then re-add)
self.optimal_inputs = optimal_inputs.squeeze(1).unsqueeze(-2)
self.optimal_outputs = optimal_outputs.squeeze(1).unsqueeze(-2)
Contributor:

Not sure I fully understand what's going on here, can you explain this a bit more? In the non-fully-bayesian setting, I assume num_models is not a dimension, right? In which case the shape would be num_optima_per_model x 1 x d after this logic, but for the fully bayesian model it would be num_optima_per_model x num_models x 1 x d?

Contributor:

Ah I see this is mentioned below. Would be good to have the comment here though.

@hvarfner (Contributor, Author):

Sure! Some squeezes/unsqueezes cancelled out, so this should not be as confusing anymore.


per_sample_outputs = path_func(X_top_candidates).reshape(
    num_optima * batch_size, num_restarts
)
sample_argmax = torch.max(per_sample_outputs, dim=-1, keepdims=True)[1].flatten()
Contributor:

Suggested change
sample_argmax = torch.max(per_sample_outputs, dim=-1, keepdims=True)[1].flatten()
sample_argmax = torch.max(per_sample_outputs, dim=-1, keepdims=True).indices.flatten()

which acts as extra initial guesses for the optimization routine.
raw_samples: The number of samples with which to query the samples initially.
num_restarts The number of gradient descent steps to use on each of the best
found initial points.
Contributor:

Suggested change
found initial points.
found initial points.

Comment on lines 920 to 921
argtop_candidates = candidate_queries.argsort(dim=-1, descending=True)[
    ..., 0:num_restarts]
Contributor:

why not use topk?
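For reference, a topk version of the selection being discussed, equivalent to sorting and slicing but without the full sort:

argtop_candidates = candidate_queries.topk(num_restarts, dim=-1).indices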

..., 0:num_restarts]

# These are used as masks when retrieving the argmaxes
row_indexer = torch.arange(num_optima * batch_size).to(torch.long)
Contributor:

the .to(torch.long) should be a no-op (torch.arange already returns int64)

Suggested change
row_indexer = torch.arange(num_optima * batch_size).to(torch.long)
row_indexer = torch.arange(num_optima * batch_size)


@hvarfner (Contributor, Author) commented Mar 29, 2023

@saitcakmak @Balandat There! I hope it is all in order now. The support for fully Bayesian treatment is in place as well, but it doesn't work yet since the required conditioning support isn't there on the GPyTorch side. I'll happily keep an eye out to see that everything works as intended once that lands.

@facebook-github-bot (Contributor)

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@codecov bot commented Mar 29, 2023

Codecov Report

Merging #1738 (5a9873e) into main (4d219d6) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 5a9873e differs from pull request most recent head ccd2b1e. Consider uploading reports for the commit ccd2b1e to get more accurate results

@@            Coverage Diff             @@
##              main     #1738    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          170       170            
  Lines        14783     14912   +129     
==========================================
+ Hits         14783     14912   +129     
Impacted Files Coverage Δ
botorch/acquisition/input_constructors.py 100.00% <100.00%> (ø)
botorch/acquisition/joint_entropy_search.py 100.00% <100.00%> (ø)
botorch/acquisition/utils.py 100.00% <100.00%> (ø)
botorch/utils/sampling.py 100.00% <100.00%> (ø)


@Balandat (Contributor) left a comment:

Thanks! Mostly nits. I'll take another look at the GPyTorch PR for allowing conditioning on samples also in the batched-model (fully Bayesian) case in the next few days.

Comment on lines +335 to +337
cdf_mvs = normal.cdf(normalized_mvs).clamp_min(CLAMP_LB)
cdf_rescaled_mvs = normal.cdf(mvs_rescaled_mc).clamp_min(CLAMP_LB)
mv_ratio = cdf_rescaled_mvs / cdf_mvs

log_term = torch.log(mv_ratio) + conditional_logprobs
Contributor:

Hmm it would be good to just work with logcdfs from the get go here for numerical stability, but unfortunately I don't think torch has great support for those

Contributor:

Looks like my past self noticed that as well: pytorch/pytorch#52973
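For what it's worth, a log-space rewrite of the snippet above could look like the following, assuming torch.special.log_ndtr (the standard-normal log-CDF) is available in the targeted PyTorch version and that normal is a standard Normal; this is a sketch, not the code in the PR:

from torch.special import log_ndtr

# log(cdf(a) / cdf(b)) = log_ndtr(a) - log_ndtr(b), avoiding the clamped ratio
log_term = log_ndtr(mvs_rescaled_mc) - log_ndtr(normalized_mvs) + conditional_logprobs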

@Balandat (Contributor)

I'll take another look at the gpytorch PR for allowing conditioning on samples also in the batched model (fully Bayesian) case in the next few days.

For the time being, does your code just throw an error? If so, maybe you can add an informative NotImplementedError pointing to the discussion where we're trying to resolve this? That way we can potentially get this merged in before having everything figured out.
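For illustration, the guard could be as simple as the following (the class check and the message are placeholders, not the code that was added):

from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP

if isinstance(model, SaasFullyBayesianSingleTaskGP):
    raise NotImplementedError(
        "Fully Bayesian JES requires conditioning batched models on the sampled "
        "optima, which is not supported yet; see the linked discussion."
    )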

@hvarfner (Contributor, Author)

That makes sense, I added the error as well. I realize that I forgot to mention a specific technical detail previously:

For each of M models, we'd want to condition on an additional N optima (each of which yields a new GP). As such, the (entire) conditional model would have M x N outputs. This works fine for a regular SingleTaskGP: when optimal_inputs has shape N x 1 x D a batch dimension is simply added, whereas shape N x D just adds N observations to a single GP.

The expected behavior in the fully Bayesian case would then be to have optimal_inputs of shape M x N x 1 x D. I'm not sure if this would work if you want to preserve only one copy of the training data.

Happy to contribute any which way to the FB issue.

@hvarfner (Contributor, Author) commented Apr 1, 2023

After benchmarking a bit more, the method for obtaining posterior samples (which now uses gen_candidates_scipy) has issues. The quadratic scaling in the number of samples, incurred by optimizing their sum (instead of optimizing them individually), quickly becomes an issue in practice - tens of seconds to find ~64 samples when it should take about one.

Personally, I'd like to address this issue, either in this PR or in a subsequent one. I think the best action for now is to just go with gen_candidates_torch.

@hvarfner hvarfner closed this Apr 1, 2023
@hvarfner hvarfner reopened this Apr 1, 2023
@facebook-github-bot (Contributor)

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@esantorella (Member)

Hi, thanks for the pull request! A couple things:

  • Could you run black and ufmt as described here to fix the errors the linter is throwing?
  • Could you take a look at the documentation error raised here? It looks like there's an indentation issue. Making the documentation work is finicky, so let me know if you get stuck on that and I'll fix it myself.

@hvarfner (Contributor, Author)

@esantorella Thanks! I'd like to think it's fixed now, except for

paths: SamplePath,

where importing SamplePath causes a circular import. As such, SamplePath remains undefined, which flake8 doesn't like.

@Balandat (Contributor) left a comment:

importing SamplePath causes a circular import. As such, SamplePath remains undefined which flake8 doesn't like.

You can get around this with the TYPE_CHECKING variable from the typing module; I made comments in the code.
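A minimal sketch of that pattern (the import path for SamplePath is an assumption, and the signature is trimmed):

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only imported for static type checkers, so there is no circular import at runtime.
    from botorch.sampling.pathwise import SamplePath


def optimize_posterior_samples(paths: SamplePath):
    # With `from __future__ import annotations`, the annotation above is never
    # evaluated at runtime, while flake8/mypy still see the name.
    ...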

hvarfner and others added 7 commits April 13, 2023 22:19
variable names consistent with MES/GIBBON. Input constructors for JES & JES-exploit added, as well as optimizer of pathwise posterior samples & utility method for retrieving optima.
Co-authored-by: Max Balandat <Balandat@users.noreply.github.com>
@facebook-github-bot (Contributor)

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

hvarfner added 10 commits April 18, 2023 15:21
in the example to better highlight differences
Changed .forward() to call in optimize_posterior_samples
- Acqf.constructor
- Sample optimization
- Re-writing of Single-obj JES
- Tests
- Notebook fitting and changed examples for illustration's sake
optimize_posterior_samples - was accidentally left out of the previous commit.
@Balandat (Contributor)

@hvarfner looks like there are a couple of outstanding docs / lint issues. Do you think you can clean that up soon so that we can merge this in?

@esantorella (Member)

@hvarfner looks like there are a couple of outstanding docs / lint issues. Do you think you can clean that up soon so that we can merge this in?

These look like merge issues. They were fixed and got un-fixed after merging -- the good news is there is a version that was working.

@hvarfner (Contributor, Author) commented Apr 22, 2023

@Balandat @esantorella Sorry, I thought I had it figured out last time around. I don't quite understand how to address the last little bit, so I'd appreciate any pointers you may have.

I'm passing tests locally at this point (although flake8 suggests some 40-odd files to be changed) so I assume I'm doing at least one thing incorrectly.

@hvarfner hvarfner requested a review from esantorella April 22, 2023 09:08
@Balandat (Contributor) left a comment:

Left some comments that should fix the outstanding issues. There are also a couple of lines that aren't covered by tests, please make sure to get those as well. Thanks!

@Balandat (Contributor)

Looks like we're all good now. @esantorella could you merge this in, please?

@facebook-github-bot (Contributor)

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@esantorella merged this pull request in 04e235d.

@Balandat (Contributor)

Thanks a lot for your contribution, @hvarfner!

@hvarfner (Contributor, Author)

@Balandat @esantorella My pleasure, thanks for your patience! =)

Labels: CLA Signed, Merged
5 participants