Added single-objective JES with additional conditioning, removal of MO-specific code / additional info-theoretic utils #1738
Conversation
Hi @hvarfner! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hi @hvarfner. Thanks for the PR! I'll try to get this reviewed within the next week. In the meantime: i) We require 100% unit test coverage for BoTorch, so you can work on those. ii) The
@saitcakmak Thanks! The .py file in tutorials wasn't supposed to be in there. I'll work on the tests in the coming days!
botorch/utils/sampling.py
Outdated
```python
for iter_ in range(maxiter):
    # Decaying the learning rate - empirically, this has been well-calibrated
    lr = lr_init / (iter_ + 1)
    per_sample_outputs = path_func(X_top_candidates)
    grads = torch.autograd.grad(
        per_sample_outputs,
        X_top_candidates,
        grad_outputs=torch.ones_like(per_sample_outputs),
    )[0]
    X_top_candidates = torch.clamp(
        X_top_candidates + lr * grads, bounds[0], bounds[1]
    )  # TODO fix bounds here
```
The sample paths here are deterministic (and smooth) functions - rather than using a (manual) first-order gradient scheme, we could probably get much faster convergence by using a (quasi-)second-order method such as L-BFGS here.
Sure, will fix!
@Balandat After giving this some more thought, it seems pretty tricky to implement this in batch with the set of tools available. Would a for-loop solution suffice?
I feel like a for loop across each sample would be quite costly. Another option here would be to just define a function that sums the `per_sample_outputs` and then call L-BFGS-B on that (with the domain being `k` copies of the original domain, with `k` the number of sample paths). This is actually what we do for performing acquisition function optimization across random restarts:
botorch/botorch/generation/gen.py
Lines 175 to 203 in 9302055
```python
def f_np_wrapper(x: np.ndarray, f: Callable):
    """Given a torch callable, compute value + grad given a numpy array."""
    if np.isnan(x).any():
        raise RuntimeError(
            f"{np.isnan(x).sum()} elements of the {x.size} element array "
            f"`x` are NaN."
        )
    X = (
        torch.from_numpy(x)
        .to(initial_conditions)
        .view(shapeX)
        .contiguous()
        .requires_grad_(True)
    )
    X_fix = fix_features(X, fixed_features=fixed_features)
    loss = f(X_fix).sum()
    # compute gradient w.r.t. the inputs (does not accumulate in leaves)
    gradf = _arrayify(torch.autograd.grad(loss, X)[0].contiguous().view(-1))
    if np.isnan(gradf).any():
        msg = (
            f"{np.isnan(gradf).sum()} elements of the {x.size} element "
            "gradient array `gradf` are NaN. "
            "This often indicates numerical issues."
        )
        if initial_conditions.dtype != torch.double:
            msg += " Consider using `dtype=torch.double`."
        raise RuntimeError(msg)
    fval = loss.item()
    return fval, gradf
```
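To make the suggestion concrete, here is a minimal sketch of the sum-then-optimize trick with SciPy's L-BFGS-B; `path_func`, the shapes, and the helper name are illustrative assumptions, not BoTorch API:

```python
import numpy as np
import torch
from scipy.optimize import minimize


def optimize_paths_lbfgsb(path_func, X_init: torch.Tensor, bounds: torch.Tensor):
    """Jointly maximize k sample paths; X_init is (k, d), bounds is (2, d)."""
    k, d = X_init.shape

    def f_and_grad(x: np.ndarray):
        X = torch.from_numpy(x).view(k, d).requires_grad_(True)
        # Summing decouples the k sub-problems: each path's gradient only
        # touches its own copy of the domain.
        loss = -path_func(X).sum()  # negate, since L-BFGS-B minimizes
        grad = torch.autograd.grad(loss, X)[0]
        return loss.item(), grad.reshape(-1).numpy()

    # k copies of the original box bounds, one per sample path.
    lbfgs_bounds = list(
        zip(bounds[0].repeat(k).tolist(), bounds[1].repeat(k).tolist())
    )
    res = minimize(
        f_and_grad,
        X_init.double().reshape(-1).numpy(),
        jac=True,
        method="L-BFGS-B",
        bounds=lbfgs_bounds,
    )
    return torch.from_numpy(res.x).view(k, d)
```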
Okay, I did try it some time ago with the torch L-BFGS but I must have done something wrong if it indeed works here. I'll explore it!
Not a super thorough review; my main comment is about using a new acquisition function class to implement something generic like epsilon-greedy.
Can you also make sure to run `black` on this?
```python
class qExploitLowerBoundJointEntropySearch(qLowerBoundJointEntropySearch):
    r"""The acquisition function for the Joint Entropy Search, where the batches
    `q > 1` are supported through the lower bound formulation.
```
I don't really like the pattern of defining a whole new acquisition function class for the sole purpose of doing some epsilon-greedy optimization. It seems like this pattern would immediately double the number of acquisition functions if we wanted to do this more generally. I think this sort of notion of how to deal with model misspecification should live at a higher level, and the acquisition functions themselves should not be aware of it. For that reason I would prefer we don't add this `qExploitLowerBoundJointEntropySearch` acquisition function.

Is the main reason you're defining this here so that it's easily usable in Ax via the acquisition function constructor? On the botorch side, this kind of epsilon-greedy strategy can easily be implemented by slightly modifying the optimization loop. If this is challenging to do on the Ax side, we should make changes there to make that easier (which we're already thinking about - cc @saitcakmak).
Yes, it's for Ax, and I agree. Happy to remove. A middle ground otherwise would be to have it as an input constructor & generic acquisition function wrapper, `GreedyAcquisitionFunction(ActualAcquisitionFunction)`.
I guess that's a separate PR nonetheless.
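To illustrate the higher-level alternative discussed above, here is a minimal sketch of epsilon-greedy handled in the candidate-generation loop rather than inside an acquisition class. This is a sketch under assumed usage, not the pattern this PR adopted; `PosteriorMean` stands in as the exploit step and `epsilon_greedy_candidate` is a hypothetical helper:

```python
import torch
from botorch.acquisition import PosteriorMean
from botorch.optim import optimize_acqf


def epsilon_greedy_candidate(model, acqf, bounds, epsilon=0.1):
    # With probability epsilon, exploit the model by maximizing the posterior
    # mean; otherwise optimize the actual (information-theoretic) acqf.
    if torch.rand(1).item() < epsilon:
        acqf = PosteriorMean(model)
    candidate, _ = optimize_acqf(
        acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=512
    )
    return candidate
```

The acquisition function itself stays unaware of the misspecification handling; only the outer loop changes.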
```diff
-class qLowerBoundJointEntropySearch(qLowerBoundMultiObjectiveJointEntropySearch):
+class qLowerBoundJointEntropySearch(AcquisitionFunction, MCSamplerMixin):
```
nice, glad we're not making the single-objective acqf a subclass of the multi-objective one any more.
Though it seems a bit odd that `qLowerBoundJointEntropySearch` is really only the right name if the `estimation_type` is indeed `LB` and not `MC`?
Agreed, changing.
```python
        condition_noiseless:
        posterior_transform:
```
TODO
```python
# inputs come as num_optima_per_model x num_models x d
# but we want it four-dimensional in the fully Bayesian case,
# and three-dimensional otherwise (optionally remove, then re-add)
self.optimal_inputs = optimal_inputs.squeeze(1).unsqueeze(-2)
self.optimal_outputs = optimal_outputs.squeeze(1).unsqueeze(-2)
```
Not sure I fully understand what's going on here, can you explain this a bit more? In the non-fully-Bayesian setting, I assume `num_models` is not a dimension, right? In which case the shape would be `num_optima_per_model x 1 x d` after this logic, but for the fully Bayesian model it would be `num_optima_per_model x num_models x 1 x d`?
Ah I see this is mentioned below. Would be good to have the comment here though.
Sure! Some squeezes/unsqueezes cancelled out, so this should not be as confusing anymore.
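For reference, a tiny shape-bookkeeping sketch of the reshaping discussed above (the sizes are hypothetical):

```python
import torch

num_optima, num_models, d = 8, 16, 3

# Non-fully-Bayesian: optima arrive as num_optima x 1 x d; the singleton
# model dim is squeezed out and a q-batch dim re-added, giving 3-D.
x = torch.randn(num_optima, 1, d)
print(x.squeeze(1).unsqueeze(-2).shape)  # torch.Size([8, 1, 3])

# Fully Bayesian: num_optima x num_models x d; squeeze(1) is a no-op since
# num_models > 1, so the result is 4-D.
x_fb = torch.randn(num_optima, num_models, d)
print(x_fb.squeeze(1).unsqueeze(-2).shape)  # torch.Size([8, 16, 1, 3])
```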
botorch/utils/sampling.py
Outdated
```python
per_sample_outputs = path_func(X_top_candidates).reshape(
    num_optima * batch_size, num_restarts)
sample_argmax = torch.max(per_sample_outputs, dim=-1, keepdims=True)[1].flatten()
```
```diff
-sample_argmax = torch.max(per_sample_outputs, dim=-1, keepdims=True)[1].flatten()
+sample_argmax = torch.max(per_sample_outputs, dim=-1, keepdims=True).indices.flatten()
```
botorch/utils/sampling.py
Outdated
```python
        which acts as extra initial guesses for the optimization routine.
    raw_samples: The number of samples with which to query the samples initially.
    num_restarts: The number of gradient descent steps to use on each of the best
        found initial points.
```
Suggested change (trailing whitespace removed):

```diff
-        found initial points. 
+        found initial points.
```
botorch/utils/sampling.py
Outdated
```python
argtop_candidates = candidate_queries.argsort(dim=-1, descending=True)[
    ..., 0:num_restarts]
```
why not use `topk`?
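For illustration, the two selections agree; a small self-contained check (shapes are hypothetical):

```python
import torch

candidate_queries = torch.randn(4, 100)  # hypothetical (batch, raw_samples) scores
num_restarts = 10

# argsort-then-slice and topk return the same indices; topk avoids a full sort.
via_argsort = candidate_queries.argsort(dim=-1, descending=True)[..., :num_restarts]
via_topk = candidate_queries.topk(num_restarts, dim=-1).indices
assert torch.equal(via_argsort, via_topk)
```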
botorch/utils/sampling.py
Outdated
```python
    ..., 0:num_restarts]

# These are used as masks when retrieving the argmaxes
row_indexer = torch.arange(num_optima * batch_size).to(torch.long)
```
should be a no-op:

```diff
-row_indexer = torch.arange(num_optima * batch_size).to(torch.long)
+row_indexer = torch.arange(num_optima * batch_size)
```

(`torch.arange` already returns an int64 tensor for integer inputs, so the cast does nothing.)
@saitcakmak @Balandat There! I hope that it is all in order now. The support for fully Bayesian treatment is there as well, but it doesn't work as of now since the GPyTorch-side support isn't in place yet. I'll happily keep an eye out to see that everything works as intended once the GPyTorch aspect of it lands.
@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Codecov Report
```
@@           Coverage Diff            @@
##              main    #1738   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files          170      170
  Lines        14783    14912    +129
=========================================
+ Hits         14783    14912    +129
```
Thanks! Mostly a number of nits. I'll take another look at the gpytorch PR for allowing conditioning on samples also in the batched-model (fully Bayesian) case in the next few days.
```python
cdf_mvs = normal.cdf(normalized_mvs).clamp_min(CLAMP_LB)
cdf_rescaled_mvs = normal.cdf(mvs_rescaled_mc).clamp_min(CLAMP_LB)
mv_ratio = cdf_rescaled_mvs / cdf_mvs

log_term = torch.log(mv_ratio) + conditional_logprobs
```
Hmm, it would be good to just work with log-CDFs from the get-go here for numerical stability, but unfortunately I don't think torch has great support for those.
Looks like my past self noticed that as well: pytorch/pytorch#52973

For the time being, does your code just throw an error? If so, maybe you can add an informative error message.
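As an aside, newer PyTorch versions do expose a stable standard-normal log-CDF via `torch.special.log_ndtr` (assuming it is available in your torch build); a small comparison:

```python
import torch

z = torch.tensor([-40.0, -8.0, 0.0, 8.0], dtype=torch.double)

# Naive log(cdf) underflows to -inf in the far lower tail...
naive = torch.distributions.Normal(0.0, 1.0).cdf(z).log()

# ...while the dedicated log-CDF stays finite everywhere.
stable = torch.special.log_ndtr(z)

print(naive)   # first entry is -inf
print(stable)  # first entry is finite (around -804.6)
```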
That makes sense, I added the error as well. I realize that I forgot to mention a specific technical detail previously: for each of M models, we'd want to condition on an additional N optima (each of which yields a new GP). As such, the (entire) conditional model would have M x N outputs. This works fine for a regular model.

The expected behavior in the FB case would then be to have an M x N batch of conditioned models.

Happy to contribute any which way to the FB issue.
After benchmarking a bit more, the method for obtaining posterior samples (which now uses

Personally, I'd like to address this issue, either in this PR or in a subsequent one. I think the best action for now is to just go with
@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi, thanks for the pull request! A couple things:
@esantorella Thanks! I want to think that it's fixed now, except for botorch/botorch/utils/sampling.py, line 880 in 0fb26d2: importing SamplePath causes a circular import. As such, SamplePath remains undefined, which flake8 doesn't like.
> importing SamplePath causes a circular import. As such, SamplePath remains undefined which flake8 doesn't like.

You can get around this with the `TYPE_CHECKING` var from the `typing` module; made comments in the code.
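For reference, a minimal sketch of the `TYPE_CHECKING` pattern; the import path for `SamplePath` is an assumption, purely illustrative:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static analyzers (flake8, mypy), never at runtime,
    # so the circular import is avoided while the name stays defined.
    from botorch.sampling.pathwise import SamplePath  # hypothetical path


def optimize_posterior_samples(paths: SamplePath) -> None:
    # With `from __future__ import annotations`, the annotation above is a
    # string at runtime and is only resolved by type checkers.
    ...
```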
variable names consistent with MES/GIBBON. Input constructors for JES & JES-exploit added, as well as optimizer of pathwise posterior samples & utility method for retrieving optima.
Co-authored-by: Max Balandat <Balandat@users.noreply.github.com>
@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
in the example to better highlight differences

Changed .forward() to call in optimize_posterior_samples
- Acqf. constructor
- Sample optimization
- Re-writing of single-obj JES
- Tests
- Notebook fitting and changed examples for illustration's sake
optimize_posterior_samples - was accidentally left out of previous commit.
@hvarfner looks like there are a couple of outstanding docs / lint issues. Do you think you can clean that up soon so that we can merge this in?

These look like merge issues. They were fixed and got un-fixed after merging; the good news is there is a version that was working.

@Balandat @esantorella Sorry, I thought I had it figured out last time around. I don't quite understand how to address the last little bit, so I'd appreciate any pointers you may have. I'm passing tests locally at this point (although
Left some comments that should fix the outstanding issues. There are also a couple of lines that aren't covered by tests, please make sure to get those as well. Thanks!
Looks like we're all good now. @esantorella could you merge this in, please?

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@esantorella merged this pull request in 04e235d.

Thanks a lot for your contribution, @hvarfner!

@Balandat @esantorella My pleasure, thanks for your patience! =)
variable names consistent with MES/GIBBON. Input constructors for JES & JES-exploit added, as well as optimizer of pathwise posterior samples & utility method for retrieving optima.
Motivation
Refactoring of single-objective JES - larger codebase, but slightly faster, better support and more intuitive (?)
There is a substantial amount of multi-objective-specific computation in single-objective JES. At its core, it's relatively similar to MES/GIBBON, and this suggested change tries to reflect that. It runs faster than before (due to the removal of the MO-specific machinery) and is (arguably) a bit more intuitive. Moreover, there is now the option of running two different conditioning variants (noisy or noiseless optimal values); the performance difference between the two, however, is minor.
Also added the exploit-variant of JES for added robustness in cases of model misspecification.
Optimization of samples from the posterior
Added batch optimization of pathwise samples, plus a utility method in acquisition/utils. Hyperparameters are set so as to (anecdotally/empirically) avoid memory or runtime issues on a laptop while staying close to peak performance.
Input constructors for JES
Support for a fully Bayesian variant of JES - this didn't quite work previously (contingent on condition_on_observations working with fully Bayesian models - see incoming PR!).
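For orientation, a hypothetical end-to-end sketch of how the pieces described above fit together; module paths and signatures follow current BoTorch and are assumptions that may not match this PR's exact state:

```python
import torch
from botorch.acquisition.joint_entropy_search import qJointEntropySearch
from botorch.acquisition.utils import get_optimal_samples
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

train_X = torch.rand(20, 2, dtype=torch.double)
train_Y = (train_X ** 2).sum(dim=-1, keepdim=True)
model = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)
# Sample optima of pathwise posterior draws (the utility discussed above).
optimal_inputs, optimal_outputs = get_optimal_samples(
    model, bounds=bounds, num_optima=16
)

acqf = qJointEntropySearch(
    model=model,
    optimal_inputs=optimal_inputs,
    optimal_outputs=optimal_outputs,
    estimation_type="LB",  # the lower-bound estimator discussed in this thread
)
candidate, _ = optimize_acqf(
    acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=512
)
```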
Have you read the Contributing Guidelines on pull requests?
Yes!
Test Plan
Have tested the related notebooks and Ax interfaces & performance of the two different JES methods.
Related PRs
Yes, proposed fix to FBGPs. Coming soon.