
Add entropy search acquisition functions #1458

Closed
wants to merge 22 commits

Conversation

benmltu
Contributor

@benmltu benmltu commented Oct 19, 2022

Hi,

I have provided some implementations of the entropy search acquisition functions used in a recent paper (https://arxiv.org/abs/2210.02905). This PR includes the acquisition functions and the necessary utilities. I have also included a notebook that describes how to use these methods.

I was not sure where the best place to add these acquisition functions was, so I put them all in the multi-objective folder. Nevertheless, they should work for single-objective optimization as well.

Thanks,
Ben

@facebook-github-bot added the CLA Signed label on Oct 19, 2022
@sdaulton
Contributor

Thanks for putting this up. I will review shortly. Note that BoTorch requires 100% test coverage, so you will need to write unit tests before this PR can be merged in.

@codecov

codecov bot commented Oct 19, 2022

Codecov Report

Merging #1458 (707dacd) into main (706e2a1) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main     #1458    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          149       153     +4     
  Lines        12936     13567   +631     
==========================================
+ Hits         12936     13567   +631     
Impacted Files Coverage Δ
botorch/acquisition/joint_entropy_search.py 100.00% <100.00%> (ø)
...cquisition/multi_objective/joint_entropy_search.py 100.00% <100.00%> (ø)
...sition/multi_objective/max_value_entropy_search.py 100.00% <100.00%> (ø)
...ition/multi_objective/predictive_entropy_search.py 100.00% <100.00%> (ø)
botorch/acquisition/multi_objective/utils.py 100.00% <100.00%> (ø)
botorch/acquisition/predictive_entropy_search.py 100.00% <100.00%> (ø)
botorch/generation/gen.py 100.00% <100.00%> (ø)


@Balandat Balandat requested a review from sdaulton October 19, 2022 15:37
@sdaulton
Contributor

Regarding testing, my comment was not intended as criticism. I am super excited about your PR for these AFs. The code looks quite clean. I just wanted to flag the need for unit tests since that will likely be a fair bit of work. There are many examples of tests for other AFs, e.g. https://github.com/pytorch/botorch/blob/main/test/acquisition/multi_objective/test_monte_carlo.py. Let me know if you have any questions on testing. Thanks again for the PR!

Contributor

@sdaulton sdaulton left a comment

Thanks for putting this up. This is a partial initial pass (I didn't make it all the way through the PR).

One high level comment is that we should make sure that this is dtype/device agnostic, so that we can run it on a GPU--the main blocker at the moment is initializing tensors (e.g. torch.ones()) without specifying dtype and device. Typically these can be fetched from some other tensor (e.g. X).
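For illustration, here is a minimal sketch of the pattern being suggested (the helper name is made up, not from the PR):

import math

import torch


def _log_two_pi(X: torch.Tensor) -> torch.Tensor:
    # Fetch dtype and device from an existing tensor (here `X`) instead of
    # relying on torch defaults, so the constant lives on the same device
    # and has the same precision as the inputs.
    tkwargs = {"dtype": X.dtype, "device": X.device}
    return torch.log(2 * math.pi * torch.ones(1, **tkwargs))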

Another comment is that much of the code is shared between JES and MES (for example in `__init__`). It would be good to consolidate this into one common base class.

Also, it would be nice to highlight in the docstrings which functions are differentiable and which are not, for ease of use.

hypercell_bounds: Tensor,
maximize: bool = True,
X_pending: Optional[Tensor] = None,
estimation_type: str = "Lower bound",
Contributor

A python enum might be nice for listing the options and removing the risk of string typos.
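For example, a sketch of what that could look like (the enum name and members just mirror the string options listed in the docstring and are not part of the PR):

from enum import Enum


class EstimationType(str, Enum):
    # Subclassing `str` keeps comparisons against the raw strings working,
    # which eases a gradual migration away from bare string arguments.
    NOISELESS = "Noiseless"
    NOISELESS_LOWER_BOUND = "Noiseless lower bound"
    LOWER_BOUND = "Lower bound"
    MONTE_CARLO = "Monte Carlo"


# e.g. estimation_type: EstimationType = EstimationType.LOWER_BOUND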

maximize: bool = True,
X_pending: Optional[Tensor] = None,
estimation_type: str = "Lower bound",
only_diagonal: Optional[bool] = False,
Contributor

Suggested change
only_diagonal: Optional[bool] = False,
only_diagonal: bool = False,

Comment on lines 213 to 214
num_samples: Optional[int] = 64,
num_constraints: Optional[int] = 0,
Contributor

Suggested change
num_samples: Optional[int] = 64,
num_constraints: Optional[int] = 0,
num_samples: int = 64,
num_constraints: int = 0,

dominated space union the infeasible space.
maximize: If True, consider the problem a maximization problem.
X_pending: A `m x d`-dim Tensor of `m` design points that have been
submitted for function evaluation but have not yet been evaluated.
Contributor

Suggested change
submitted for function evaluation but have not yet been evaluated.
submitted for function evaluation, but have not yet been evaluated.

estimation_type: A string to determine which entropy estimate is
computed: "Noiseless", "Noiseless lower bound", "Lower bound" or
"Monte Carlo".
only_diagonal: If true we only compute the diagonal elements of the
Contributor

Suggested change
only_diagonal: If true we only compute the diagonal elements of the
only_diagonal: If true, we only compute the diagonal elements of the

verbose: Optional[bool] = False,
**kwargs: Any,
) -> None:
r"""Predictive entropy search acquisition function.
Contributor

This code is pretty dense. Thanks for documenting it so well. Easy to see why there are so few implementations of this...

Returns
A `x_shape`-dim Tensor.
"""
normal = Normal(torch.zeros_like(x), torch.ones_like(x))
Contributor

dtype/device



def _initialize_predictive_matrices(
X: Tensor,
Contributor

extra indent

Comment on lines 665 to 667
observation_noise: Optional[bool] = True,
jitter: Optional[float] = 1e-4,
natural: Optional[bool] = True,
Contributor

remove optional

observation_noise: If true the posterior is computed with observation noise.
jitter: The jitter added to the covariance matrix.
natural: If true we compute the natural statistics as well.
Return:
Contributor

Probably need to follow this format for the autodocs:

Returns:
    A two-element tuple containing

    - a `(num_restarts) x q x d`-dim tensor of generated candidates.
    - a tensor of associated acquisition values. If `sequential=False`,
      this is a `(num_restarts)`-dim tensor of joint acquisition values
      (with explicit restart dimension if `return_best_only=False`). If
      `sequential=True`, this is a `q`-dim tensor of expected acquisition
      values conditional on having observed candidates `0,1,...,i-1`.

@eytan
Contributor

eytan commented Oct 22, 2022

@benmltu , thanks so much for the PR—the code here is very high quality, as are these ES implementations. This will be super valuable to the community. I am also looking forward to seeing if there are ways to further accelerate JES through GPU-based computation once some of these type issues have been resolved.

@hvarfner
Contributor

I figured I'd just chime in here. I'm also extremely grateful for your work on this, thanks a lot.

I've gone through the code, tried to run it, and tried to implement my own stuff on top of it. As far as plugging in different versions of JES goes, our differences are very minor and can easily be accounted for within the current structure (probably without even needing two different classes).

Otherwise, I only have a couple of comments:

  • the sampling_noise argument is unused.
  • pymoo 0.6.0 does not seem to support the Pareto sampling of optima (0.5.0 was fine)
  • Single-objective MES has its sampling of maxima inside the acquisition function, while these implementations have the optima as arguments to be passed to the acquisition function. I personally prefer your approach, since one may want to experiment with sampling procedures quite a bit.

Once again, thanks a lot!

@benmltu
Contributor Author

benmltu commented Nov 4, 2022

Thanks for all the comments, I've updated the code to address many of the issues. Main changes:

  • Unit tests are now included and everything should be dtype/device agnostic.
  • JES and MES have been refactored to share a common base class.
  • The acquisition functions for the multi-objective and single-objective PES and JES have been separated. I didn't do this for MES because there already exist many different single-objective implementations. From inspection, the existing implementations do more or less the same thing, with some minor differences in how the max-values are set and how batches are handled.
  • The code has been updated to pymoo 0.6.0.
  • The descriptions of the methods have been updated---there were some artefacts from earlier versions that were noted in the comments.
  • Constraint handling for PES has been removed. The EP updates relating to the inequality constraints appear to be unstable in some settings, and it is hard to see how to fix this without going through all the equations again.

Contributor

@Balandat Balandat left a comment

Excellent, thanks so much for the contribution and the many updates in response to the review. This is really great to see.

One high-level comment is that, in general, we're trying to keep the dependencies of botorch as lightweight as possible, so it would be great if we could make pymoo an optional dependency. One way to do that would be something like

try: 
    import pymoo
    
    PYMOO_INSTALLED = True
except ImportError:
    PYMOO_INSTALLED = False

at the beginning of the sample_pareto.py module, move the imports from pymoo into the functions inside the module, and, if PYMOO_INSTALLED is False, raise an informative exception prompting the user to install pymoo if they would like to use the newly added entropy search acquisition functions.
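As a minimal sketch of that pattern (building on the PYMOO_INSTALLED flag above; the function name, arguments, and message are illustrative rather than the PR's actual API):

def sample_pareto_optimal_points(model, bounds, num_samples):
    # Guard the optional dependency: fail with an actionable message rather
    # than an ImportError from deep inside the module import.
    if not PYMOO_INSTALLED:
        raise ImportError(
            "pymoo is required for the entropy search acquisition functions. "
            "Please install pymoo to use them."
        )
    # pymoo imports are deferred to inside the function so that importing
    # botorch itself does not require pymoo.
    ...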

@facebook-github-bot
Contributor

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@esantorella
Member

@benmltu FYI, I merged the "main" branch into this one, so please do a "git pull" before doing any local development. I also imported it into an internal Meta repo to see if internal tools catch any issues.

Contributor

@sdaulton sdaulton left a comment

Thanks for the updates and adding tests!

@@ -161,3 +167,121 @@ def prune_inferior_points_multi_objective(
effective_n_w = obj_vals.shape[-2] // X.shape[-2]
idcs = (idcs / effective_n_w).long().unique()
return X[idcs]


def compute_sample_box_decomposition(
Contributor

This function could be greatly simplified by using a BoxDecompositionList, which already pads the box decompositions to make sure all box decompositions in the list have the same number of boxes:

# pad the decomposition with empty cells so that all
# decompositions have the same number of cells
Let's use BoxDecompositionList to deduplicate that logic. If this helper is still needed, let's make sure to highlight in the docstring that it is used for entropy-based AFs.
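A rough sketch of the suggestion, assuming BoxDecompositionList.get_hypercell_bounds performs the padding referenced above (the choice of partitioning class and the variable names are illustrative, not taken from the PR):

from botorch.utils.multi_objective.box_decompositions.box_decomposition_list import (
    BoxDecompositionList,
)
from botorch.utils.multi_objective.box_decompositions.dominated import (
    DominatedPartitioning,
)

# One box decomposition per sampled Pareto front.
decompositions = [
    DominatedPartitioning(ref_point=ref_point, Y=front) for front in pareto_fronts
]
# BoxDecompositionList pads each decomposition to the same number of cells,
# so the manual padding in `compute_sample_box_decomposition` could be dropped.
cell_bounds = BoxDecompositionList(*decompositions).get_hypercell_bounds()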

Comment on lines 44 to 46
num_generations: int = 100,
pop_size: int = 100,
num_offsprings: int = 10,
Contributor

Hmm, are these good defaults? 10 offspring seems quite small. Why not use the pymoo defaults?

) -> Tuple[Tensor, Tensor]:
r"""Computes the Pareto optimal set and front from samples of the GP.

(i) Samples are generated using random Fourier features.
Contributor

This need not be RFFs. It could also use decoupled sampling. Let's add a TODO to generalize the GP sampling bit

M = model.num_outputs
d = bounds.shape[-1]

if M == 1:
Contributor

It seems that this function should simply raise an exception if M==1.

Comment on lines 219 to 223
"Only found "
+ str(num_pareto_generated)
+ " Pareto efficient points instead of "
+ str(num_pareto_points)
+ "."
Contributor

let's use f-strings
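For example, the same message as an f-string (variable names taken from the snippet above):

f"Only found {num_pareto_generated} Pareto efficient points "
f"instead of {num_pareto_points}."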

up = hypercell_bounds[1].unsqueeze(0)

# Compute the sample hypervolume improvement.
hvi = (
Contributor

It seems like this would be handy to have as an instance method of BoxDecompositions (e.g. a method to compute the incremental HVI of a new point --- or more generally (batch_shape x) 1 x M new points)

# LICENSE file in the root directory of this source tree.

r"""
Some methods for sampling the Pareto optimal points. This relies on the pymoo
Contributor

It seems like this should live in botorch/utils/multi_objective/sample_pareto.py

optimal_outputs: A `num_samples x 1`-dim Tensor containing the optimal
set of objectives of dimension `1`.
maximize: If true, we consider a maximization problem.
hypercell_bounds: A `num_samples x 2 x J x 1`-dim Tensor containing the
Contributor

Hmm, calling these hypercell bounds is odd, since they are not hyperrectangles, but rather lines... I guess it makes more sense in the constrained case.

from torch import Tensor


class qLowerBoundJointEntropySearch(qLowerBoundMultiObjectiveJointEntropySearch):
Contributor

It seems kind of odd to have the single-objective version inherit from the multi-objective version (just an observation, not a criticism).

Contributor Author

I agree; I wasn't too keen on this, but it was a bit involved doing it the other way round, especially for PES.

# "Fantasy observations can only be added after making predictions with a
# model so that all test independent caches exist."

if condition_on_samples:
Contributor

Is this specific to qLowerBoundMultiObjectiveJointEntropySearch? If so, let's move it into qLowerBoundMultiObjectiveJointEntropySearch.

@benmltu
Contributor Author

benmltu commented Nov 14, 2022

I've added a small update now which removes the pymoo optimizer and replaces it with a basic random search optimizer.

@facebook-github-bot
Contributor

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@eytan
Contributor

eytan commented Nov 14, 2022

I've added a small update now which removes the pymoo optimizer and replaces it with a basic random search optimizer.

How does this affect the performance of the algorithm (both in terms of anytime optimization performance and runtime)?

@benmltu
Contributor Author

benmltu commented Nov 15, 2022

I've added a small update now which removes the pymoo optimizer and replaces it with a basic random search optimizer.

How does this affect the performance of the algorithm (both in terms of anytime optimization performance and runtime)?

The performance of these entropy search methods is definitely sensitive to how the samples are optimized---“better” samples lead to better results. From some tests I ran this morning, I see a slight drop in performance from using the random search, but a definite improvement in run time. The cost of pymoo on my machine is around an additional 2 seconds per iteration.
[Plots: JES lower-bound performance, MES lower-bound performance, and wall times]

fs = old_factor.shape

df = damping_factor
for i in range(len(fs[len(bs) :])):
Member

Suggested change
for i in range(len(fs[len(bs) :])):
for _ in range(len(fs[len(bs) :])):

Addressing a complaint from an internal linter

] = random_search_optimizer,
num_rff_features: int = 512,
maximize: bool = True,
optimizer_kwargs: dict = None,
Member

Suggested change
optimizer_kwargs: dict = None,
optimizer_kwargs: Optional[Dict[str, Any]] = None,

addressing a complaint from a Meta internal linter

Comment on lines 355 to 373
posterior_statistics = dict()
# Compute the prior entropy term depending on `X`.
initial_posterior_plus_noise = self.initial_model.posterior(
    X, observation_noise=True
)

# Additional constant term.
add_term = (
    0.5
    * self.model.num_outputs
    * (1 + torch.log(2 * pi * torch.ones(1, **tkwargs)))
)
# The variance initially has shape `batch_shape x (q*M) x (q*M)`
# prior_entropy has shape `batch_shape`.
initial_entropy = add_term + 0.5 * torch.logdet(
    initial_posterior_plus_noise.mvn.covariance_matrix
)

posterior_statistics["initial_entropy"] = initial_entropy
Member

Suggested change
posterior_statistics = dict()
# Compute the prior entropy term depending on `X`.
initial_posterior_plus_noise = self.initial_model.posterior(
    X, observation_noise=True
)
# Additional constant term.
add_term = (
    0.5
    * self.model.num_outputs
    * (1 + torch.log(2 * pi * torch.ones(1, **tkwargs)))
)
# The variance initially has shape `batch_shape x (q*M) x (q*M)`
# prior_entropy has shape `batch_shape`.
initial_entropy = add_term + 0.5 * torch.logdet(
    initial_posterior_plus_noise.mvn.covariance_matrix
)
posterior_statistics["initial_entropy"] = initial_entropy
# Compute the prior entropy term depending on `X`.
initial_posterior_plus_noise = self.initial_model.posterior(
    X, observation_noise=True
)
# Additional constant term.
add_term = (
    0.5
    * self.model.num_outputs
    * (1 + torch.log(2 * pi * torch.ones(1, **tkwargs)))
)
# The variance initially has shape `batch_shape x (q*M) x (q*M)`
# prior_entropy has shape `batch_shape`.
initial_entropy = add_term + 0.5 * torch.logdet(
    initial_posterior_plus_noise.mvn.covariance_matrix
)
posterior_statistics = {"initial_entropy": initial_entropy}

A Meta-internal linter complains about the dict() call, and also let's define posterior_statistics closer to where it's used

tkwargs = {"dtype": X.dtype, "device": X.device}
CLAMP_LB = torch.finfo(tkwargs["dtype"]).eps

posterior_statistics = dict()
Member

Suggested change
posterior_statistics = dict()

see below

initial_entropy = add_term + 0.5 * torch.logdet(
    posterior_plus_noise.mvn.covariance_matrix
)
posterior_statistics["initial_entropy"] = initial_entropy
Member

Suggested change
posterior_statistics["initial_entropy"] = initial_entropy
posterior_statistics = {"initial_entropy": initial_entropy}

A Meta-internal linter complains about the dict() call, and also let's define posterior_statistics closer to where it's used

"For single-objective optimization `num_points` should be 1."
)
if optimizer_kwargs is None:
optimizer_kwargs = dict()
Member

Suggested change
optimizer_kwargs = dict()
optimizer_kwargs = {}

A Meta-internal linter prefers this (C408 from flake8-comprehensions, if anyone is curious https://pypi.org/project/flake8-comprehensions/)

@Balandat
Contributor

The performance of these entropy search methods is definitely sensitive to how the samples are optimized---“better” samples lead to better results. From some tests I ran this morning, I see a slight drop in performance from using the random search, but a definite improvement in run time. The cost of pymoo on my machine is around an additional 2 seconds per iteration.

Thanks a lot for adding these benchmarks! I think it's fine to go ahead with the random search solution for now (but add a TODO in code to improve this) so we can get this merged in soon. Then we can figure out the pymoo dependency (or other approaches) in a follow-up.

There is also this tutorial error in the CI:

RuntimeError: The following tutorial files are present, but are not linked on the website: information_theoretic_acquisition_functions.ipynb.

You can address this by adding the tutorial to this file: https://github.com/pytorch/botorch/blob/main/website/tutorials.json

@Balandat
Contributor

Looks like the recent large-scale refactors on our end caused some merge conflicts - once those are resolved, this should be able to go in!

@sdaulton
Contributor

sdaulton commented Nov 22, 2022

I ran the tutorial and noticed the batch AF values are often negative (printed below with no modification to the tutorial). What causes this? Is this because the approximation of the batch information gain is a loose lower bound?

The 4 AF values printed below for MES and JES are the joint AF value of candidates 1,..., j for j=1,..., q=4 (obtained via sequential greedy optimization)

[Screenshot: batch acquisition values printed by the tutorial]

@benmltu
Contributor Author

benmltu commented Nov 23, 2022

I ran the tutorial and noticed the batch AF values are often negative (printed below with no modification to the tutorial). What causes this? Is this because the approximation of the batch information gain is a loose lower bound?

The 4 AF values printed below for MES and JES are the joint AF value of candidates 1,..., j for j=1,..., q=4 (obtained via sequential greedy optimization)

[Screenshot: batch acquisition values printed by the tutorial]

That's right: the upper bound H[p(A_1,...,A_q)] <= H[p(A_1)] + ... + H[p(A_q)] can be loose because it essentially ignores the correlation between the A_i's. As q increases, the upper bound becomes larger, so the overall lower-bound acquisition value can become negative.
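Schematically (this just restates the comment above in symbols; the labels are illustrative shorthand, not the paper's exact expressions): subtracting the summed marginal conditional entropies, which upper-bound the joint conditional entropy, gives a lower bound on the true information gain, and that lower bound can dip below zero:

H\bigl[p(A_1,\dots,A_q)\bigr] \le \sum_{i=1}^{q} H\bigl[p(A_i)\bigr]
\quad\Longrightarrow\quad
\underbrace{\text{prior joint entropy} - \sum_{i=1}^{q} H\bigl[p(A_i)\bigr]}_{\text{computed acquisition value}}
\le
\underbrace{\text{prior joint entropy} - H\bigl[p(A_1,\dots,A_q)\bigr]}_{\text{true information gain}}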

@sdaulton
Contributor

sdaulton commented Nov 23, 2022

Interesting. It's kind of funky to 1) have the approximate joint utility of the optimal q-batch of points get worse (and, it appears, monotonically decrease) as q increases (as new points are sequentially added to the batch), and 2) have the optimal approximate information gain be negative.

Neither of these would be the case if the prior joint entropy used the same approximation. Is that something you considered?

It might be worth noting in the docstring that negative (and decreasing) values can occur due to the loose upper bound in the batch approx.

Contributor

@sdaulton sdaulton left a comment

Thanks again for adding these! @esantorella, let's re-import and make sure there are no remaining issues

@benmltu
Contributor Author

benmltu commented Nov 24, 2022

Neither of these would be the case if the prior joint entropy used the same approximation. Is that something you considered?

Yeah, this was something that we considered early on; the acquisition function reduces down to something like qJES(X) = JES(x_1) + JES(x_2) + ... + JES(x_q). If you do batch optimization (not sequential) on this, then the solution is typically concentrated at one point, i.e. X* = (x*, ..., x*). To improve on this, we tried sequential optimization where we also conditioned on the pending points. That worked, but it can be quite expensive. It turned out to be cheaper just to keep the initial joint entropy term in order to encourage diversity in the batch.

@facebook-github-bot
Contributor

@esantorella has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@esantorella
Member

This has been landed! Thanks so much @benmltu !

@Balandat
Contributor

Balandat commented Dec 2, 2022

This is amazing! Thank you so much for your excellent contribution!
