
kernels working on a given set of features #476

Open · wants to merge 39 commits into main

Conversation

e-dorigatti
Collaborator

No description provided.

e-dorigatti linked an issue Dec 4, 2024 that may be closed by this pull request
@e-dorigatti
Collaborator Author

e-dorigatti commented Dec 4, 2024

SingleTaskGPSurrogate(
    # inputs= ...,
    # outputs= ...,
    kernel=AdditiveKernel(
        kernels=[
            RBFKernel(
                ard=True,
                lengthscale_prior=HVARFNER_LENGTHSCALE_PRIOR(),
                features=["x1", "x2"],
            ),
            RBFKernel(
                ard=True,
                lengthscale_prior=HVARFNER_LENGTHSCALE_PRIOR(),
                features=["x3", "x4", "x5"],
            ),
        ]
    ),
)

TODOs:

  • tests
  • validator (make sure the features actually are in the inputs)
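
For the validator item, roughly the kind of check that is meant, as a hypothetical sketch (not the PR's code):

def validate_kernel_features(kernel_feature_keys: list[str], input_keys: list[str]) -> None:
    """Raise if a kernel references features that are not among the surrogate's inputs."""
    unknown = set(kernel_feature_keys) - set(input_keys)
    if unknown:
        raise ValueError(f"kernel references unknown features: {sorted(unknown)}")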

@e-dorigatti
Collaborator Author

e-dorigatti commented Dec 4, 2024

Example with mol features:

import pandas as pd

import bofire.strategies.api as strategies
from bofire.data_models.domain import api as domain_api
from bofire.data_models.features import api as features_api
from bofire.data_models.kernels import api as kernels_api
from bofire.data_models.molfeatures import api as molfeatures_api
from bofire.data_models.priors.api import HVARFNER_LENGTHSCALE_PRIOR
from bofire.data_models.strategies import api as strategies_api
from bofire.data_models.surrogates import api as surrogates_api


domain = domain_api.Domain(
    inputs=domain_api.Inputs(
        features=[
            features_api.ContinuousInput(key="x1", bounds=(-1, 1)),
            features_api.ContinuousInput(key="x2", bounds=(-1, 1)),
            features_api.CategoricalMolecularInput(
                key="mol", categories=["CO", "CCO", "CCCO"]
            ),
        ]
    ),
    outputs=domain_api.Outputs(features=[features_api.ContinuousOutput(key="f")]),
)


strategy = strategies.map(
    strategies_api.SoboStrategy(
        domain=domain,
        surrogate_specs=surrogates_api.BotorchSurrogates(
            surrogates=[
                surrogates_api.SingleTaskGPSurrogate(
                    inputs=domain.inputs,
                    outputs=domain.outputs,
                    input_preprocessing_specs={
                        "mol": molfeatures_api.Fingerprints(),
                    },
                    kernel=kernels_api.AdditiveKernel(
                        kernels=[
                            kernels_api.RBFKernel(
                                ard=True,
                                lengthscale_prior=HVARFNER_LENGTHSCALE_PRIOR(),
                                features=["x1", "x2"],
                            ),
                            kernels_api.TanimotoKernel(
                                features=["mol"],
                            ),
                        ]
                    ),
                )
            ]
        ),
    )
)


strategy.tell(
    experiments=pd.DataFrame(
        [
            {"x1": 0.2, "x2": 0.4, "mol": "CO", "f": 1.0},
            {"x1": 0.4, "x2": 0.2, "mol": "CCO", "f": 2.0},
            {"x1": 0.6, "x2": 0.6, "mol": "CCCO", "f": 3.0},
        ]
    )
)
candidates = strategy.ask(candidate_count=1)
print(candidates)

#     x1   x2   mol    f_pred      f_sd     f_des
# 0  1.0  1.0  CCCO  2.932646  0.945029  2.932646

@jduerholt
Contributor

@e-dorigatti: This looks really great and will make the GP section so much easier; we can then get rid of a lot of GP classes! Thank you very much!

@jduerholt
Contributor

jduerholt commented Dec 6, 2024

I think a specific challenge, but also a good opportunity to test, will be the MixedSingleTaskGPSurrogate. It is based on botorch's MixedSingleTaskGP, which assumes that every categorical feature is a single column in which the categories are encoded as different numbers; it then uses this column to compute a Hamming distance (https://github.com/experimental-design/bofire/blob/main/bofire/surrogates/mixed_single_task_gp.py). Since BoFire always assumes that categorical features enter as one-hot encoded, we used a OneHotToNumeric input transform there.

The question is how we deal with this. One option would be to code a new Hamming distance kernel that also works on one-hot encoded categoricals; another would be to start modelling the input transforms on a feature-specific level in the data model as well. The latter would also solve the PiecewiseLinearGP, but would add more complexity as well as more versatility.

The mapping of the features for the input transforms could be done in the same way as you are currently doing it for the kernels.

What do you think? I have the feeling that also exposing the input transforms would be the cleanest solution. What is difficult here is that the OneHotToNumeric transform changes the indices of the features, as it collapses every categorical feature from n_categories columns to 1. So we have some problems here ;) Maybe we should then rewrite the kernel and expose the transforms :D
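
For illustration, a minimal sketch of the "Hamming kernel acting directly on one-hots" idea, assuming gpytorch's Kernel base class (the class name is made up here, and this is not the code later added in this PR):

import torch
from gpytorch.kernels import Kernel


class OneHotHammingKernel(Kernel):
    """Sketch: exponential Hamming kernel acting directly on one-hot columns.

    Assumes all active columns belong to a single one-hot encoded categorical,
    so x1 @ x2.T is 1 when the categories match and 0 otherwise.
    """

    has_lengthscale = True

    def forward(self, x1, x2, diag=False, **params):
        sim = x1 @ x2.transpose(-2, -1)          # 1 if same category, else 0
        dist = 1.0 - sim                          # Hamming distance in {0, 1}
        k = torch.exp(-dist / self.lengthscale)   # exponential Hamming kernel
        if diag:
            return k.diagonal(dim1=-2, dim2=-1)
        return k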

@e-dorigatti
Collaborator Author

e-dorigatti commented Dec 6, 2024

Hi @jduerholt, thanks for the kind feedback and pointers :) This sounds like a very important use case to implement, and I will look into it in the next weeks. My first idea would be to go via the input_preprocessing_specs and mimic what Fingerprints is doing, i.e., the one-hot encoding in this case, probably by recycling some functions from CategoricalInput, and indeed to create a new kernel for the Hamming distance.

Going through an input transform sounds (a lot?) more involved; why do you think that would be preferable to the preprocessing specs? Maybe some kernel/feature combinations are handled more efficiently in botorch? For example, it would make a lot of sense if a Hamming kernel did not actually need to expand the one-hot encoding, but that I could also implement myself. Maybe I need to learn more about botorch first :)

@jduerholt
Contributor

Hi @e-dorigatti,

a few more notes/explanations from my side:

  • There are two types of transforms, in principle:
    1. Everything that we have in the input_preprocessing_specs is executed before the actual optimization/model fitting. For example, categorical features are transformed/expanded into one-hots, or molecular ones into fingerprints; the model is then fitted on this representation, and the optimizer on top of the acquisition function also works on this representation. Currently, we can provide there either one of the CategoricalEncodingEnums or one of the molfeatures classes. This is a bit inconsistent, and we have to tidy it up at some point. The molfeatures classes were added later, and we found there that simple enums were not enough, as the transforms need to be parameterized.
    2. Then we have the botorch input transforms, which are executed on the fly during GP fitting and optimization. The transform(s) are added to the GP, and the data is transformed before it actually enters the kernel. We use them for their flexibility: for example, we could have two models in an optimization, one equipped with a Normalize transform and the other with an InputStandardize transform (a minimal sketch follows after this list). One GP would then be fitted to the normalized data (between 0 and 1) and the other to the standardized data (zero mean, unit variance), and both could still be used in combination during optimization. Another example is the filtering of features: it could be that we have built two surrogates, but one feature is not present in one of them (perhaps because we know that it has no predictive power for the output of that model), and we still want to use both surrogates in the optimization. In this case, BoFire automatically combines both models in botorch's ModelList and adds a filter-features transform in front of the first one that filters out the features that should not enter that model (which could in principle also be done with the approach in this PR). Another example is the PiecewiseLinearGP: for the optimizer, the inputs to the model are the nodes of the piecewise linear function, but during optimization the nodes are expanded on the fly (in a differentiable manner) to the discretized piecewise linear function, on which the Wasserstein kernel is then applied. Currently, apart from choosing which scaler to use, botorch input transforms are not explicitly exposed in BoFire's API.
  • We decided at some point that categoricals always have to be encoded as one-hot for use in surrogates/optimization, as this allows both treating them like normal continuous features without special kernels and using them in a mixed GP via botorch's OneHotToNumeric transform. Of course, other design choices would also be possible here, each with its own drawbacks.
  • Regarding this PR, an 80% solution that would make the separate implementations of MixedSingleTaskGPSurrogate and MixedTanimotoGPSurrogate obsolete would be to have the HammingKernel able to act directly on the one-hot encoded values. This would mean moving the OneHotToNumeric transform into the kernel. Furthermore, the MultiTaskGP could then also benefit from this, as we would have the same flexibility there. Of course, we could also lift the restriction of always having categoricals as one-hots.
  • In a next step, one should then make the already exposed scaling transforms (Normalize, Standardize) configurable by giving the user the explicit option of which features these transforms act on.
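
A minimal sketch of the second kind of transform (assuming botorch's Normalize, InputStandardize, and ModelListGP; toy data, not BoFire code):

import torch
from botorch.models import ModelListGP, SingleTaskGP
from botorch.models.transforms.input import InputStandardize, Normalize

# Toy data: 8 points, 3 features, 2 outputs modelled by separate GPs.
train_X = torch.rand(8, 3, dtype=torch.double)
train_Y = torch.rand(8, 2, dtype=torch.double)

# One GP scales its inputs to [0, 1], the other standardizes them to zero mean
# and unit variance; both transforms are applied on the fly inside the model,
# so the optimizer keeps working on the raw representation.
gp_a = SingleTaskGP(train_X, train_Y[:, :1], input_transform=Normalize(d=3))
gp_b = SingleTaskGP(train_X, train_Y[:, 1:], input_transform=InputStandardize(d=3))

# Both models can still be combined for use in one optimization.
model = ModelListGP(gp_a, gp_b)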

In general, botorch input transforms are a great way to build up complex differentiable optimization pipelines; on the other hand, it is very challenging to come up with a general framework, parameterized by feature names, that can compose different input transforms on different features, including the bookkeeping of indices.

I hope this post helps ;)

@e-dorigatti
Collaborator Author

Thank you for the thorough explanation @jduerholt, this clears up a lot of doubts! So the key difference is that the optimizer of the acquisition function is working with the features that result after applying the input preprocessing specs, but before the input transform.

What are the reasons that led you to decide that categoricals have to be one-hot encoded? Since I don't know your motivations, it currently looks a bit silly: you use the preprocessing to do the one-hot encoding and then the OneHotToNumeric transform to undo it, as the Hamming kernel does not need one-hots in the first place. If we now add more code to deal with the fact that this specific transform (and only this one?) changes the input dimensionality, it would seem like over-complicating things even further (though it would not be that complicated in the end). Maybe, given that with this PR you can have kernels focusing on specific (subsets of) features, it is now unnecessary to force categoricals to be one-hot? Moreover, because molecules are in the end a special type of categorical, it would make sense imho to convert the molecular featurizers to transforms rather than preprocessing.

I am now wondering whether it is necessary to expose botorch's input transforms, or whether it would be okay to put these transformations within the kernels themselves, so that users would specify both the features a kernel acts on and how they should be transformed. I think this would be simpler to implement, and not less powerful than botorch's input transforms, unless I am missing something. It also makes sense because, in some way, kernels are the only reason we need input transforms in the first place, and different kernels could work on different representations of the same feature.
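
To make that concrete, a minimal sketch of what "transform inside the kernel" could mean, assuming gpytorch's Kernel base class (hypothetical names, not a proposal for the final API):

from gpytorch.kernels import Kernel


class TransformedKernel(Kernel):
    """Sketch: wraps a base kernel and applies a fixed transform to the inputs first."""

    def __init__(self, base_kernel: Kernel, transform, **kwargs):
        super().__init__(**kwargs)
        self.base_kernel = base_kernel
        self.transform = transform  # e.g. a one-hot -> ordinal collapse

    def forward(self, x1, x2, diag=False, **params):
        # transform both inputs, then delegate to the wrapped kernel
        return self.base_kernel.forward(self.transform(x1), self.transform(x2), diag=diag, **params)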

What do you think? In the end, I think that both options can be implemented relatively easily, but perhaps one is more disruptive than the other. I think you have a better perspective and ability to judge :)

@jduerholt
Contributor

Will provide you a thorough answer tomorrow ;)

@jduerholt
Contributor

jduerholt commented Dec 12, 2024

@e-dorigatti :

What are the reasons that led you to decide that categoricals have to be one-hot encoded? Since I don't know your motivations, it currently looks a bit silly: you use the preprocessing to do the one-hot encoding and then the OneHotToNumeric transform to undo it, as the Hamming kernel does not need one-hots in the first place.

The reason was that I wanted to have things too flexible:

  • I wanted to be able to combine a SingleTaskGPSurrogate, which needs to one-hot encode the categoricals, with a MixedSingleTaskGPSurrogate in the same optimization. For this, the same features have to enter both GPs. So I settled on this complicated solution using the OneHotToNumeric transform, as I saw that it was already available in botorch, and then I simply required that CategoricalInput features always have to be one-hot encoded.
  • But this kind of constraint is already relaxed for the following features, which all inherit from CategoricalInput:
    • CategoricalDescriptorInput: here, CategoricalEncodingEnum.DESCRIPTOR is also allowed.
    • CategoricalMolecularInput: here, the various molfeature transforms are also allowed.
    • TaskInput: if you want to use it within a MultiTaskGPSurrogate, it has to be ordinal encoded.

Moreover, because molecules are in the end a special type of categorical, it would make sense imho to convert the molecular featurizers to transforms rather than preprocessing.

You are right, they are special forms of categorical features. But I would not convert them to botorch input transforms, for the following reasons:

  • Botorch input transforms are super nice if you have to change something between the input representation for the optimizer and the representation for the GP, especially if you want to do it in a differentiable manner, as we do in the PiecewiseLinearGPSurrogate. This is not really the case for the molecular featurizers: the featurization is not differentiable, and we can easily do it upfront.
  • The current setup for the transforms is working, and I do not see a need to change it.

I am now wondering whether it is necessary to expose botorch's input transforms, or it would be okay to put these transformations within the kernels themselves, so that users would specify the features that the kernel is acting on as well as how they should be transformed. I think this would be simpler to implement, and not less powerful than botorch's input transforms unless I am missing something. It also makes sense because, in some way, kernels are the only reason why we need input transforms in the first place, and different kernels could work on different representations of the same feature.

Two reasons for keeping it:

  1. Practical reason: the transforms are deeply baked into botorch and also well tested. If we want to do it in the kernels, we have to implement it ourselves.
  2. Computational reason: the botorch/gpytorch GPs store the training data x*, which is then used at inference time to calculate the covariance matrix/kernel with respect to the test data x: k(x, x*). The trick is that they store the transformed training data, so you do not need to transform the training data at every inference step. All these mechanisms are already implemented and well tested, especially on the botorch side. For this reason I generally want to keep the transforms, although of course there could be exceptions for edge cases.

What I would do now:

  • Release the restriction that CategoricalInputs always have to be one-hot encoded and also allow ordinal encodings.
  • Fix get_categorical_combinations in BotorchStrategy so that it also works for categoricals encoded as ordinals.
  • Also implement the generation of the molecular combinations in get_categorical_combinations. Currently you can only run optimizations over molecular spaces if you do not have only continuous ones.
  • Drop the options in BotorchStrategy for categorical_method, descriptor_method, and discrete_method. Depending on the surrogate, these so far gave us the possibility to use continuous relaxations during the optimization, which of course can lead to wrong optima. So I would drop this support (it makes the code much easier), especially as we are getting the GA by @LukasHebing and @sleweke-bayer for mixed spaces.
  • Change the ScalerEnums into actual classes in which we can define which features/feature types the corresponding transforms should apply to.
  • Try to write the MixedSingleTaskGPSurrogate in plain BoFire as a SingleTaskGPSurrogate with the explicit kernels and check that we get the same results (a rough sketch follows below).
  • Remove the GP implementations that are then no longer necessary and write helper methods that help you generate the complex specs.
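
As a rough sketch of the second-to-last item (hypothetical; the kernel names and the features argument follow this PR's proposal, and the exact sum-plus-product kernel structure of botorch's MixedSingleTaskGP is omitted):

from bofire.data_models.kernels import api as kernels_api
from bofire.data_models.surrogates import api as surrogates_api

# "domain" is assumed to provide continuous inputs x1, x2 and a categorical
# input "cat" that is one-hot encoded via the input_preprocessing_specs.
surrogate = surrogates_api.SingleTaskGPSurrogate(
    inputs=domain.inputs,
    outputs=domain.outputs,
    kernel=kernels_api.AdditiveKernel(
        kernels=[
            # continuous part, as in the examples above
            kernels_api.RBFKernel(ard=True, features=["x1", "x2"]),
            # Hamming-style kernel acting directly on the one-hot columns of "cat"
            kernels_api.HammingDistanceKernel(features=["cat"]),
        ]
    ),
)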

Overall, this should simplify our codebase a lot and give us much more flexibility in general. Of course, there will also be edge cases. What do you think?

Of course, this should not all be done by you in this PR. So I would suggest that we try to get this PR (which does not break anything) in as fast as possible, agree on the changes from above, and then distribute the tasks. Fine for you?

@e-dorigatti
Collaborator Author

e-dorigatti commented Dec 19, 2024

Thank you for the clarifications @jduerholt. Following your suggestions I have implemented the Hamming kernel working on one-hot features, and verified that one can get the same results as MixedSingleTaskGP (or even better results with a different kernel). The kernel mapper decides, transparently to the user, whether to use this kernel or the one from botorch, depending on whether feature names are specified (SingleTaskGP) or not (MixedSingleTaskGP).
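
Roughly, that dispatch could look like the following (a hypothetical sketch, not the actual mapper code; OneHotHammingKernel refers to the sketch in an earlier comment, and features_to_idx is an assumed mapping from feature keys to their one-hot column indices):

from botorch.models.kernels.categorical import CategoricalKernel


def map_hamming_kernel(data_model, active_dims, features_to_idx):
    if data_model.features:
        # feature names given: act directly on the one-hot columns of those features
        dims = [i for key in data_model.features for i in features_to_idx[key]]
        return OneHotHammingKernel(active_dims=dims)
    # no feature names: botorch's kernel on ordinally encoded columns (MixedSingleTaskGP case)
    return CategoricalKernel(ard_num_dims=len(active_dims), active_dims=active_dims)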

@e-dorigatti
Collaborator Author

Apologies, I got lost debugging those tests that were failing on GitHub but not locally; then I noticed that other PRs are suffering from the same issue. According to the print statements I added, it seems like an issue with differentiation in the formulaic package?

>       assert np.allclose(B, J)
E       assert False
E        +  where False = <function allclose at 0x7f392a5de930>(array([[0., 1., 0., 0., 0., 2.],\n       [0., 0., 1., 0., 0., 1.],\n       [0., 0., 0., 1., 6., 0.]]), array([[[0, 1, 0, 0, 0, 2],\n        [0, 0, 1, 0, 0, 1],\n        [0, 0, 0, 1, 0, 6]]]))
E        +    where <function allclose at 0x7f392a5de930> = np.allclose

tests/bofire/strategies/doe/test_objective.py:52: AssertionError
----------------------------- Captured stdout call -----------------------------
model 1 + x1 + x2 + x3 + x3 ** 2 + x1:x2
diff x1 0 + 1 + 0 + 0 + 0 + x2
diff x2 0 + 0 + 1 + 0 + 0 + x1
diff x3 0 + 0 + 0 + 1 + 0 + (2*x3)

Where diff x3 should be 0 + 0 + 0 + 1 + (2*x3) + 0 (full output here)

Successfully merging this pull request may close these issues: kernels on feature subsets