
Questions on Usage (Output of Acquisition Function Values, Design Methods for Search Space) #333

tatsuya-takakuwa opened this issue Jan 6, 2024 · 7 comments


@tatsuya-takakuwa

Thank you as always.

Is there a way to obtain the evaluation value of the acquisition function, not just the mean and variance of the predictions returned by the ask function? I would like to use it as a clue for prioritizing candidates.

Also, if the search space is, for instance, the set of purchasable reagent molecules, and the features are descriptors generated from the reagents' molecular structures, then the combination of features per candidate is fixed. In such a case, how should I design the search space with BoFire?

@jduerholt
Contributor

You could take the candidates after their generation and feed them into the strategy.calc_acquisition method; it will return the actual acqf values.

def calc_acquisition(
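For example, assuming an already fitted strategy object (e.g. a SoboStrategy mapped via bofire.strategies.api), something along these lines should work:

candidates = strategy.ask(candidate_count=2)         # proposed candidates incl. predicted mean/sd
acqf_values = strategy.calc_acquisition(candidates)  # acquisition function value per candidate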

Regarding your second question: I am not sure that I completely understand it. If you have hand-designed descriptors, you can set up the search space using CategoricalDescriptorInput. If you want to use Mordred descriptors, Morgan fingerprints, or fragment descriptors and generate them on the fly, you can use the CategoricalMolecularInput feature and define within the SurrogateSpecs which featurizer you actually want to use. Note that we currently only support fully combinatorial search spaces in the case of CategoricalMolecularInput; mixed search spaces that combine it with continuous inputs are still work in progress. If you can provide a bit more detail about your problem, I can also set up a minimal working example for you.
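For illustration, here is already a minimal sketch of how both variants are defined (the categories, descriptor names, and values below are just made up):

from bofire.data_models.features.api import (
    CategoricalDescriptorInput,
    CategoricalMolecularInput,
)

# Hand-designed descriptors: every purchasable reagent is a category with a
# fixed descriptor vector.
reagent = CategoricalDescriptorInput(
    key="reagent",
    categories=["reagent_a", "reagent_b", "reagent_c"],
    descriptors=["molar_mass", "logp"],
    values=[[46.1, -0.2], [60.1, 0.1], [74.1, 0.7]],
)

# On-the-fly featurization: the categories are SMILES strings; which featurizer
# (e.g. Mordred descriptors or Morgan fingerprints) is applied is defined in the
# surrogate specs.
solvent = CategoricalMolecularInput(
    key="solvent",
    categories=["CO", "CCO", "CCCO"],
)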

Maybe also this tutorial notebook could be helpful for you: https://github.com/experimental-design/bofire/blob/main/tutorials/benchmarks/009-Bayesian_optimization_over_molecules.ipynb

cc: @simonsung06

@tatsuya-takakuwa
Author

@jduerholt
Thank you for your reply.
I was able to get the evaluation values of the acquisition function. Thank you very much.

Also, thank you for the information about CategoricalDescriptorInput and CategoricalMolecularInput.

I apologize for the additional questions, but I have two points of inquiry regarding the use of the above:

  1. Variable importance after training the surrogate model

    When checking variable importance while building a preliminary model, only the names of keys such as CategoricalDescriptorInput were listed, and the importance of each feature within them was not visible. Is there a way to check the importance of each feature?

  2. Interactions when two CategoricalDescriptorInputs exist

    When using properties of two molecules as explanatory variables, there are cases where features are created by combining them, such as ratios.
    (For example, I want to combine the properties of solvent and solute molecules with their concentrations.)
    In such cases, can I preprocess CategoricalDescriptorInput to create features that represent combined properties?

@jduerholt
Contributor

Nothing to apologize for!

Regarding your questions:

  1. Did you use the permutation feature importance? Currently it runs only over the original features and not the transformed ones. In principle it could be extended in this direction, but this will take a while, at least if we do it, as it currently does not have the highest priority. But feel free to give it a try! If you are using a SingleTaskGPSurrogate, you could have a look at the lengthscales; here it is shown how to extract them from the kernel:

    def lengthscale_importance(surrogate: SingleTaskGPSurrogate) -> pd.Series:

    The current implementation of the method will crash when using it with CategoricalDescriptorInputs, but it should be very easy for you to either extend the method or just apply the extraction on the fitted GP and assign the lengthscales to the individual features (a rough sketch follows below this list). In case of questions, I am happy to assist or provide you with an MWE.

  2. This is currently not yet implemented and only prepared via the ContinuousDescriptorInput, which is itself not yet fully supported in the GPs. For me the open question is still which mixing rule to apply, i.e. how to weight the explanatory features by the concentrations: arithmetic mean, geometric mean, ...? It would be really cool to integrate this into BoFire, and we could brainstorm together how to do it in the best way.
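Regarding the sketch mentioned in point 1, something like the following (untested, assuming the default ScaleKernel/Matern setup of the SingleTaskGPSurrogate) should already get you the per-column lengthscales:

import pandas as pd

def lengthscales_by_feature(surrogate, feature_names) -> pd.Series:
    # surrogate.model is the underlying GP; with a ScaleKernel the ARD
    # lengthscales live on the base kernel, otherwise on the kernel itself.
    kernel = surrogate.model.covar_module
    base = getattr(kernel, "base_kernel", kernel)
    lengthscales = base.lengthscale.detach().numpy().ravel()
    # There is one lengthscale per transformed column, so feature_names has to
    # list the expanded descriptor columns, not the original categorical keys.
    return pd.Series(lengthscales, index=feature_names)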

Best,

Johannes

@tatsuya-takakuwa
Author

@jduerholt

Thank you very much once again. Regarding the categorical descriptors, I managed to resolve it by creating a class that decomposes them for cross-validation. Thank you for your support!

As for the mixing rules, there is still discussion in the Bayesian optimization community, but for now, creating composite descriptors and applying recursive feature selection might be better. The paper linked below, which combines composite descriptors with recursive feature selection, has been very helpful.
[Link: https://www.sciencedirect.com/science/article/pii/S0264127520307838?via%3Dihub]

I'm thinking of developing a method to generate simple composite variable combinations from categorical descriptors.
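For example, something along these lines (just a sketch; the appropriate mixing rule is exactly the open point mentioned above):

import pandas as pd

def mix_descriptors(solvent: pd.Series, solute: pd.Series, x_solute: float) -> pd.Series:
    # Concentration-weighted arithmetic mean of two descriptor vectors;
    # a geometric mean or ratio-based features would be alternative rules.
    return (1.0 - x_solute) * solvent + x_solute * solute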

I have an additional question: is it possible to further classify the descriptors registered under categorical descriptors into continuous, discrete, and categorical?

Also, I understand that when experimental data is added via tell in a strategy, the surrogate is retrained.

Is it possible to set up cross-validation or LOO (leave-one-out)? This is very important for ensuring extrapolation with small data, so I would like to add this setting.

Thank you very much for your assistance.

@jduerholt
Contributor

Hi @tatsuya-takakuwa,

Regarding the class that you wrote for cross-validation: can you share it? I would be interested.

Regarding the descriptors: currently we only support ordinal ones there (meaning continuous and discrete), and there is no further classification. But of course you can set up categorical molecular features and use, for example, Mordred descriptors on the fly ...

Regarding tell: if you call tell, the surrogate models will be retrained on the whole dataset, but you can also instantiate the surrogate outside of the strategy and perform cross-validation via surrogate.cross_validate:

def cross_validate(
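For example (a rough sketch; the exact import paths and the mapping helper may differ slightly between BoFire versions):

import bofire.surrogates.api as surrogates
from bofire.data_models.surrogates.api import SingleTaskGPSurrogate

# Build the surrogate data model from the domain and map it to the functional surrogate.
gp_data_model = SingleTaskGPSurrogate(inputs=domain.inputs, outputs=domain.outputs)
gp = surrogates.map(gp_data_model)

# 5-fold CV on the experiments dataframe; folds=-1 would correspond to leave-one-out.
train_cv, test_cv, _ = gp.cross_validate(experiments, folds=5)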

Within the BO loop, you can define frequency_hyperopt; it then uses CV within tell to select the best set of hyperparameters and afterwards trains with the best hyperparameters on the whole dataset.

self.frequency_hyperopt = data_model.frequency_hyperopt

This works for every surrogate that implements a so-called Hyperconfig, such as the SingleTaskGPSurrogate:

hyperconfig: Optional[SingleTaskGPHyperconfig] = Field(

Was this helpful?

Best,

Johannes

@tatsuya-takakuwa
Author

@jduerholt
Thank you for sharing.
The categorical descriptors are decomposed using the following function:

from bofire.data_models.features.api import ContinuousInput

def decomposition_input_features(categorical_descriptors):
    # Descriptor table of the CategoricalDescriptorInput (one row per category).
    df = categorical_descriptors.to_df()
    # Collect the descriptor column names.
    descriptor_names = df.iloc[:, 1:].columns.tolist()

    # One ContinuousInput per descriptor, bounded by the observed min/max.
    descriptors = [
        ContinuousInput(key=name, bounds=(df[name].min(), df[name].max()))
        for name in descriptor_names
    ]
    return descriptors

Subsequently, the input features are updated, and the experiment data is updated as well.
I ran cross-validation using a RandomForest.
As a result, I obtained outcomes like the ones attached; the performance metrics also remained unchanged.

input_features = Inputs(features=decomposition_input_features(Molecule))

train_cv, test_cv, pi = model.cross_validate(
    experiments,
    folds=5,
    hooks={"permutation_importance": permutation_importance_hook},
    # hooks={"permutation_importance": permutation_importance_hook, "lengthscale_importance": lengthscale_importance_hook},
)
[attached plot of the cross-validation results]

Thank you also for your advice regarding cross-validation. Adopting your second suggestion made the process simpler:
sobo_strategy_data_model = SoboStrategy(domain=domain, surrogate_specs=surrogate_specs, acquisition_function=qNEI(), folds=-1)

@jduerholt
Contributor

Regarding your last line: you also have to set frequency_hyperopt to something larger than zero, only then will it be used ;)
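I.e. something like this (sketch; frequency_hyperopt > 0 is needed so that the CV-based hyperparameter optimization, and with it the folds setting, is actually triggered):

sobo_strategy_data_model = SoboStrategy(
    domain=domain,
    surrogate_specs=surrogate_specs,
    acquisition_function=qNEI(),
    folds=-1,              # leave-one-out CV during hyperparameter selection
    frequency_hyperopt=1,  # > 0, otherwise the hyperopt is never used
)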

The helper function regarding the cross val is smart! Good idea.
