Expose surrogate #355

Merged: 20 commits merged from feature/expose_models into main on Sep 9, 2024
Conversation

AdrianSosic (Collaborator)

This PR enables convenient access to the surrogate model and posterior predictive distribution via the Campaign class.

@AdrianSosic added the "new feature" (New functionality) label on Aug 30, 2024
@AdrianSosic self-assigned this on Aug 30, 2024
AdrianSosic (Collaborator, Author) commented Aug 30, 2024

@Scienfitz, @AVHopp: Here is my proposal for how to provide convenient high-level access to the model internals through the campaign object.

This makes it very easy for the user to apply model diagnostics. Below is an example for SHAP. Notice that the only important piece of code is the lambda x: campaign.get_surrogate().posterior(x).mean callable; nothing else is required from the user. And, of course, the "explainer" can now simply be exchanged for any other black-box feature importance method, not necessarily restricted to SHAP.

import numpy as np
import shap

from baybe.campaign import Campaign
from baybe.parameters.numerical import NumericalContinuousParameter
from baybe.recommenders.pure.bayesian.botorch import BotorchRecommender
from baybe.searchspace.core import SearchSpace
from baybe.targets.numerical import NumericalTarget


def blackbox(x: np.ndarray) -> np.ndarray:
    """Quadratic function embedded into higher-dimensional space."""
    assert x.shape[1] >= 2
    return np.power(x[:, [0, 1]].sum(axis=1), 2)


N_PARAMETERS = 10
N_DATA = 100

# Campaign settings
parameters = [
    NumericalContinuousParameter(f"p{i}", (-1, 1)) for i in range(N_PARAMETERS)
]
searchspace = SearchSpace.from_product(parameters)
objective = NumericalTarget("t", "MIN").to_objective()
campaign = Campaign(searchspace, objective, recommender=BotorchRecommender())

# Create measurements at random candidates
measurements = searchspace.continuous.sample_uniform(N_DATA)
measurements["t"] = blackbox(measurements.values)
campaign.add_measurements(measurements)

# Evaluate Shap values
df = campaign.measurements[[p.name for p in campaign.parameters]]
explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df)
shap_values = explainer(df)
shap.plots.bar(shap_values)

If we decide this is the way to go, the next step could be to design a diagnostics subpackage or similar, where we provide generic skeletons for different explainers. One idea would be to provide an abstract baybe "Explainer" class (or a similarly named one) with a suitable interface that offers an abstraction layer around the individual feature importance methods. The goal should be that users can simply specify a diagnostics method of their choice and a corresponding model and get the desired result. I'm not sure, though, what the common requirements are across the different feature explaining methods we wish to target.
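A very rough sketch of what such an abstraction could look like is below. All class and method names are purely illustrative placeholders, not an agreed design:

from abc import ABC, abstractmethod

import pandas as pd

from baybe.campaign import Campaign


class CampaignExplainer(ABC):
    """Illustrative skeleton for a generic feature-importance wrapper (placeholder name)."""

    def __init__(self, campaign: Campaign) -> None:
        self._campaign = campaign

    def _predict(self, x):
        """The quantity to be explained: the posterior mean of the campaign's surrogate."""
        return self._campaign.get_surrogate().posterior(x).mean

    @abstractmethod
    def explain(self, data: pd.DataFrame):
        """Compute feature-importance scores for the given data."""


class ShapExplainer(CampaignExplainer):
    """SHAP-based realization of the illustrative interface."""

    def explain(self, data: pd.DataFrame):
        import shap

        explainer = shap.Explainer(self._predict, data)
        return explainer(data)

A user would then only have to pick an explainer, e.g. ShapExplainer(campaign).explain(df), and the wrapper takes care of constructing the posterior callable.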

Also tagging @brandon-holt and @Alex6022 here, who expressed interest in the feature.

AVHopp (Collaborator) left a comment:

Extremely short and unfinished first round of quick comments.

@AdrianSosic force-pushed the feature/expose_models branch 3 times, most recently from 97c800c to 4120fbb, on September 6, 2024.
AVHopp (Collaborator) left a comment:

New approach looks good to me.

@Scienfitz merged commit 9bda168 into main on Sep 9, 2024
9 of 11 checks passed
@Scienfitz deleted the feature/expose_models branch on September 9, 2024 at 11:26
brandon-holt (Contributor) commented Sep 17, 2024

@AdrianSosic @Scienfitz @AVHopp Hi, I am trying this approach for a campaign with SubstanceParameters and CustomDiscreteParameters, and I can't get it to work with the SHAP analysis portion of the code:

# Evaluate Shap values
df = campaign.measurements[[p.name for p in campaign.parameters]]
explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df)
shap_values = explainer(df)
shap.plots.bar(shap_values)

If you run this as is with a campaign that has categorical parameters or any custom encoding, it fails because of the categorical parameters with the error: TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''.

I've tried two approaches:

  1. Replace categorical parameters with their custom encodings
  2. Replace the categorical parameter strings with a single number encoding

Approach 1 looks like this:

import pandas as pd
import shap

from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter

df = campaign.measurements[[p.name for p in campaign.parameters]].copy()
original_df = df.copy()

def replace_with(df, lookup_df, replace_col):
    # Dictionary to hold new columns
    new_columns = {col: [None] * len(df) for col in lookup_df.columns}
    
    # Replace values using lookup
    for i, row in df.iterrows():
        lookup_value = row[replace_col]
        if lookup_value in lookup_df.index:
            lookup_row = lookup_df.loc[lookup_value]
            for col in lookup_df.columns:
                new_columns[col][i] = lookup_row[col]
    
    # Create a new DataFrame with the new columns
    new_df = pd.DataFrame(new_columns, index=df.index)
    
    # Concatenate the new columns to the original DataFrame
    df = pd.concat([df, new_df], axis=1)
    
    # Delete the column parameter.name
    df.drop(replace_col, axis=1, inplace=True)
    
    return df

for parameter in campaign.parameters:
    if isinstance(parameter, SubstanceParameter):
        df = replace_with(df, parameter.comp_df, parameter.name)
    elif isinstance(parameter, CustomDiscreteParameter):
        df = replace_with(df, parameter.data, parameter.name)

print(df)

explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df, max_evals=5209)
shap_values = explainer(df)
shap.plots.bar(shap_values)

And it fails because the columns you deleted are missing. I tried setting allow_missing and allow_extra to True, but that doesn't work either.

Approach 2 looks like this:

import shap

from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter

df = campaign.measurements[[p.name for p in campaign.parameters]].copy()

for parameter in campaign.parameters:
    if isinstance(parameter, SubstanceParameter) or isinstance(parameter, CustomDiscreteParameter):
        df[parameter.name] = df[parameter.name].astype('category')
        df[parameter.name] = df[parameter.name].cat.codes
        # convert df[parameter.name] to float64
        df[parameter.name] = df[parameter.name].astype('float64')
        if 'Labels' in parameter.comp_df.columns:
            parameter.comp_df['Labels'] = parameter.comp_df['Labels'].astype('float64')

print(df)

explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df)
shap_values = explainer(df)
shap.plots.bar(shap_values)

And it fails with the error: ValueError: You are trying to merge on float64 and object columns for key 'Labels'. If you wish to proceed you should use pd.concat, which I don't fully understand.

Can you please help me find a solution to generate SHAP analyses for campaigns with custom encodings or SubstanceParameters? Ideally, this would give importance scores for each feature in the custom encodings, not just the categorical values themselves.

UPDATE: I modified the code to work, but now I run into this error:

import pandas as pd
import shap

from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter

global original_df
df = campaign.measurements[[p.name for p in campaign.parameters]].copy()
original_df = df.copy()

def replace_with(df, lookup_df, replace_col):
    # Dictionary to hold new columns
    new_columns = {col: [None] * len(df) for col in lookup_df.columns}
    
    # Replace values using lookup
    for i, row in df.iterrows():
        lookup_value = row[replace_col]
        if lookup_value in lookup_df.index:
            lookup_row = lookup_df.loc[lookup_value]
            for col in lookup_df.columns:
                new_columns[col][i] = lookup_row[col]
    
    # Create a new DataFrame with the new columns
    new_df = pd.DataFrame(new_columns, index=df.index)
    
    # Concatenate the new columns to the original DataFrame
    df = pd.concat([df, new_df], axis=1)
    
    # Delete the column parameter.name
    df.drop(replace_col, axis=1, inplace=True)
    
    return df

for parameter in campaign.parameters:
    if isinstance(parameter, SubstanceParameter):
        df = replace_with(df, parameter.comp_df, parameter.name)
    elif isinstance(parameter, CustomDiscreteParameter):
        df = replace_with(df, parameter.data, parameter.name)

# add a column to df to save the original index of the row
df['original_index'] = df.index

print(df)

def model(x):
    global original_df
    original_indices = x['original_index'].values
    # build a new_df by going through each original_index and adding the corresponding row from original_df to new_df
    new_df = pd.DataFrame(columns=original_df.columns)
    rows = [original_df.loc[index] for index in original_indices]
    new_df = pd.concat(rows, axis=1).T.reset_index(drop=True)
    print(new_df)

    return campaign.get_surrogate().posterior(new_df).mean

explainer = shap.Explainer(model, df, max_evals=5211)
shap_values = explainer(df)
shap.plots.bar(shap_values)

AttributeError: 'GaussianProcessSurrogate' object has no attribute '_input_scaler'
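(For context: the snippets above omit the campaign definition. A purely illustrative minimal setup with dummy values, roughly of the kind requested in the reply below, might look like the following; the parameter names, SMILES strings, and descriptor columns are arbitrary.)

import pandas as pd

from baybe.campaign import Campaign
from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter
from baybe.searchspace.core import SearchSpace
from baybe.targets.numerical import NumericalTarget

# Dummy substance parameter: labels mapped to SMILES strings
solvent = SubstanceParameter(
    name="Solvent", data={"Water": "O", "Ethanol": "CCO", "Acetone": "CC(C)=O"}
)

# Dummy custom parameter: labels with a user-defined numeric encoding
additive = CustomDiscreteParameter(
    name="Additive",
    data=pd.DataFrame({"d1": [1.0, 2.0, 3.0], "d2": [0.1, 0.4, 0.2]}, index=["A", "B", "C"]),
)

searchspace = SearchSpace.from_product([solvent, additive])
objective = NumericalTarget("t", "MAX").to_objective()
campaign = Campaign(searchspace, objective)
# ... add measurements via campaign.add_measurements(...) before querying the surrogate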

AVHopp (Collaborator) commented Sep 18, 2024

@brandon-holt Can you please post this as an issue and mention the corresponding PR there? It is easier for us if we have everything that requires our input in a single place, and discussing the issue is much easier there. Thanks :)
EDIT: Also, when moving your example there, please provide a minimal search space and the full setup that creates the error, since the definition of the campaign object is not part of your code. Feel free to use arbitrary dummy values as long as they recreate the error, but try to keep the space small.

zhensongds

Hi @AdrianSosic @Scienfitz @AVHopp, thank you for the great work! I typically work with multi-objective optimization use cases. Do you have an estimated timeline for when this will support multi-target mode?

AVHopp (Collaborator) commented Sep 19, 2024

@zhensongds Thanks for your interest in BayBE :) Could you please ask this question in our "Issues" tab here on GitHub? We'd prefer to have all of our interactions in a single place, since otherwise there is a risk of us not spotting questions. Your question will then also be easier to find for others who might be interested in the answer.

AdrianSosic (Collaborator, Author)

Hi @zhensongds. Yes, we'd appreciate it if you could ask any further questions in the form of issues to streamline communication. However, now that the question is already here: multi-target optimization support is planned for 2024Q4, and exposing the corresponding surrogate models will happen along the way 👌 I'm currently on vacation until early October but will start working on it once I'm back.

zhensongds

Thank you @AdrianSosic and @AVHopp. I'm excited about the upcoming updates! I'll open an issue if I have more questions from now on.

@Alex6022 mentioned this pull request on Oct 4, 2024
AdrianSosic (Collaborator, Author)

> @AdrianSosic @Scienfitz @AVHopp Hi, I am trying this approach for a campaign with SubstanceParameters and CustomDiscreteParameters, and I can't get it to work with the SHAP analysis portion of the code

Hi @brandon-holt. I just checked and think that this is not a problem with baybe per se but rather stems from the fact that the shap package does not like to process dataframes with non-numeric columns. I have to say I'm not very experienced with the different explainers they offer, so I'm not sure if there is a way around it from the shap side. However, the obvious solution would be to work with the computational representation of the data instead of the experimental one, which then contains only float values. I'd say let's first see how far #391 brings us 👍🏼 Perhaps the problem will become obsolete then.
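For illustration only, a rough sketch of that computational-representation route might look as follows. Here, searchspace.transform is assumed to map the experimental dataframe to its numeric computational representation, and posterior_comp is a hypothetical placeholder (not a confirmed baybe method) for a posterior that accepts comp-rep inputs:

import shap

# Sketch only: explain the surrogate in the numeric computational representation.
# `campaign.searchspace.transform` is assumed to convert experimental rows to comp-rep columns;
# `posterior_comp` is a hypothetical placeholder for a posterior over that representation.
df_exp = campaign.measurements[[p.name for p in campaign.parameters]]
df_comp = campaign.searchspace.transform(df_exp)

explainer = shap.Explainer(
    lambda x: campaign.get_surrogate().posterior_comp(x).mean, df_comp
)
shap_values = explainer(df_comp)  # importance per comp-rep feature, e.g. per descriptor
shap.plots.bar(shap_values)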
