Expose surrogate #355

Merged: 20 commits merged from feature/expose_models into main on Sep 9, 2024
Conversation

AdrianSosic (Collaborator)

This PR enables convenient access to the surrogate model and posterior predictive distribution via the Campaign class.

@AdrianSosic added the "new feature" (New functionality) label on Aug 30, 2024
@AdrianSosic self-assigned this on Aug 30, 2024
AdrianSosic (Collaborator, Author) commented Aug 30, 2024

@Scienfitz, @AVHopp: Here is my proposal for how to provide convenient high-level access to the model internals through the campaign object.

This makes it very easy for the user to apply model diagnostics. Below is an example for SHAP. Notice that the only important piece of code is the lambda x: campaign.get_surrogate().posterior(x).mean callable; nothing else is required from the user. And, of course, the "explainer" can now simply be exchanged for any other black-box feature importance method, not necessarily restricted to SHAP.

import numpy as np
import shap

from baybe.campaign import Campaign
from baybe.parameters.numerical import NumericalContinuousParameter
from baybe.recommenders.pure.bayesian.botorch import BotorchRecommender
from baybe.searchspace.core import SearchSpace
from baybe.targets.numerical import NumericalTarget


def blackbox(x: np.ndarray) -> np.ndarray:
    """Quadratic function embedded into higher-dimensional space."""
    assert x.shape[1] >= 2
    return np.power(x[:, [0, 1]].sum(axis=1), 2)


N_PARAMETERS = 10
N_DATA = 100

# Campaign settings
parameters = [
    NumericalContinuousParameter(f"p{i}", (-1, 1)) for i in range(N_PARAMETERS)
]
searchspace = SearchSpace.from_product(parameters)
objective = NumericalTarget("t", "MIN").to_objective()
campaign = Campaign(searchspace, objective, recommender=BotorchRecommender())

# Create measurements at random candidates
measurements = searchspace.continuous.sample_uniform(N_DATA)
measurements["t"] = blackbox(measurements.values)
campaign.add_measurements(measurements)

# Evaluate Shap values
df = campaign.measurements[[p.name for p in campaign.parameters]]
explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df)
shap_values = explainer(df)
shap.plots.bar(shap_values)

If we decide this is the way to go, the next step could be to design a diagnostics subpackage or similar, where we provide generic skeletons for different explainers. One idea would be to provide an abstract baybe "Explainer" class (or a similarly named one) with a suitable interface that offers an abstraction layer around the individual feature importance methods. The goal should be that users can simply specify a diagnostics method of their choice and a corresponding model and get the desired result. I'm not sure, though, what the common requirements are across the different feature explaining methods we wish to target.
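A very rough sketch of what such an abstraction could look like is below. All class and method names are purely illustrative placeholders, not an agreed design:

from abc import ABC, abstractmethod

import pandas as pd

from baybe.campaign import Campaign


class CampaignExplainer(ABC):
    """Illustrative skeleton for a generic feature-importance wrapper (placeholder name)."""

    def __init__(self, campaign: Campaign) -> None:
        self._campaign = campaign

    def _predict(self, x):
        """The quantity to be explained: the posterior mean of the campaign's surrogate."""
        return self._campaign.get_surrogate().posterior(x).mean

    @abstractmethod
    def explain(self, data: pd.DataFrame):
        """Compute feature-importance scores for the given data."""


class ShapExplainer(CampaignExplainer):
    """SHAP-based realization of the illustrative interface."""

    def explain(self, data: pd.DataFrame):
        import shap

        explainer = shap.Explainer(self._predict, data)
        return explainer(data)

A user would then only have to pick an explainer, e.g. ShapExplainer(campaign).explain(df), and the wrapper takes care of constructing the posterior callable.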

Also tagging @brandon-holt and @Alex6022 here, who expressed interest in the feature.

AVHopp (Collaborator) left a comment:

Extremely short and unfinished first round of quick comments.

@AdrianSosic force-pushed the feature/expose_models branch 3 times, most recently from 97c800c to 4120fbb, on September 6, 2024.
AVHopp (Collaborator) left a comment:

New approach looks good to me.

@Scienfitz merged commit 9bda168 into main on Sep 9, 2024
9 of 11 checks passed
@Scienfitz deleted the feature/expose_models branch on September 9, 2024 at 11:26
brandon-holt (Contributor) commented Sep 17, 2024

@AdrianSosic @Scienfitz @AVHopp Hi, I am trying this approach for a campaign with SubstanceParameters and CustomDiscreteParameters, and I can't get it to work with the SHAP analysis portion of the code:

# Evaluate Shap values
df = campaign.measurements[[p.name for p in campaign.parameters]]
explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df)
shap_values = explainer(df)
shap.plots.bar(shap_values)

If you run this as is with a campaign that has categorical parameters or any custom encoding, it fails because of the categorical parameters with the error: TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''.

I've tried two approaches:

  1. Replace categorical parameters with their custom encodings
  2. Replace the categorical parameter strings with a single number encoding

Approach 1 looks like this:

import pandas as pd
import shap

from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter

df = campaign.measurements[[p.name for p in campaign.parameters]].copy()
original_df = df.copy()

def replace_with(df, lookup_df, replace_col):
    # Dictionary to hold new columns
    new_columns = {col: [None] * len(df) for col in lookup_df.columns}
    
    # Replace values using lookup
    for i, row in df.iterrows():
        lookup_value = row[replace_col]
        if lookup_value in lookup_df.index:
            lookup_row = lookup_df.loc[lookup_value]
            for col in lookup_df.columns:
                new_columns[col][i] = lookup_row[col]
    
    # Create a new DataFrame with the new columns
    new_df = pd.DataFrame(new_columns, index=df.index)
    
    # Concatenate the new columns to the original DataFrame
    df = pd.concat([df, new_df], axis=1)
    
    # Delete the column parameter.name
    df.drop(replace_col, axis=1, inplace=True)
    
    return df

for parameter in campaign.parameters:
    if isinstance(parameter, SubstanceParameter):
        df = replace_with(df, parameter.comp_df, parameter.name)
    elif isinstance(parameter, CustomDiscreteParameter):
        df = replace_with(df, parameter.data, parameter.name)

print(df)

explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df, max_evals=5209)
shap_values = explainer(df)
shap.plots.bar(shap_values)

And it fails because the columns you deleted are missing. I tried setting allow_missing and allow_extra to True, but that doesn't work either.

Approach 2 looks like this:

import shap

from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter

df = campaign.measurements[[p.name for p in campaign.parameters]].copy()

for parameter in campaign.parameters:
    if isinstance(parameter, SubstanceParameter) or isinstance(parameter, CustomDiscreteParameter):
        df[parameter.name] = df[parameter.name].astype('category')
        df[parameter.name] = df[parameter.name].cat.codes
        # convert df[parameter.name] to float64
        df[parameter.name] = df[parameter.name].astype('float64')
        if 'Labels' in parameter.comp_df.columns:
            parameter.comp_df['Labels'] = parameter.comp_df['Labels'].astype('float64')

print(df)

explainer = shap.Explainer(lambda x: campaign.get_surrogate().posterior(x).mean, df)
shap_values = explainer(df)
shap.plots.bar(shap_values)

And it fails with the error: ValueError: You are trying to merge on float64 and object columns for key 'Labels'. If you wish to proceed you should use pd.concat, which I don't fully understand.

Can you please help me find a solution to generate SHAP analyses for campaigns with custom encodings or SubstanceParameters? Ideally, this would give importance scores for each feature in the custom encodings, not just the categorical values themselves.

UPDATE: I modified the code to work, but now I run into this error:

import pandas as pd
import shap

from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter

global original_df
df = campaign.measurements[[p.name for p in campaign.parameters]].copy()
original_df = df.copy()

def replace_with(df, lookup_df, replace_col):
    # Dictionary to hold new columns
    new_columns = {col: [None] * len(df) for col in lookup_df.columns}
    
    # Replace values using lookup
    for i, row in df.iterrows():
        lookup_value = row[replace_col]
        if lookup_value in lookup_df.index:
            lookup_row = lookup_df.loc[lookup_value]
            for col in lookup_df.columns:
                new_columns[col][i] = lookup_row[col]
    
    # Create a new DataFrame with the new columns
    new_df = pd.DataFrame(new_columns, index=df.index)
    
    # Concatenate the new columns to the original DataFrame
    df = pd.concat([df, new_df], axis=1)
    
    # Delete the column parameter.name
    df.drop(replace_col, axis=1, inplace=True)
    
    return df

for parameter in campaign.parameters:
    if isinstance(parameter, SubstanceParameter):
        df = replace_with(df, parameter.comp_df, parameter.name)
    elif isinstance(parameter, CustomDiscreteParameter):
        df = replace_with(df, parameter.data, parameter.name)

# add a column to df to save the original index of the row
df['original_index'] = df.index

print(df)

def model(x):
    global original_df
    original_indices = x['original_index'].values
    # build a new_df by going through each original_index and adding the corresponding row from original_df to new_df
    new_df = pd.DataFrame(columns=original_df.columns)
    rows = [original_df.loc[index] for index in original_indices]
    new_df = pd.concat(rows, axis=1).T.reset_index(drop=True)
    print(new_df)

    return campaign.get_surrogate().posterior(new_df).mean

explainer = shap.Explainer(model, df, max_evals=5211)
shap_values = explainer(df)
shap.plots.bar(shap_values)

AttributeError: 'GaussianProcessSurrogate' object has no attribute '_input_scaler'
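(For context: the snippets above omit the campaign definition. A purely illustrative minimal setup with dummy values, roughly of the kind requested in the reply below, might look like the following; the parameter names, SMILES strings, and descriptor columns are arbitrary.)

import pandas as pd

from baybe.campaign import Campaign
from baybe.parameters.custom import CustomDiscreteParameter
from baybe.parameters.substance import SubstanceParameter
from baybe.searchspace.core import SearchSpace
from baybe.targets.numerical import NumericalTarget

# Dummy substance parameter: labels mapped to SMILES strings
solvent = SubstanceParameter(
    name="Solvent", data={"Water": "O", "Ethanol": "CCO", "Acetone": "CC(C)=O"}
)

# Dummy custom parameter: labels with a user-defined numeric encoding
additive = CustomDiscreteParameter(
    name="Additive",
    data=pd.DataFrame({"d1": [1.0, 2.0, 3.0], "d2": [0.1, 0.4, 0.2]}, index=["A", "B", "C"]),
)

searchspace = SearchSpace.from_product([solvent, additive])
objective = NumericalTarget("t", "MAX").to_objective()
campaign = Campaign(searchspace, objective)
# ... add measurements via campaign.add_measurements(...) before querying the surrogate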

AVHopp (Collaborator) commented Sep 18, 2024

@brandon-holt Can you please post this as an issue and mention the corresponding PR there? It is easier for us if we have everything that requires our input in a single place, and discussing the issue is much easier there. Thanks :)
EDIT: Also, when moving your example there, please provide a minimal search space and the full setup that creates the error, since the definition of the campaign object is not part of your code. Feel free to use arbitrary dummy values as long as they recreate the error, but try to keep the space small.

zhensongds

Hi @AdrianSosic @Scienfitz @AVHopp, thank you for the great work! I typically work with multi-objective optimization use cases. Do you have an estimated timeline for when this will support multi-target mode?

AVHopp (Collaborator) commented Sep 19, 2024

@zhensongds Thanks for your interest in BayBE :) Could you please ask this question in our "Issues" tab here on GitHub? We'd prefer to have all of our interactions in a single place, since otherwise there is a risk of us not spotting questions. Your question will then also be easier to find for others who might be interested in the answer.

AdrianSosic (Collaborator, Author)

Hi @zhensongds. Yes, we'd appreciate it if you could ask any further questions in the form of issues to streamline communication. However, now that the question is already here: multi-target optimization support is planned for 2024Q4, and exposing the corresponding surrogate models will happen along the way 👌 I'm currently on vacation until early October but will start working on it once I'm back.

zhensongds

Thank you @AdrianSosic and @AVHopp. I'm excited about the upcoming updates! I'll open an issue if I have more questions from now on.

@Alex6022 mentioned this pull request on Oct 4, 2024
AdrianSosic (Collaborator, Author)

> @AdrianSosic @Scienfitz @AVHopp Hi, I am trying this approach for a campaign with SubstanceParameters and CustomDiscreteParameters, and I can't get it to work with the SHAP analysis portion of the code

Hi @brandon-holt. I just checked and think that this is not a problem with baybe per se but rather stems from the fact that the shap package does not like to process dataframes with non-numeric columns. I have to say I'm not very experienced with the different explainers they offer, so I'm not sure if there is a way around it from the shap side. However, the obvious solution would be to work with the computational representation of the data instead of the experimental one, which then contains only float values. I'd say let's first see how far #391 brings us 👍🏼 Perhaps the problem will become obsolete then.
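For illustration only, a rough sketch of that computational-representation route might look as follows. Here, searchspace.transform is assumed to map the experimental dataframe to its numeric computational representation, and posterior_comp is a hypothetical placeholder (not a confirmed baybe method) for a posterior that accepts comp-rep inputs:

import shap

# Sketch only: explain the surrogate in the numeric computational representation.
# `campaign.searchspace.transform` is assumed to convert experimental rows to comp-rep columns;
# `posterior_comp` is a hypothetical placeholder for a posterior over that representation.
df_exp = campaign.measurements[[p.name for p in campaign.parameters]]
df_comp = campaign.searchspace.transform(df_exp)

explainer = shap.Explainer(
    lambda x: campaign.get_surrogate().posterior_comp(x).mean, df_comp
)
shap_values = explainer(df_comp)  # importance per comp-rep feature, e.g. per descriptor
shap.plots.bar(shap_values)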
