
[Review] Feature: Balanced-OPE estimators #146

Merged: 20 commits merged into st-tech:master on Jan 12, 2022

Conversation


@fullflu fullflu commented Nov 18, 2021

Overview

  • Add Balanced OPE (B-OPE) estimators

Tasks

  • initial implementation of a self-normalized balanced IPS estimator (to check the behavior of my naive implementation and consider the direction of the final implementation)
  • add cross fitting
  • fix action sampling procedure
  • add pscore estimator
  • add estimated_pscore to ope estimators
  • add importance_sampling_ratio to ope estimators
  • fix meta.py
  • add calibration
  • evaluate evaluation_policy_sample_method using various types of classifiers and hyperparameters
  • create evaluation notebook
  • request a 1st review and reflect the response
  • add balanced dr estimators and report the behavior of several B-OPE estimators
  • fix existing tests
  • add new tests
  • request a 2nd review and reflect the response

Details

Self-normalized Balanced IPS

It is not trivial how to handle the action features of a stochastic evaluation policy. I implemented three methods, selected via evaluation_policy_sample_method (a sketch of the classifier inputs each option builds follows this list).

  1. raw: action_dist_at_position is directly encoded as the action features.
  2. sample: actions are sampled from the evaluation policy.
  3. weighted_loss: for each round, n_actions rows are duplicated. For each duplicated row, the action features are the one-hot encoding of the corresponding action. The classification model is trained with sample_weight, where sample_weight is the probability that the corresponding action is sampled (action_dist_at_position[:, action_idx]).
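
For concreteness, here is a rough sketch (not the actual implementation) of the classifier inputs each option builds, assuming len_list=1 so that action_dist_at_position has shape (n_rounds, n_actions); function and variable names are illustrative only. Logged rows get label 0 and evaluation-policy rows get label 1.

import numpy as np

def build_classifier_inputs(context, action, action_dist, n_actions, method, rng):
    # Returns (X, y, sample_weight) for the behavior-vs-evaluation-policy classifier.
    # Label 0 = factual (logged) rows, label 1 = evaluation-policy rows.
    n_rounds = context.shape[0]
    one_hot_factual = np.eye(n_actions)[action]  # features in {0, 1}
    X_factual = np.concatenate([context, one_hot_factual], axis=1)

    if method == "raw":
        # use the evaluation policy's action distribution directly as action features
        X_eval = np.concatenate([context, action_dist], axis=1)
        w_eval = np.ones(n_rounds)
    elif method == "sample":
        # sample one action per round from the evaluation policy and one-hot encode it
        sampled = np.array(
            [rng.choice(n_actions, p=action_dist[i]) for i in range(n_rounds)]
        )
        X_eval = np.concatenate([context, np.eye(n_actions)[sampled]], axis=1)
        w_eval = np.ones(n_rounds)
    else:  # "weighted_loss"
        # duplicate each round n_actions times, one-hot encode every action, and
        # weight each duplicated row by its evaluation-policy probability
        X_eval = np.concatenate(
            [np.repeat(context, n_actions, axis=0), np.tile(np.eye(n_actions), (n_rounds, 1))],
            axis=1,
        )
        w_eval = action_dist.flatten()

    X = np.concatenate([X_factual, X_eval], axis=0)
    y = np.concatenate([np.zeros(n_rounds), np.ones(X_eval.shape[0])])
    sample_weight = np.concatenate([np.ones(n_rounds), w_eval])
    return X, y, sample_weight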

Reference

https://arxiv.org/abs/1906.03694


fullflu commented Nov 18, 2021

Observation

The initial implementation of a self-normalized balanced IPS estimator suggests the following (I will commit a notebook later):

  1. When the evaluation policy is deterministic, all three evaluation_policy_sample_method options work as well as existing estimators such as IPS and DR (raw and sample produce the same output).
  2. When the evaluation policy is stochastic, raw does not work well (inf) and sample leads to poor performance. However, weighted_loss has performance comparable to the existing estimators.

Policy value estimation of a deterministic evaluation policy (IPWLearner)
- Ground truth: 0.6205135606005601
- raw: 0.633716233203124
- sample: 0.633716233203124
- weighted_loss: 0.6312631775591353
- IPS: 0.703706
- DR: 0.604306

Policy value estimation of a stochastic evaluation policy (IPWLearner)
- Ground truth: 0.5025620586676971
- raw: nan
- sample: 0.5894081429941057
- weighted_loss: 0.506395941886161
- IPS: 0.509074
- DR: 0.506203

Discussion

  • raw likely fails because the domain of the action features of the factual rows (\in {0, 1}) differs from that of the stochastic evaluation policy (\in [0, 1]). In fact, when we use RandomForestClassifier, the predicted values for the factual rows are all zero and those for the counterfactual rows are all one (see the toy illustration below).
  • sample likely performs poorly because the sampling procedure discards information about the evaluation policy distribution.
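
A toy illustration of the first point (assumed setup, not the notebook code): with the raw encoding, the factual action features are one-hot vectors in {0, 1} while the evaluation-policy features are soft probabilities, so a tree-based classifier can separate the two classes almost perfectly, its predicted probabilities collapse toward 0/1, and the implied odds-based weights degenerate (consistent with the nan/inf behavior above).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(12345)
n_rounds, n_actions = 1000, 10

# factual rows: one-hot encoded logged actions (features in {0, 1})
factual = np.eye(n_actions)[rng.integers(n_actions, size=n_rounds)]
# evaluation-policy rows under "raw": soft action probabilities (features in (0, 1))
logits = rng.normal(size=(n_rounds, n_actions))
soft = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

X = np.concatenate([factual, soft], axis=0)
y = np.concatenate([np.zeros(n_rounds), np.ones(n_rounds)])
clf = RandomForestClassifier(random_state=12345).fit(X, y)

# predicted probability of the "evaluation policy" class for the factual rows;
# these are typically (near) zero, so the odds p / (1 - p) degenerate
p_factual = clf.predict_proba(factual)[:, 1]
print(p_factual.min(), p_factual.max())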

Direction

  • When the evaluation policy is stochastic, weighted_loss should be used.
  • When the evaluation policy is deterministic, raw can be used.
  • sample seems to be unnecessary.
  • We should implement a ClassificationModel class in classification_model.py and simplify estimators.py (a rough interface sketch follows).
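
As a starting point for that refactoring, here is a rough sketch of what a ClassificationModel-style interface could look like, mirroring RegressionModel.fit_predict with cross-fitting; the class and argument names are hypothetical and do not correspond to the merged obp code.

import numpy as np
from dataclasses import dataclass
from sklearn.base import ClassifierMixin, clone
from sklearn.model_selection import KFold

@dataclass
class ImportanceWeightClassifier:
    # Hypothetical sketch: a classifier that distinguishes logged (behavior-policy)
    # rows from evaluation-policy rows and returns importance weights via its odds.
    base_model: ClassifierMixin
    n_actions: int

    def fit_predict(self, context, action, action_dist, n_folds=2, random_state=None):
        n_rounds = context.shape[0]
        X_factual = np.concatenate([context, np.eye(self.n_actions)[action]], axis=1)
        X_eval = np.concatenate([context, action_dist], axis=1)  # "raw" encoding
        X = np.concatenate([X_factual, X_eval], axis=0)
        y = np.concatenate([np.zeros(n_rounds), np.ones(n_rounds)])

        weights = np.zeros(n_rounds)
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
        for train_idx, test_idx in kf.split(np.arange(n_rounds)):
            model = clone(self.base_model)
            idx = np.concatenate([train_idx, train_idx + n_rounds])  # both classes
            model.fit(X[idx], y[idx])
            p = np.clip(model.predict_proba(X_factual[test_idx])[:, 1], 1e-6, 1 - 1e-6)
            weights[test_idx] = p / (1.0 - p)  # odds as the estimated pi_e / pi_b ratio
        return weights

The weights returned here would then feed into a balanced IPS-style estimate such as the one sketched later in this thread.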


fullflu commented Nov 18, 2021

OPE script (stochastic policy)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# import open bandit pipeline (obp)
from obp.dataset import (
    SyntheticBanditDataset,
    logistic_reward_function,
    linear_behavior_policy,
)

from obp.policy import IPWLearner
from obp.ope import (
    OffPolicyEvaluation,
    RegressionModel,
    InverseProbabilityWeighting as IPS,
    SelfNormalizedInverseProbabilityWeighting as SNIPS,
    DirectMethod as DM,
    DoublyRobust as DR,
    DoublyRobustWithShrinkage as DRos,
    BalancedInverseProbabilityWeighting as BIPW,
)


# define a synthetic bandit dataset
n_actions = 10
dim_context = 8
len_list = 1
random_state = 12345
dataset = SyntheticBanditDataset(
    n_actions=n_actions,
    dim_context=dim_context,
    beta=0.2,
    reward_function=logistic_reward_function,
    behavior_policy_function=linear_behavior_policy,
    random_state=random_state,
)

# training data is used to train an evaluation policy
train_bandit_data = dataset.obtain_batch_bandit_feedback(n_rounds=5000)

# test bandit data is used to approximate the ground-truth policy value
test_bandit_data = dataset.obtain_batch_bandit_feedback(n_rounds=100000)

# evaluation policy training
ipw_learner = IPWLearner(
    n_actions=dataset.n_actions,
    base_classifier=RandomForestClassifier(random_state=random_state),
)
ipw_learner.fit(
    context=train_bandit_data["context"],
    action=train_bandit_data["action"],
    reward=train_bandit_data["reward"],
    pscore=train_bandit_data["pscore"],
)
action_dist_ipw_train = ipw_learner.predict_proba(
    context=train_bandit_data["context"],
)
action_dist_ipw_test = ipw_learner.predict_proba(
    context=test_bandit_data["context"],
)
policy_value_of_ipw = dataset.calc_ground_truth_policy_value(
    expected_reward=test_bandit_data["expected_reward"],
    action_dist=action_dist_ipw_test,
)

num_data = 1000

validation_bandit_data = dataset.obtain_batch_bandit_feedback(n_rounds=num_data)

# make decisions on validation data
action_dist_ipw_val = ipw_learner.predict_proba(
    context=validation_bandit_data["context"],
)

# OPE using validation data
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    base_model=LogisticRegression(C=100, max_iter=10000, random_state=random_state),
)
estimated_rewards = regression_model.fit_predict(
    context=validation_bandit_data["context"],  # context; x
    action=validation_bandit_data["action"],  # action; a
    reward=validation_bandit_data["reward"],  # reward; r
    n_folds=2,  # 2-fold cross fitting
    random_state=12345,
)

bipw = BIPW(
    estimator_name="BIPW",
    len_list=len_list,
    n_actions=n_actions,
    fit_random_state=random_state,
    base_model=RandomForestClassifier(random_state=random_state),
)

bipw.fit(
    action=train_bandit_data["action"],
    context=train_bandit_data["context"],
    action_dist=action_dist_ipw_train,
    evaluation_policy_sample_method="raw",
    position=train_bandit_data["position"],
)

pv = bipw.estimate_policy_value(
    reward=validation_bandit_data["reward"],
    action=validation_bandit_data["action"],
    position=validation_bandit_data["position"],
    action_dist=action_dist_ipw_val,
    context=validation_bandit_data["context"],
)

bipw.fit(
    action=train_bandit_data["action"],
    context=train_bandit_data["context"],
    action_dist=action_dist_ipw_train,
    evaluation_policy_sample_method="weighted_loss",
    position=train_bandit_data["position"],
)

pv_wl = bipw.estimate_policy_value(
    reward=validation_bandit_data["reward"],
    action=validation_bandit_data["action"],
    position=validation_bandit_data["position"],
    action_dist=action_dist_ipw_val,
    context=validation_bandit_data["context"],
)

bipw.fit(
    action=train_bandit_data["action"],
    context=train_bandit_data["context"],
    action_dist=action_dist_ipw_train,
    evaluation_policy_sample_method="sample",
    position=train_bandit_data["position"],
)

pv_sample = bipw.estimate_policy_value(
    reward=validation_bandit_data["reward"],
    action=validation_bandit_data["action"],
    position=validation_bandit_data["position"],
    action_dist=action_dist_ipw_val,
    context=validation_bandit_data["context"],
)

ope = OffPolicyEvaluation(
    bandit_feedback=validation_bandit_data,
    ope_estimators=[
        IPS(estimator_name="IPS"),
        DM(estimator_name="DM"),
        IPS(lambda_=5, estimator_name="CIPS"),
        SNIPS(estimator_name="SNIPS"),
        DR(estimator_name="DR"),
        DRos(lambda_=500, estimator_name="DRos"),
    ],
)


squared_errors = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=policy_value_of_ipw,  # V(\pi_e)
    action_dist=action_dist_ipw_val,  # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards,  # \hat{q}(x,a)
    metric="se",  # squared error
)

ope_res = ope.summarize_off_policy_estimates(
    action_dist=action_dist_ipw_val,  # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards,  # \hat{q}(x,a)
)

print(squared_errors)
print(ope_res)
print(policy_value_of_ipw)
print(pv, pv_sample, pv_wl)


fullflu commented Nov 28, 2021

I removed the weighted_loss option because the classifier trained with weighted_loss is not well trained (in terms of the ROC-AUC and calibration plot in eval_result).

@fullflu fullflu changed the title from "WIP Feature: Balanced-OPE estimators" to "[Review] Feature: Balanced-OPE estimators" on Nov 28, 2021

usaito commented Dec 7, 2021

@fullflu Thanks again for the great work!

The most important point I want to mention here is the definition of B-IPW.
In my understanding, we can simply use the importance weights estimated by the classification model, and we do not have to (re-)take the expectation over the evaluation policy inside the estimator (as I suggested above).
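
To state that concretely, my reading of this point (and of the reference above) is the following sketch: given classifier-based importance weights for the logged (context, action) pairs, the self-normalized B-IPW estimate uses the logged rewards directly, with no further expectation over the evaluation policy; function and argument names are illustrative.

import numpy as np

def balanced_snips_estimate(reward, estimated_importance_weights):
    # Self-normalized B-IPW: sum_i w_hat_i * r_i / sum_i w_hat_i, where w_hat_i is
    # the classifier-based weight for the logged (x_i, a_i) pair. No additional
    # expectation over the evaluation policy is taken inside the estimator.
    w = np.asarray(estimated_importance_weights, dtype=float)
    r = np.asarray(reward, dtype=float)
    return float(np.sum(w * r) / np.sum(w))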

fullflu and others added 7 commits December 28, 2021 20:05

Co-authored-by: yuta-saito <32621707+usaito@users.noreply.github.com>
fullflu and others added 3 commits January 11, 2022 23:56

apply 2nd review

Co-authored-by: yuta-saito <32621707+usaito@users.noreply.github.com>
@usaito usaito merged commit 76a11b7 into st-tech:master Jan 12, 2022