
[Review] Feature: Balanced-OPE estimators #146

Merged: 20 commits merged into st-tech:master on Jan 12, 2022

Conversation


@fullflu fullflu commented Nov 18, 2021

Overview

  • Add Balanced OPE (B-OPE) estimators

Tasks

  • initial implementation of a self-normalized balanced IPS estimator (to check the behavior of my naive implementation and consider the direction of the final implementation)
  • add cross fitting
  • fix action sampling procedure
  • add pscore estimator
  • add estimated_pscore to ope estimators
  • add importance_sampling_ratio to ope estimators
  • fix meta.py
  • add calibration
  • evaluate evaluation_policy_sample_method using various types of classifiers and hyperparameters
  • create evaluation notebook
  • request a 1st review and reflect the response
  • add balanced dr estimators and report the behavior of several B-OPE estimators
  • fix existing tests
  • add new tests
  • request a 2nd review and reflect the response

Details

Self-normalized Balanced IPS

It is not trivial how to handle the action features of a stochastic evaluation policy. I implemented three methods, selected via evaluation_policy_sample_method (a sketch of the classifier inputs each option builds follows this list).

  1. raw: action_dist_at_position is directly encoded as the action features.
  2. sample: actions are sampled from the evaluation policy.
  3. weighted_loss: for each round, n_actions rows are duplicated. For each duplicated row, the action features are the one-hot encoding of the corresponding action. The classification model is trained with sample_weight, where sample_weight is the probability that the corresponding action is sampled (action_dist_at_position[:, action_idx]).
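
For concreteness, here is a rough sketch (not the actual implementation) of the classifier inputs each option builds, assuming len_list=1 so that action_dist_at_position has shape (n_rounds, n_actions); function and variable names are illustrative only. Logged rows get label 0 and evaluation-policy rows get label 1.

import numpy as np

def build_classifier_inputs(context, action, action_dist, n_actions, method, rng):
    # Returns (X, y, sample_weight) for the behavior-vs-evaluation-policy classifier.
    # Label 0 = factual (logged) rows, label 1 = evaluation-policy rows.
    n_rounds = context.shape[0]
    one_hot_factual = np.eye(n_actions)[action]  # features in {0, 1}
    X_factual = np.concatenate([context, one_hot_factual], axis=1)

    if method == "raw":
        # use the evaluation policy's action distribution directly as action features
        X_eval = np.concatenate([context, action_dist], axis=1)
        w_eval = np.ones(n_rounds)
    elif method == "sample":
        # sample one action per round from the evaluation policy and one-hot encode it
        sampled = np.array(
            [rng.choice(n_actions, p=action_dist[i]) for i in range(n_rounds)]
        )
        X_eval = np.concatenate([context, np.eye(n_actions)[sampled]], axis=1)
        w_eval = np.ones(n_rounds)
    else:  # "weighted_loss"
        # duplicate each round n_actions times, one-hot encode every action, and
        # weight each duplicated row by its evaluation-policy probability
        X_eval = np.concatenate(
            [np.repeat(context, n_actions, axis=0), np.tile(np.eye(n_actions), (n_rounds, 1))],
            axis=1,
        )
        w_eval = action_dist.flatten()

    X = np.concatenate([X_factual, X_eval], axis=0)
    y = np.concatenate([np.zeros(n_rounds), np.ones(X_eval.shape[0])])
    sample_weight = np.concatenate([np.ones(n_rounds), w_eval])
    return X, y, sample_weight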

Reference

https://arxiv.org/abs/1906.03694


fullflu commented Nov 18, 2021

Observation

The initial implementation of a self-normalized balanced IPS estimator suggests the following (I will commit a notebook later):

  1. When the evaluation policy is deterministic, all three evaluation_policy_sample_method options work as well as existing estimators such as IPS and DR (raw and sample produce the same output).
  2. When the evaluation policy is stochastic, raw does not work well (inf) and sample leads to poor performance. However, weighted_loss has performance comparable to the existing estimators.

Policy value estimation of a deterministic evaluation policy (IPWLearner)
- Ground truth: 0.6205135606005601
- raw: 0.633716233203124
- sample: 0.633716233203124
- weighted_loss: 0.6312631775591353
- IPS: 0.703706
- DR: 0.604306

Policy value estimation of a stochastic evaluation policy (IPWLearner)
- Ground truth: 0.5025620586676971
- raw: nan
- sample: 0.5894081429941057
- weighted_loss: 0.506395941886161
- IPS: 0.509074
- DR: 0.506203

Discussion

  • raw likely fails because the domain of the action features of the factual rows (\in {0, 1}) differs from that of the stochastic evaluation policy (\in [0, 1]). In fact, when we use RandomForestClassifier, the predicted values for the factual rows are all zero and those for the counterfactual rows are all one (see the toy illustration below).
  • sample likely performs poorly because the sampling procedure discards information about the evaluation policy distribution.
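
A toy illustration of the first point (assumed setup, not the notebook code): with the raw encoding, the factual action features are one-hot vectors in {0, 1} while the evaluation-policy features are soft probabilities, so a tree-based classifier can separate the two classes almost perfectly, its predicted probabilities collapse toward 0/1, and the implied odds-based weights degenerate (consistent with the nan/inf behavior above).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(12345)
n_rounds, n_actions = 1000, 10

# factual rows: one-hot encoded logged actions (features in {0, 1})
factual = np.eye(n_actions)[rng.integers(n_actions, size=n_rounds)]
# evaluation-policy rows under "raw": soft action probabilities (features in (0, 1))
logits = rng.normal(size=(n_rounds, n_actions))
soft = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

X = np.concatenate([factual, soft], axis=0)
y = np.concatenate([np.zeros(n_rounds), np.ones(n_rounds)])
clf = RandomForestClassifier(random_state=12345).fit(X, y)

# predicted probability of the "evaluation policy" class for the factual rows;
# these are typically (near) zero, so the odds p / (1 - p) degenerate
p_factual = clf.predict_proba(factual)[:, 1]
print(p_factual.min(), p_factual.max())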

Direction

  • When the evaluation policy is stochastic, weighted_loss should be used.
  • When the evaluation policy is deterministic, raw can be used.
  • sample seems to be unnecessary.
  • We should implement a ClassificationModel class in classification_model.py and simplify estimators.py (a rough interface sketch follows).
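
As a starting point for that refactoring, here is a rough sketch of what a ClassificationModel-style interface could look like, mirroring RegressionModel.fit_predict with cross-fitting; the class and argument names are hypothetical and do not correspond to the merged obp code.

import numpy as np
from dataclasses import dataclass
from sklearn.base import ClassifierMixin, clone
from sklearn.model_selection import KFold

@dataclass
class ImportanceWeightClassifier:
    # Hypothetical sketch: a classifier that distinguishes logged (behavior-policy)
    # rows from evaluation-policy rows and returns importance weights via its odds.
    base_model: ClassifierMixin
    n_actions: int

    def fit_predict(self, context, action, action_dist, n_folds=2, random_state=None):
        n_rounds = context.shape[0]
        X_factual = np.concatenate([context, np.eye(self.n_actions)[action]], axis=1)
        X_eval = np.concatenate([context, action_dist], axis=1)  # "raw" encoding
        X = np.concatenate([X_factual, X_eval], axis=0)
        y = np.concatenate([np.zeros(n_rounds), np.ones(n_rounds)])

        weights = np.zeros(n_rounds)
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
        for train_idx, test_idx in kf.split(np.arange(n_rounds)):
            model = clone(self.base_model)
            idx = np.concatenate([train_idx, train_idx + n_rounds])  # both classes
            model.fit(X[idx], y[idx])
            p = np.clip(model.predict_proba(X_factual[test_idx])[:, 1], 1e-6, 1 - 1e-6)
            weights[test_idx] = p / (1.0 - p)  # odds as the estimated pi_e / pi_b ratio
        return weights

The weights returned here would then feed into a balanced IPS-style estimate such as the one sketched later in this thread.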


fullflu commented Nov 18, 2021

OPE script (stochastic policy)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# import open bandit pipeline (obp)
from obp.dataset import (
    SyntheticBanditDataset,
    logistic_reward_function,
    linear_behavior_policy,
)

from obp.policy import IPWLearner
from obp.ope import (
    OffPolicyEvaluation,
    RegressionModel,
    InverseProbabilityWeighting as IPS,
    SelfNormalizedInverseProbabilityWeighting as SNIPS,
    DirectMethod as DM,
    DoublyRobust as DR,
    DoublyRobustWithShrinkage as DRos,
    BalancedInverseProbabilityWeighting as BIPW,
)


# define a synthetic bandit dataset
n_actions = 10
dim_context = 8
len_list = 1
random_state = 12345
dataset = SyntheticBanditDataset(
    n_actions=n_actions,
    dim_context=dim_context,
    beta=0.2,
    reward_function=logistic_reward_function,
    behavior_policy_function=linear_behavior_policy,
    random_state=random_state,
)

# training data is used to train an evaluation policy
train_bandit_data = dataset.obtain_batch_bandit_feedback(n_rounds=5000)

# test bandit data is used to approximate the ground-truth policy value
test_bandit_data = dataset.obtain_batch_bandit_feedback(n_rounds=100000)

# evaluation policy training
ipw_learner = IPWLearner(
    n_actions=dataset.n_actions,
    base_classifier=RandomForestClassifier(random_state=random_state),
)
ipw_learner.fit(
    context=train_bandit_data["context"],
    action=train_bandit_data["action"],
    reward=train_bandit_data["reward"],
    pscore=train_bandit_data["pscore"],
)
action_dist_ipw_train = ipw_learner.predict_proba(
    context=train_bandit_data["context"],
)
action_dist_ipw_test = ipw_learner.predict_proba(
    context=test_bandit_data["context"],
)
policy_value_of_ipw = dataset.calc_ground_truth_policy_value(
    expected_reward=test_bandit_data["expected_reward"],
    action_dist=action_dist_ipw_test,
)

num_data = 1000

validation_bandit_data = dataset.obtain_batch_bandit_feedback(n_rounds=num_data)

# make decisions on validation data
action_dist_ipw_val = ipw_learner.predict_proba(
    context=validation_bandit_data["context"],
)

# OPE using validation data
regression_model = RegressionModel(
    n_actions=dataset.n_actions,
    base_model=LogisticRegression(C=100, max_iter=10000, random_state=random_state),
)
estimated_rewards = regression_model.fit_predict(
    context=validation_bandit_data["context"],  # context; x
    action=validation_bandit_data["action"],  # action; a
    reward=validation_bandit_data["reward"],  # reward; r
    n_folds=2,  # 2-fold cross fitting
    random_state=12345,
)

bipw = BIPW(
    estimator_name="BIPW",
    len_list=len_list,
    n_actions=n_actions,
    fit_random_state=random_state,
    base_model=RandomForestClassifier(random_state=random_state),
)

bipw.fit(
    action=train_bandit_data["action"],
    context=train_bandit_data["context"],
    action_dist=action_dist_ipw_train,
    evaluation_policy_sample_method="raw",
    position=train_bandit_data["position"],
)

pv = bipw.estimate_policy_value(
    reward=validation_bandit_data["reward"],
    action=validation_bandit_data["action"],
    position=validation_bandit_data["position"],
    action_dist=action_dist_ipw_val,
    context=validation_bandit_data["context"],
)

bipw.fit(
    action=train_bandit_data["action"],
    context=train_bandit_data["context"],
    action_dist=action_dist_ipw_train,
    evaluation_policy_sample_method="weighted_loss",
    position=train_bandit_data["position"],
)

pv_wl = bipw.estimate_policy_value(
    reward=validation_bandit_data["reward"],
    action=validation_bandit_data["action"],
    position=validation_bandit_data["position"],
    action_dist=action_dist_ipw_val,
    context=validation_bandit_data["context"],
)

bipw.fit(
    action=train_bandit_data["action"],
    context=train_bandit_data["context"],
    action_dist=action_dist_ipw_train,
    evaluation_policy_sample_method="sample",
    position=train_bandit_data["position"],
)

pv_sample = bipw.estimate_policy_value(
    reward=validation_bandit_data["reward"],
    action=validation_bandit_data["action"],
    position=validation_bandit_data["position"],
    action_dist=action_dist_ipw_val,
    context=validation_bandit_data["context"],
)

ope = OffPolicyEvaluation(
    bandit_feedback=validation_bandit_data,
    ope_estimators=[
        IPS(estimator_name="IPS"),
        DM(estimator_name="DM"),
        IPS(lambda_=5, estimator_name="CIPS"),
        SNIPS(estimator_name="SNIPS"),
        DR(estimator_name="DR"),
        DRos(lambda_=500, estimator_name="DRos"),
    ],
)


squared_errors = ope.evaluate_performance_of_estimators(
    ground_truth_policy_value=policy_value_of_ipw,  # V(\pi_e)
    action_dist=action_dist_ipw_val,  # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards,  # \hat{q}(x,a)
    metric="se",  # squared error
)

ope_res = ope.summarize_off_policy_estimates(
    action_dist=action_dist_ipw_val,  # \pi_e(a|x)
    estimated_rewards_by_reg_model=estimated_rewards,  # \hat{q}(x,a)
)

print(squared_errors)
print(ope_res)
print(policy_value_of_ipw)
print(pv, pv_sample, pv_wl)


fullflu commented Nov 28, 2021

I removed the weighted_loss option because the classifier trained with weighted_loss is not well trained (in terms of the ROC-AUC and calibration plot in eval_result).

@fullflu fullflu changed the title from "WIP Feature: Balanced-OPE estimators" to "[Review] Feature: Balanced-OPE estimators" on Nov 28, 2021

usaito commented Dec 7, 2021

@fullflu Thanks again for the great work!

The most important point I want to mention here is the definition of B-IPW.
In my understanding, we can simply use the importance weights estimated by the classification model, and we do not have to (re-)take the expectation over the evaluation policy inside the estimator (as I suggested above).
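
To state that concretely, my reading of this point (and of the reference above) is the following sketch: given classifier-based importance weights for the logged (context, action) pairs, the self-normalized B-IPW estimate uses the logged rewards directly, with no further expectation over the evaluation policy; function and argument names are illustrative.

import numpy as np

def balanced_snips_estimate(reward, estimated_importance_weights):
    # Self-normalized B-IPW: sum_i w_hat_i * r_i / sum_i w_hat_i, where w_hat_i is
    # the classifier-based weight for the logged (x_i, a_i) pair. No additional
    # expectation over the evaluation policy is taken inside the estimator.
    w = np.asarray(estimated_importance_weights, dtype=float)
    r = np.asarray(reward, dtype=float)
    return float(np.sum(w * r) / np.sum(w))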

fullflu and others added 7 commits December 28, 2021 20:05

Co-authored-by: yuta-saito <32621707+usaito@users.noreply.github.com>
fullflu and others added 3 commits January 11, 2022 23:56

apply 2nd review

Co-authored-by: yuta-saito <32621707+usaito@users.noreply.github.com>
@usaito usaito merged commit 76a11b7 into st-tech:master Jan 12, 2022