[Review] Feature: Balanced-OPE estimators #146
Conversation
Observation
The initial implementation of a self-normalized balanced IPW estimator tells us the following fact (I will commit a notebook later):
Discussion
Direction
OPE script (stochastic policy)
@fullflu Thanks again for the great work! The most important point I want to mention here is the definition of B-IPW.
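For readers of the thread, here is one plausible form of the (self-normalized) B-IPW estimator in the spirit of the referenced paper (https://arxiv.org/abs/1906.03694); the exact definition agreed on in this PR may differ. Here $\hat{C}(x, a)$ denotes a classifier's estimate of the probability that the pair $(x, a)$ was generated under the evaluation policy rather than the logging policy:

$$
\hat{w}(x, a) = \frac{\hat{C}(x, a)}{1 - \hat{C}(x, a)}, \qquad
\hat{V}_{\mathrm{B\text{-}IPW}}(\pi_e; \mathcal{D}) = \frac{\sum_{i=1}^{n} \hat{w}(x_i, a_i)\, r_i}{\sum_{i=1}^{n} \hat{w}(x_i, a_i)}.
$$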
Co-authored-by: yuta-saito <32621707+usaito@users.noreply.github.com>
…importance weight estimator
apply 2nd review
Co-authored-by: yuta-saito <32621707+usaito@users.noreply.github.com>
Overview
Tasks
- Add `estimated_pscore` to OPE estimators
- Add `importance_sampling_ratio` to OPE estimators
- Add `evaluation_policy_sample_method`
- Evaluate using various types of classifiers and hyperparameters

Details
Self-normalized Balanced IPS
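Before the action-feature details below, a minimal sketch of how a self-normalized balanced IPW/IPS value estimate can be computed once importance weights have been estimated by a classifier; the function and array names are assumptions for illustration, not the PR's actual interface:

```python
import numpy as np

def self_normalized_balanced_ipw(reward: np.ndarray, estimated_weights: np.ndarray) -> float:
    """Self-normalized estimate: sum(w_hat * r) / sum(w_hat).

    reward: observed rewards for the logged rounds, shape (n_rounds,).
    estimated_weights: classifier-based estimates of pi_e(a|x) / pi_b(a|x)
        for the logged (context, action) pairs, shape (n_rounds,).
    """
    return float(np.sum(estimated_weights * reward) / np.sum(estimated_weights))

# usage with toy data
rng = np.random.default_rng(0)
reward = rng.binomial(1, 0.3, size=1000)
weights = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
print(self_normalized_balanced_ipw(reward, weights))
```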
It is not trivial how to handle the action features of a stochastic evaluation policy. I implement three types of methods, selected by `evaluation_policy_sample_method` (a sketch of the `weighted_loss` construction follows this list):
1. `raw`: `action_dist_at_position` values are directly encoded as action features.
2. `sample`: actions are sampled.
3. `weighted_loss`: for each round, `n_actions` rows are duplicated. For each duplicated row, the action features are the one-hot encoding of each action. Classification models are trained with `sample_weight`, where `sample_weight` is the probability that the corresponding action is sampled (`action_dist_at_position[:, action_idx]`).
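A minimal sketch of how the `weighted_loss` construction could look, assuming NumPy arrays and a scikit-learn style classifier; the helper name and variables are illustrative, not the PR's actual code:

```python
import numpy as np

def build_weighted_loss_rows(context, action_dist_at_position):
    """Duplicate each round n_actions times and attach one-hot action features.

    context: (n_rounds, dim_context) array of context features.
    action_dist_at_position: (n_rounds, n_actions) evaluation-policy action
        probabilities; flattened to give the per-row sample weights.
    """
    n_rounds, n_actions = action_dist_at_position.shape
    # each round is repeated n_actions times (rows for action 0, 1, ..., n_actions - 1)
    repeated_context = np.repeat(context, n_actions, axis=0)
    # one-hot action features for every duplicated row
    one_hot_actions = np.tile(np.eye(n_actions), (n_rounds, 1))
    features = np.concatenate([repeated_context, one_hot_actions], axis=1)
    # sample_weight[i] = probability that the corresponding action is sampled
    sample_weight = action_dist_at_position.reshape(-1)
    return features, sample_weight

# usage: combine these rows (label 1, evaluation policy) with the logged
# (context, one-hot action) rows (label 0, behavior policy), then fit, e.g.,
# sklearn.linear_model.LogisticRegression().fit(X, y, sample_weight=w)
# to estimate the importance weights.
```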
Reference
https://arxiv.org/abs/1906.03694