Interaction Grounded Learning
Note: this reduction is experimental and is subject to change
Paper: https://arxiv.org/pdf/2106.04887.pdf
VW's learning algorithm attempts to minimize loss, and the contextual bandit input format specifically calls for a cost. However, in reinforcement learning and contextual bandit settings it is common to label a data point with a reward, which the agent wishes to maximize. Accidentally supplying a reward in place of a cost in a contextual bandit label will result in incorrect learning, since minimizing that value is the opposite of what is intended.
This reduction tracks incoming labels and determines whether they are rewards or costs. Note that positive values are assumed to be rewards and negative values costs; if your dataset is labelled such that positive values are still costs used to penalize the learner, this automatic translation will not work.
Internally it trains two models, one for rewards and one for costs. More specifically, it trains one model with the labels as-is and the other with the negated values. When making a prediction, the model to use is selected based on whether more rewards or costs have been seen so far.
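The bookkeeping described above can be sketched in plain Python. This is an illustrative sketch only, not VW's actual implementation: the class name, the squared-loss SGD updates, and the fixed learning rate are all assumptions made for the example. It shows the two views of each label (as-is and negated) being trained side by side, with prediction delegated to the view matching the majority label sign observed so far.

```python
# Illustrative sketch (not VW's implementation) of the IGL reduction's
# two-model bookkeeping: one model is trained on labels as given (the
# cost view), one on negated labels (the reward view); prediction uses
# whichever sign -- reward or cost -- has been seen more often.

class IGLSketch:
    def __init__(self, n_features):
        self.w_cost = [0.0] * n_features    # trained on labels as-is
        self.w_reward = [0.0] * n_features  # trained on negated labels
        self.rewards_seen = 0               # count of positive labels
        self.costs_seen = 0                 # count of negative labels
        self.lr = 0.1                       # learning rate (arbitrary choice)

    def learn(self, x, label):
        if label > 0:
            self.rewards_seen += 1
        elif label < 0:
            self.costs_seen += 1
        # squared-loss SGD update for both views of the label
        for w, y in ((self.w_cost, label), (self.w_reward, -label)):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            grad = pred - y
            for i, xi in enumerate(x):
                w[i] -= self.lr * grad * xi

    def predict(self, x):
        # select the model matching the majority label sign seen so far;
        # if labels are rewards, the negated (cost-like) model is used
        w = self.w_reward if self.rewards_seen > self.costs_seen else self.w_cost
        return sum(wi * xi for wi, xi in zip(w, x))
```

After feeding positive (reward) labels, the negated-label model is selected, so the value being minimized behaves like a cost again.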
To enable this reduction, use the option `--igl`. The reduction expects contextual bandit data as input and produces action scores.
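As a hedged illustration (feature names are made up for the example), a contextual bandit dataset labelled with rewards rather than costs might look like the following, using the standard `action:value:probability` label format, where the positive values would be detected as rewards:

```text
1:1.0:0.5 | feature_a feature_b
2:0.0:0.5 | feature_a feature_c
```

With an ordinary cost-labelled dataset, the same lines would carry negative or zero values in the middle field, and the reduction would leave the labels effectively unchanged.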
One important caveat: VW's loss calculation is not aware that this reduction dynamically selects between two models, so if rewards are being used the reported loss will be incorrect.