Question - Off Policy Eval Without Propensities #44

Open

AllardJM opened this issue Aug 19, 2024 · 6 comments

@AllardJM

Thank you for this useful repo! I have a question. Let's say there is logged data you want to use to train and evaluate a new policy. The logged data is something like <user and context features, action, reward>, where the context features describe the user and the context of an internet application, the action is the variant that was served to the user, and the reward is a click or no-click indicator. There is no propensity logged for the action, because the data wasn't logged by a bandit but by a model that always serves the "best" variant, or we simply don't know why the variant was served; it just was.

Is there a way to compare the performance of the existing policy to a new one, to see whether the new one is worth A/B testing? Would you use Inverse Propensity Scoring (IPS), the Direct Method (DM), or Doubly Robust (DR) estimation and just set the propensity to 1? Is there a better way?

@mrucker
Collaborator

mrucker commented Aug 22, 2024

It's always great to hear from someone using coba.

In your case I'd use direct method (DM). DM doesn't need propensity scores. You can see how accurate DM evaluation is here.

The first figure in that link shows the evaluation performance of IPS, DR, and DM compared to the ground-truth performance (GT) of a learner. DM deviates the most from GT (which makes sense because it has less information to work with), but even so it is still within ~0.01 of GT. These lines are averages taken over 236 real-world datasets. You can see exactly how much DM deviated from GT on every individual dataset in the second figure in section 4. There you can see that DM can actually be off from ground truth by up to 0.2 (where rewards are either 0 or 1).

So, I'd use this evaluator in your experiment to get DM: SequentialCB(learn='off', eval='dm'). If you have access to the baseline policy that you want to compare against, you could even implement it as a coba learner and pass it through DM evaluation as well. That would give you a fairer comparison, since both learners would then be using an equally biased reward estimate.
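To make that concrete, here is a rough sketch of what such an experiment might look like. Only SequentialCB(learn='off', eval='dm') comes from this comment; the environment construction, the Experiment signature, and the learner/plotting names are assumptions to check against your coba version.

    import coba as cb

    # Assumption: `logged_env` is a coba environment built from your logged
    # <context, action, reward> triples; how to construct it depends on your
    # coba version and data format.
    logged_env = ...

    new_policy = cb.VowpalEpsilonLearner()  # the candidate policy to evaluate
    baseline   = ...                        # optionally, the existing policy wrapped as a coba learner

    # Off-policy learning with Direct Method reward estimates (the evaluator named above).
    evaluator = cb.SequentialCB(learn='off', eval='dm')

    result = cb.Experiment(logged_env, [new_policy, baseline], evaluator).run()
    result.plot_learners()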

Hopefully that helps. I'm happy to answer more questions or talk through the specifics of your use case more.

@mrucker
Collaborator

mrucker commented Aug 22, 2024

Oh, one more thing. I know you didn't ask, but nowadays those of us using coba have mostly moved to neural-network learners with SquareCB exploration. If you have enough logged data, we've seen huge improvements in CB policy performance with this approach. I'm happy to share a coba notebook introducing some of its basic concepts. If you already have coba experiments running with your logged dataset, it's a near drop-in replacement to test it out.
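For orientation only (not part of the original comment): swapping exploration strategies in coba is typically just a matter of changing the learner passed to the experiment. VW's --squarecb exploration option is real, but the wrapper name used below is an assumption to verify against your coba version or the notebook mentioned above.

    import coba as cb

    # Assumption: coba exposes a VW SquareCB wrapper under this name; if not,
    # the idea is a generic VW learner configured with the --squarecb flag.
    squarecb = cb.VowpalSquarecbLearner()

    # It would then slot into the same experiment as any other learner, e.g.:
    # cb.Experiment(logged_env, [new_policy, squarecb], evaluator).run()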

@AllardJM
Author

In your case I'd use direct method (DM). DM doesn't need propensity scores.

Thank you for the suggestion. My understanding is that the DM method works like this:

  • Take the logged / historical data and fit an ML model to predict the reward r (e.g. 0 or 1 for clicks). This gives r-hat(x, a), where x is the context and a is the action.
  • The new policy outputs a probability of choosing each arm for each context (like the VW examples, where the model scores are turned into a PMF used to select the action), and these probabilities are used to weight the predicted reward of each possible action for the logged context.
  • The probability-weighted predicted reward for each row of the logged data is averaged and compared directly to the average of the logged rewards.

Does that sound right?
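To make that concrete, here is a minimal numpy sketch of the estimate described in the bullets above (illustrative names only, not coba API):

    import numpy as np

    # Assumes two hypothetical callables:
    #   reward_model(x, a) -> r_hat, fit on the logged <context, action, reward> data
    #   new_policy(x)      -> array of action probabilities (a PMF over the arms)
    def dm_value(logged_contexts, arms, reward_model, new_policy):
        row_values = []
        for x in logged_contexts:
            probs  = np.asarray(new_policy(x))                     # pi(a|x) for each arm
            r_hats = np.array([reward_model(x, a) for a in arms])  # r_hat(x, a) for each arm
            row_values.append(probs @ r_hats)                      # expected reward under the new policy
        return np.mean(row_values)

    # Compare dm_value(...) to np.mean(logged_rewards), the logged policy's average reward.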

@AllardJM
Author

Oh, one more thing. I know you didn't ask, but nowadays those of us using coba have mostly moved to neural-network learners with SquareCB exploration. If you have enough logged data, we've seen huge improvements in CB policy performance with this approach. I'm happy to share a coba notebook introducing some of its basic concepts. If you already have coba experiments running with your logged dataset, it's a near drop-in replacement to test it out.

Oh, I would LOVE to review this! I am extremely interested in CB methods, trying to understand more about the methodologies, and hoping to apply them in my work.

@AllardJM
Author

@mrucker Hey Mark - just checking in to see if you had the examples using SquareCB? I coded an example by hand to see how it worked but didn't find it that performant. It probably needs some tweaks.

@rezaqorbani

rezaqorbani commented Sep 27, 2024

Hello! @mrucker I am also interested in the coba notebook you mentioned. I would really appreciate it if you could share it!
