Question - Off Policy Eval Without Propensities #44

Open

AllardJM opened this issue Aug 19, 2024 · 6 comments

@AllardJM

Thank you for this useful repo! I have a question. Let's say there is logged data you want to use to train and evaluate a new policy. The logged data is something like <user and context features, action, reward>, where the context features describe the user and the context of an internet application, the action is the variant that was served to the user, and the reward is a click or no-click indicator. There is no propensity logged for the action, because the data wasn't logged by a bandit but by a model that always serves the "best" variant, or we simply don't know why the variant was served; it just was.

Is there a way to compare the performance of the existing policy to a new one, to see whether the new one is worth A/B testing? Would you use Inverse Propensity Scoring (IPS), the Direct Method (DM), or Doubly Robust (DR) estimation and just set the propensity to 1? Is there a better way?

@mrucker
Collaborator

mrucker commented Aug 22, 2024

It's always great to hear from someone using coba.

In your case I'd use direct method (DM). DM doesn't need propensity scores. You can see how accurate DM evaluation is here.

The first figure in that link shows the evaluation performance of IPS, DR, and DM compared to the ground-truth performance (GT) of a learner. DM deviates the most from GT (which makes sense because it has less information to work with), but even so it is still within ~0.01 of GT. These lines are averages taken over 236 real-world datasets. You can see exactly how much DM deviated from GT on every individual dataset in the second figure in section 4. There you can see that DM can actually be off from ground truth by up to 0.2 (where rewards are either 0 or 1).

So, I'd use this evaluator in your experiment to get DM: SequentialCB(learn='off', eval='dm'). If you have access to the baseline policy that you want to compare against, you could even implement it as a coba learner and pass it through DM evaluation as well. That would give you a fairer comparison, since both learners would then be using an equally biased reward estimate.
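To make that concrete, here is a rough sketch of what such an experiment might look like. Only SequentialCB(learn='off', eval='dm') comes from this comment; the environment construction, the Experiment signature, and the learner/plotting names are assumptions to check against your coba version.

    import coba as cb

    # Assumption: `logged_env` is a coba environment built from your logged
    # <context, action, reward> triples; how to construct it depends on your
    # coba version and data format.
    logged_env = ...

    new_policy = cb.VowpalEpsilonLearner()  # the candidate policy to evaluate
    baseline   = ...                        # optionally, the existing policy wrapped as a coba learner

    # Off-policy learning with Direct Method reward estimates (the evaluator named above).
    evaluator = cb.SequentialCB(learn='off', eval='dm')

    result = cb.Experiment(logged_env, [new_policy, baseline], evaluator).run()
    result.plot_learners()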

Hopefully that helps. I'm happy to answer more questions or talk through the specifics of your use case more.

@mrucker
Collaborator

mrucker commented Aug 22, 2024

Oh, one more thing. I know you didn't ask, but nowadays those of us using coba have mostly moved to neural-network learners with SquareCB exploration. If you have enough logged data, we've seen huge improvements in CB policy performance with this approach. I'm happy to share a coba notebook introducing some of its basic concepts. If you already have coba experiments running with your logged dataset, it's a near drop-in replacement to test it out.
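For orientation only (not part of the original comment): swapping exploration strategies in coba is typically just a matter of changing the learner passed to the experiment. VW's --squarecb exploration option is real, but the wrapper name used below is an assumption to verify against your coba version or the notebook mentioned above.

    import coba as cb

    # Assumption: coba exposes a VW SquareCB wrapper under this name; if not,
    # the idea is a generic VW learner configured with the --squarecb flag.
    squarecb = cb.VowpalSquarecbLearner()

    # It would then slot into the same experiment as any other learner, e.g.:
    # cb.Experiment(logged_env, [new_policy, squarecb], evaluator).run()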

@AllardJM
Author

In your case I'd use direct method (DM). DM doesn't need propensity scores.

Thank you for the suggestion. My understanding is that the DM method works like this:

  • Take the logged / historical data and fit an ML model to predict the reward r (e.g. 0 or 1 for clicks). This gives r-hat(x, a), where x is the context and a is the action.
  • The new policy outputs a probability of choosing each arm for each context (like the VW examples, where the model scores are turned into a PMF used to select the action), and these probabilities are used to weight the predicted reward of each possible action for the logged context.
  • The probability-weighted predicted reward for each row of the logged data is averaged and compared directly to the average of the logged rewards.

Does that sound right?
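To make that concrete, here is a minimal numpy sketch of the estimate described in the bullets above (illustrative names only, not coba API):

    import numpy as np

    # Assumes two hypothetical callables:
    #   reward_model(x, a) -> r_hat, fit on the logged <context, action, reward> data
    #   new_policy(x)      -> array of action probabilities (a PMF over the arms)
    def dm_value(logged_contexts, arms, reward_model, new_policy):
        row_values = []
        for x in logged_contexts:
            probs  = np.asarray(new_policy(x))                     # pi(a|x) for each arm
            r_hats = np.array([reward_model(x, a) for a in arms])  # r_hat(x, a) for each arm
            row_values.append(probs @ r_hats)                      # expected reward under the new policy
        return np.mean(row_values)

    # Compare dm_value(...) to np.mean(logged_rewards), the logged policy's average reward.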

@AllardJM
Author

Oh, one more thing. I know you didn't ask, but nowadays those of us using coba have mostly moved to neural-network learners with SquareCB exploration. If you have enough logged data, we've seen huge improvements in CB policy performance with this approach. I'm happy to share a coba notebook introducing some of its basic concepts. If you already have coba experiments running with your logged dataset, it's a near drop-in replacement to test it out.

Oh, I would LOVE to review this! I am extremely interested in CB methods, trying to understand more about the methodologies, and hoping to apply them in my work.

@AllardJM
Author

@mrucker Hey Mark - just checking in to see if you had the examples using SquareCB? I coded an example by hand to see how it worked but didn't find it that performant. It probably needs some tweaks.

@rezaqorbani

rezaqorbani commented Sep 27, 2024

Hello! @mrucker I am also interested in the coba notebook you mentioned. I would really appreciate it if you could share it!
