Question - Off Policy Eval Without Propensities #44
It's always great to hear from someone using coba. In your case I'd use the direct method (DM). DM doesn't need propensity scores.

You can see how accurate DM evaluation is here. The first figure in that link shows the evaluation performance of IPS, DR, and DM compared to a learner's ground-truth (GT) performance. DM deviates the most from GT (which makes sense, since it has less information to work with), but even so it stays within about .01 of GT. Those lines are averages over 236 real-world datasets. The second figure in section 4 shows exactly how far DM deviated from GT on each individual dataset: DM can be off from ground truth by as much as 0.2 (where rewards are either 0 or 1).

So, I'd use this evaluator in your experiment to get DM:

Hopefully that helps. I'm happy to answer more questions or talk through the specifics of your use case.
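For intuition about what a DM evaluator computes, here is a minimal standalone sketch, independent of coba's own evaluator classes (the names below, such as `dm_value` and `n_actions`, are illustrative and not part of coba's API): fit a reward model on the logged (context, action, reward) tuples, then average that model's predicted reward over the actions the candidate policy would choose.

```python
# Minimal Direct Method (DM) sketch -- illustrative only, not coba's implementation.
import numpy as np
from sklearn.linear_model import Ridge

def dm_value(contexts, actions, rewards, policy, n_actions):
    """Estimate the value of `policy` (a function: context -> action index)
    from logged data by training a reward model and scoring the policy's picks."""
    X = np.asarray(contexts, dtype=float)
    a = np.asarray(actions, dtype=int)
    r = np.asarray(rewards, dtype=float)

    # Represent (context, action) pairs as context features plus a one-hot action.
    def featurize(X, a):
        return np.hstack([X, np.eye(n_actions)[a]])

    # Reward model fit on the logged interactions.
    reward_model = Ridge(alpha=1.0).fit(featurize(X, a), r)

    # DM estimate: mean predicted reward of the actions the NEW policy would take.
    new_a = np.array([policy(x) for x in X], dtype=int)
    return reward_model.predict(featurize(X, new_a)).mean()
```

For example, `dm_value(X, a, r, lambda x: 0, n_actions=3)` would estimate the value of the policy that always serves variant 0, using only the reward model and no propensities.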
Oh, one more thing. I know you didn't ask, but nowadays those of us using coba have mostly moved to neural-network learners with SquareCB exploration. If you have enough logged data, we've seen huge improvements in CB policy performance. I'm happy to share a coba notebook introducing some of the basic concepts of this approach. If you already have coba experiments running with your logged dataset, it's a near drop-in replacement to test it out.
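Without the notebook at hand, the following is only a rough sketch of what trying SquareCB in a coba experiment might look like. VW's `--squarecb` exploration flag is real, but the coba wrapper names and arguments below (`Environments.from_openml`, `VowpalLearner`, `Experiment`) are assumptions written from memory and should be checked against the coba docs.

```python
# Rough sketch (assumed coba API; verify against the docs before running).
import coba as cb

# Stand-in environments; in practice you would point this at your own logged data.
envs = cb.Environments.from_openml(data_id=150).shuffle(n=5)

lrns = [
    cb.VowpalEpsilonLearner(),                        # common baseline
    cb.VowpalLearner("--cb_explore_adf --squarecb"),  # SquareCB exploration via VW
]

# Run the comparison and plot average reward per learner.
cb.Experiment(envs, lrns).run().plot_learners()
```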
Thank you for the suggestion. My understanding is that the DM method works like:

Does that sound right?
Oh, I would LOVE to review this! I am extremely interested in CB, trying to understand more about the methodologies, and hoping to apply them in my work.
@mrucker Hey Mark - just checking in to see if you had the examples using SquareCB? I coded an example by hand to see how it worked but didn't find it that performant. It probably needs some tweaks.
Hello! @mrucker I am also interested in the coba notebook you mentioned. I would really appreciate it if you could share it!
Thank you for this useful repo! I have a question: let's say there is logged data you want to use to train and evaluate a new policy. The logged data is something like <user and context features, action, reward>, where the context features describe the user and the context of an internet application, the action is the variant that was served to the user, and the reward is a click/no-click indicator. There is no logged propensity for the action because it wasn't logged by a bandit, but rather by a model where the "best" variant is always served, or we simply don't know why the variant was used; it just was.
Is there a way to compare the performance of the existing policy to a new one, to see if the new one is worth A/B testing? Would you use Inverse Propensity Scoring (IPS), the Direct Method (DM), or Doubly Robust (DR) and set the propensity to 1? Is there a better way?
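To make the "set the propensity to 1" idea concrete, here is a hypothetical sketch (field names are illustrative, not coba's data format) of the logged schema described above and of what the IPS estimator collapses to when the unknown propensities are forced to 1: for a deterministic candidate policy it becomes the average logged reward over the records where that policy happens to agree with the action that was actually served.

```python
# Hypothetical logged-data layout (field names are illustrative only).
logged = [
    # user/context features       served action   click reward
    {"x": [0.2, 1.0, 0.0, 3.5],   "a": 2,         "r": 1},
    {"x": [0.9, 0.0, 1.0, 1.1],   "a": 0,         "r": 0},
]

def ips_with_unit_propensity(logged, policy):
    """IPS is (1/n) * sum( 1[policy(x) == a] * r / p ).  Forcing p = 1 reduces it
    to the average logged reward over the records where the new (deterministic)
    policy agrees with the action that was actually served."""
    n = len(logged)
    return sum(rec["r"] for rec in logged if policy(rec["x"]) == rec["a"]) / n
```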