Naive Bayes is confounded by variable reuse. #2943
Comments
Variable reuse can be problematic in unexpected places and should be carefully (re)considered. But until something changes on that front (which might take some time), we could/should change naive Bayes to use only the values found in the training data rather than all variable values (as all other learners do?).
+1 for the NB solution. Is it true that variable reuse is needed only by the File widget, and that all other uses may be problematic? We removed the option for reuse because users didn't understand it. What if we re-add it, but with a clear, long explanation (perhaps visible only when relevant)? Something like "This data seems to have the same variables as the data currently loaded by another data widget," followed by a checkbox to confirm. This setting would appear/disappear and be synchronized across all simultaneously open File widgets. How ugly is that?
Fixed in #3575
Orange version
3.11.dev, 3.10, 3.9, 3.8 ...
Expected behavior
Learner performance evaluation can be run sequentially in the same session on different datasets with predictable, deterministic results (notwithstanding explicit sources of randomness).
Actual behavior
Cannot run Test & Score on anneal.tab and glass.tab (both included in the distribution) sequentially without one affecting the other.
This is a by-product of variable reuse.
Also related: gh-2500
Steps to reproduce the behavior
In a fresh session, load anneal.tab in the File widget. Observe and record the reported AUC for Naive Bayes (10-fold cross-validation, stratified, averaged over all classes); reload as many times as necessary to establish that the AUC is stable: 0.979.
Load glass.tab and then anneal.tab again. The AUC for Naive Bayes is now different: 0.844.
Presumably the problem is in variable reuse. The first time anneal.tab is loaded, y has the values 1, 2, 3, 5, U; the second time it has 1, 2, 3, 5, U, 6, 7 (the values from glass.tab#y are merged in).
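To illustrate the merging described above, here is a minimal sketch of a name-keyed variable cache. It is not Orange's actual code: `VariableRegistry` and `make` are hypothetical names, and the glass.tab class values are assumed for illustration.

```python
class VariableRegistry:
    """Hypothetical name-keyed cache that reuses discrete variables."""

    def __init__(self):
        self._values = {}  # variable name -> accumulated list of values

    def make(self, name, values):
        # Reuse the existing entry for this name and append unseen values.
        known = self._values.setdefault(name, [])
        for v in values:
            if v not in known:
                known.append(v)
        return list(known)


registry = VariableRegistry()
anneal_y = ["1", "2", "3", "5", "U"]
glass_y = ["1", "2", "3", "5", "6", "7"]  # assumed for illustration

first_load = registry.make("y", anneal_y)
registry.make("y", glass_y)
second_load = registry.make("y", anneal_y)

print(first_load)   # values seen on the first load of anneal.tab
print(second_load)  # same file, but the domain now carries glass.tab's values
```

The same file yields a different `y` domain depending on what was loaded earlier in the session.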
This is not a problem in general for skl-learners that use LabelEncoder to encode targets (and so ignore values not actually present in the target vector), but NaiveBayes uses the full y domain for Laplace estimation of class probabilities.
In Orange 2, at least, there was an option to disable variable reuse. Nevertheless, the default should be no reuse.
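A rough sketch of why the size of the y domain matters for Laplace estimation. This is not Orange's NaiveBayes implementation: `laplace_class_priors` is a hypothetical helper and the target vector is made-up toy data.

```python
from collections import Counter


def laplace_class_priors(y, domain_values):
    """Laplace-smoothed class probabilities over the given class values."""
    counts = Counter(y)
    n = len(y)
    k = len(domain_values)
    # Add-one smoothing: every value in the domain gets one pseudo-count.
    return {v: (counts[v] + 1) / (n + k) for v in domain_values}


# Toy target vector (made up): only classes 1, 2 and 3 occur.
y = ["1"] * 2 + ["2"] * 5 + ["3"] * 3

observed = ["1", "2", "3"]      # values present in the training data
merged = observed + ["6", "7"]  # domain polluted by another dataset

priors_observed = laplace_class_priors(y, observed)
priors_merged = laplace_class_priors(y, merged)

print(priors_observed)
print(priors_merged)
```

With the merged domain, the classes that never occur still receive non-zero probability, and the estimates for the real classes shrink accordingly. Restricting the estimate to values found in the training data avoids this.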
Additional info (worksheets, data, screenshots, ...)