Naive Bayes is confounded by variable reuse. #2943

ales-erjavec · 2018-03-06T09:00:31Z

Orange version

3.11.dev, 3.10, 3.9, 3.8 ...

Expected behavior

Can run learner performance evaluation sequentially in the same session on different datasets and get predictable deterministic results (not withstanding explicit randomness sources).

Actual behavior

Cannot run Test & Score on anneal.tab and glass.tab (both included in the distribution) sequentially without one affecting the other.

A by product of the Variable reuse.

Also related gh-2500

Steps to reproduce the behavior

In a fresh session load the anneal.tab in the File widget. Observe and record the reported AUC (10 fold cross validation, stratified, averaged over all classes); reload as many times as necessary to establish that the AUC is stable: 0.979.

Load glass.tab and then anneal.tab again. The AUC for Naive Bayes is different: 0.844.

Presumably the problem is in variable reuse. The first time anneal.tab is loaded y has values 1, 2, 3, 5, U the second time 1, 2, 3, 5, U, 6, 7 (the values from glass.tab#y are merged).

This is not a problem in general for skl-learners that use LabelEncoder to encode targets (and ignore the values not actually present in the target vector), but NaiveBayes uses the full y domain for Laplace estimation of class probabilities.

import Orange
anneal = Orange.data.Table("anneal")
l1 = Orange.classification.NaiveBayesLearner()(anneal)
print(l1.class_prob)
_ = Orange.data.Table("glass")
l2 = Orange.classification.NaiveBayesLearner()(anneal)
print(l2.class_prob)

In Orange 2 at least there was on option to disable the variable reuse. Never the less, the default should be no reuse.

Additional info (worksheets, data, screenshots, ...)

lanzagar · 2019-02-01T10:34:32Z

Variable reuse can be problematic in unexpected places and should be carefully (re)considered.

But until something is changed on that front (might take some time?) we could/should change naive bayes to not use all variable values, but just those found in training data (as all other learners?).
Looks like class_freq is obtained with get_contingency so it might be as easy as filtering out class values with 0 representation after that is computed? Might need a fix in Model as well to not crash on unseen values?

janezd · 2019-02-01T11:14:49Z

+1 for solution for NB.

Is it true that variable reuse is needed only by the file widget, and all other uses may be problematic? We removed the option for reuse because users didn't understand it. What if we re-add it, but with a clear long explanation (perhaps visible only when relevant). Something like "This data seems to have the same variables as the data currently loaded by other data widget." And then a checkbox to confirm?

This setting would appear/disappear and be synchronized in all simultaneously opened file widgets. How ugly is that?

janezd · 2019-02-15T11:10:18Z

Fixed in #3575

janezd added bug A bug confirmed by the core team needs discussion Core developers need to discuss the issue labels Jan 24, 2019

janezd assigned BlazZupan, janezd and lanzagar Jan 31, 2019

janezd mentioned this issue Feb 1, 2019

CreateClass should create variables through make #2346

Closed

janezd unassigned BlazZupan, janezd and lanzagar Feb 1, 2019

janezd mentioned this issue Feb 2, 2019

[FIX] Naive Bayes: Ignore existing classes in Laplacian smoothing #3575

Merged

2 tasks

janezd closed this as completed Feb 15, 2019

lanzagar mentioned this issue Mar 29, 2019

Remove Variable.make #3705

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Naive Bayes is confounded by variable reuse. #2943

Naive Bayes is confounded by variable reuse. #2943

ales-erjavec commented Mar 6, 2018

lanzagar commented Feb 1, 2019

janezd commented Feb 1, 2019

janezd commented Feb 15, 2019

Naive Bayes is confounded by variable reuse. #2943

Naive Bayes is confounded by variable reuse. #2943

Comments

ales-erjavec commented Mar 6, 2018

Orange version

Expected behavior

Actual behavior

Steps to reproduce the behavior

Additional info (worksheets, data, screenshots, ...)

lanzagar commented Feb 1, 2019

janezd commented Feb 1, 2019

janezd commented Feb 15, 2019