Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add remove sparse features preprocessor #4093

Merged
merged 6 commits into from
Nov 14, 2019

Conversation

AndrejaKovacic
Copy link
Contributor

@AndrejaKovacic AndrejaKovacic commented Oct 10, 2019

Issue
Description of changes

Preprocessor widget now enables removing sparse features. Threshold is up to the user. I thought we could use the same icon as feature selection and cur decomposition.

Includes
  • Code changes
  • Tests
  • Documentation

@janezd
Copy link
Contributor

janezd commented Oct 10, 2019

I like the preprocessor, but do we have a better name for it? :)

I didn't realize what it does before seeing the code.

@codecov
Copy link

codecov bot commented Oct 10, 2019

Codecov Report

Merging #4093 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4093      +/-   ##
==========================================
+ Coverage   85.75%   85.76%   +0.01%     
==========================================
  Files         389      389              
  Lines       69839    69899      +60     
==========================================
+ Hits        59892    59952      +60     
  Misses       9947     9947

"""

def __init__(self, treshold=0.05):
self.treshold = treshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to be this person, but perhaps for clarity and for maintaining a good API, I would correct the name to threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. This, along with length is in my top misspelled words.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always write 'widget' when I want to write 'width'. 🤷‍♀

@BlazZupan
Copy link
Contributor

I would support "sparse feature" in the name of the widget. One reason is that I cannot come up with a better one, and second because this term has been used in this context. For instance, googling '"sparse features" machine learning' returns 83.000 pages.

@ajdapretnar
Copy link
Contributor

This works well also on bag of words sparse type. I was wondering would it make sense to have also an int option. As in Max instances with nonzero values... 🤔 In the case of text mining this would then be similar to document frequency filter.

@AndrejaKovacic
Copy link
Contributor Author

Sounds good, we can add that in the second iteration of this preprocessor.

@markotoplak
Copy link
Member

@AndrejaKovacic, I added an icon. I reused something we already had in "Data". What do you think?

@AndrejaKovacic
Copy link
Contributor Author

I think it's fitting, better than select column.

@janezd
Copy link
Contributor

janezd commented Nov 11, 2019

@AndrejaKovacic, if you pull master and rebase, it should no longer crash on SOM. I apologize for inconvenience. :)

This is assigned to @BlazZupan. If UX is confirmed, I can check the code and merge.

h, w = data_csc.shape
sparsness = [data_csc[:, i].count_nonzero()/h for i in range(w)]
else:
sparsness = np.count_nonzero(data.X, axis=0)/ data.X.shape[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you'll be rebasing and pushing anyway, you can also add a space before /, so it's balanced. Bonus points for putting spaces before and after / two lines above.

self.changed.emit()

def parameters(self):
return {'sparse_thresh':self.sparse_thresh}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A space after : would be nice, too.

def createinstance(params):
params = dict(params)
threshold = params.pop('sparse_thresh', 5)
return RemoveSparse(threshold=threshold/100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And just a pair of spaces here and I'm merging - when it's rebased so it passes SOM.

@janezd janezd merged commit 0f730b4 into biolab:master Nov 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants