-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Improved Sparsity Handling #2341
[ENH] Improved Sparsity Handling #2341
Conversation
4b65c4e
to
80475e7
Compare
That PR should fix #2359 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When someone wants to transform two columns from sparse X to dense Y she or he gets only one Y column which has got double the number of rows.
table = Table("iris")
table.X = sp.csr_matrix(table.X)
attributes = table.domain.attributes[:2] + table.domain.class_vars
class_vars = table.domain.attributes[2:4]
newtable = table.transform(Domain(attributes=attributes, class_vars=class_vars))
self.assertEqual(newtable.Y.shape[0], table.Y.shape[0]) |
125a142
to
71fc394
Compare
@jerneju nice catch. Should be fixed not, please check. |
Orange/tests/test_domain.py
Outdated
@@ -463,6 +463,42 @@ def test_domain_conversion_is_fast_enough(self): | |||
|
|||
self.assertLessEqual(time() - start, 1) | |||
|
|||
def test_domain_conversion_sparsity(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually add something like this:
"""
Description.
GH-2341
"""
942f527
to
120f92e
Compare
Orange/data/variable.py
Outdated
.. attribute:: sparse | ||
|
||
A flag about sparsity of the variable. When set, the variable suggests | ||
it is should be stored in a sparse matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed 😄
4d20017
to
c23bf8d
Compare
Codecov Report
@@ Coverage Diff @@
## master #2341 +/- ##
=========================================
+ Coverage 76.09% 76.1% +0.01%
=========================================
Files 338 338
Lines 59723 59760 +37
=========================================
+ Hits 45444 45479 +35
- Misses 14279 14281 +2 |
8488448
to
68594d1
Compare
67c4efb
to
1d9c77a
Compare
Orange/data/domain.py
Outdated
@@ -63,6 +75,19 @@ def __init__(self, source, destination): | |||
source.index(var) if var in source | |||
else var.compute_value for var in destination.metas] | |||
|
|||
def is_sparse(feats): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should_be_sparse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That shoulds better 👍 Fixed.
Orange/data/table.py
Outdated
@@ -280,45 +280,77 @@ def from_table(cls, domain, source, row_indices=...): | |||
|
|||
global _conversion_cache | |||
|
|||
def get_columns(row_indices, src_cols, n_rows, dtype=np.float64, | |||
is_sparse=False): | |||
def array_transform(x, to_sparse): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these functions be moved someplace else? They seem like general functions that ensure array/column is sparse/dense. Maybe change to a more expressive name.
Orange/data/domain.py
Outdated
def is_sparse(feats): | ||
""" For a matrix to be stored in sparse, more than 2/3 of columns | ||
should be marked as sparse and there should be no string columns | ||
since Scipy's sparse matrices don't support dtype=object.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least the closing triple quotes of a multi-line docstring should be in their own line (PEP257).
(for opening quotes try to mimic other docstrings nearby)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to match the style as in nerby docstrings.
Orange/data/domain.py
Outdated
""" For a matrix to be stored in sparse, more than 2/3 of columns | ||
should be marked as sparse and there should be no string columns | ||
since Scipy's sparse matrices don't support dtype=object.""" | ||
fraction_sparse = sum(f.sparse for f in feats)/max(len(feats), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space around operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed :)
1d9c77a
to
20ae6d6
Compare
@astaric, @markotoplak I split the functionality of I haven't moved the functions yet, though. I will do this once we settle on the best approach. Perhaps to some util library? |
e4affcb
to
4aa7120
Compare
DomainConversion now has three more attributes `sparse_X`, `sparse_Y` and `sparse_metas`, which suggest whether the resulting matrix should be sparse or dense.
Table.from_table now sets the sparsity of X, Y and metas as determined by DomainConversion.
When we return subarryas, the flag `is_sparse` wasn't considered, but we simpy returned the subarray in it's original format. Also, make sure subarrays aren't flattened to 1d, as it is required for columns.
Calling len on scipy sparse matrices causes the error: `TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]`.
... to alleviate interdependencies between tests in differente files. Before that, one tests could mark some attributes of iris as sparse which could cause tests in different files to crash due to variable "reusing".
4aa7120
to
2b8d9a8
Compare
2b8d9a8
to
4bf6a20
Compare
Issue
Fixes #2322.
Blocked on #2734. Blocked on #2736.
Description of changes
DomainConversion
now has three extra attributes telling whether X, Y, or metas should be sparse according to the features inside of them. Metas can only be sparse if no StringVariables are present.Table.from_table
determines the density of X, Y, and metas according to domain conversion.to_sparse
andto_dense
methods toTable
.Includes