[ENH] Distances: Optimize PearsonR/SpearmanR #2852

ales-erjavec · 2018-01-05T11:16:44Z

Issue

PearsonR is implemented inefficiently, as a double for loop in Python.
SpearmanR quadruples its workload only to throw most of it away.
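
To illustrate the second point, here is a minimal sketch. It assumes the old code path handed both inputs to `scipy.stats.spearmanr`, and the array names and sizes are made up for the example. With two 2-D inputs, `spearmanr` stacks the variables and returns the full correlation matrix, of which only the cross block is needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.random((50, 3))  # 50 observations of 3 variables (illustrative data)
b = rng.random((50, 3))  # 50 observations of 3 other variables

# spearmanr correlates all 3 + 3 stacked variables with each other.
full = stats.spearmanr(a, b)[0]

# Only the block relating a's variables to b's variables is of interest;
# the two within-array blocks and the mirrored cross block are discarded.
cross = full[:3, 3:]
```

When both inputs have the same number of variables, the full matrix has four times as many entries as the block that is kept.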

Description of changes

Use numpy.corrcoef for PearsonR
Implement variants of corrcoef/spearmanr that compute correlations between two arrays more efficiently (without also computing all correlations within each of the two arrays).
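
A rough sketch of the idea, not the code in this PR: the function names `corrcoef2_sketch` and `spearmanr2_sketch` are hypothetical, and variables are assumed to be in columns. Pearson correlations between two arrays can be obtained from centered cross products alone, and Spearman correlations by applying the same computation to per-column ranks:

```python
import numpy as np
from scipy import stats


def corrcoef2_sketch(a, b):
    """Pearson correlations between columns of `a` and columns of `b`.

    Hypothetical helper for illustration; returns an array of shape
    (a.shape[1], b.shape[1]). Constant columns yield NaN (division by zero).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each variable (column) around its mean.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Cross products between every column of a and every column of b.
    cross = a.T @ b
    # Normalize by the product of the centered column norms.
    norm_a = np.sqrt((a * a).sum(axis=0))
    norm_b = np.sqrt((b * b).sum(axis=0))
    return cross / np.outer(norm_a, norm_b)


def spearmanr2_sketch(a, b):
    """Spearman correlations: Pearson correlations of per-column ranks."""
    ranks_a = np.apply_along_axis(stats.rankdata, 0, np.asarray(a))
    ranks_b = np.apply_along_axis(stats.rankdata, 0, np.asarray(b))
    return corrcoef2_sketch(ranks_a, ranks_b)
```

On inputs without constant columns, the results should agree with the corresponding off-diagonal block of `numpy.corrcoef(a, b, rowvar=False)` and of `scipy.stats.spearmanr(a, b)`, without computing the within-array blocks.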

Includes

Code changes
Tests
Documentation

codecov-io · 2018-01-11T08:52:21Z

Codecov Report

Merging #2852 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2852      +/-   ##
==========================================
+ Coverage   81.91%   81.92%   +0.01%     
==========================================
  Files         326      326              
  Lines       55997    56031      +34     
==========================================
+ Hits        45868    45903      +35     
+ Misses      10129    10128       -1

thocevar

This looks ok. Do you have any measurements of how much we gain with this re-implementation of what we previously handed over to numpy and scipy?

thocevar · 2018-02-23T14:10:19Z

Orange/distance/distance.py

+                rho = rho[:2, :2].copy()
+            else:
+                # scalar if n1 == 1
+                rho = stats.spearmanr(x1, axis=self.axis)[0]


Are these two cases (if, else) necessary? At first glance stats.spearmanr seems to (efficiently) handle the case of a missing second attribute.

Never mind.

ales-erjavec · 2018-02-26T08:59:20Z

In [1]: import Orange, numpy

In [2]: d = Orange.data.Table(numpy.random.random(size=(200, 200)))

In [3]: %timeit Orange.distance.PearsonR(d)

Before

1.59 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

614 µs ± 84.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [1]: import Orange, numpy

In [2]: d = Orange.data.Table(numpy.random.random(size=(400, 200)))

In [3]: %timeit Orange.distance.SpearmanR(d[1:], d[:-1])

Before

201 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

47.9 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ales-erjavec force-pushed the distances-correlations-optimization branch from 0e915a6 to 1d5c34f Compare January 11, 2018 08:52

ales-erjavec added 3 commits January 16, 2018 14:24

distance: Speed and memory optimization

b526400

* Use numpy.corrcoef in PearsonR * Optimize PearsonR/SpearmanR when computing pairwise distances on a single input table

distances: Optimize PearsonR/SpearmanR ...

3d10937

... for the case where computing distances from two tables.

tests: Add tests for implementations of _corrcoef2 and _spearmanr2

f837920

ales-erjavec force-pushed the distances-correlations-optimization branch from 1d5c34f to f837920 Compare January 16, 2018 13:25

lanzagar added this to the 3.10 milestone Feb 7, 2018

thocevar self-assigned this Feb 23, 2018

lanzagar modified the milestones: 3.10, 3.11 Feb 23, 2018

thocevar reviewed Feb 23, 2018

thocevar merged commit b09b282 into biolab:master Feb 26, 2018

ales-erjavec deleted the distances-correlations-optimization branch February 26, 2018 17:39
