Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Faster normalization #6202

Merged
merged 6 commits into from
Nov 21, 2022

Conversation

markotoplak
Copy link
Member

@markotoplak markotoplak commented Nov 14, 2022

Issue

Fixes #5219

Description of changes

The computation of distributions is slow (it works through contingencies and requires sorting).

This slowdown was especially problematic for dask tables.

The new code does not currently support weights! Supporting them would require making a weighted variance computation. Should we go there? Done in #6205.

Requires #6204 and #6205.

Testing performance on 1000x1000 tables.

Before:

[normalize_only_parameters] with 5 loops, best of 3:
	min 233 msec per loop
	avg 235 msec per loop
[normalize_only_transform] with 5 loops, best of 3:
	min 120 msec per loop
	avg 121 msec per loop

After:

[normalize_only_parameters] with 5 loops, best of 3:
	min 71.7 msec per loop
	avg 73.3 msec per loop
[normalize_only_transform] with 5 loops, best of 3:
	min 117 msec per loop
	avg 119 msec per loop

The mean/variance estimation part is 3x faster. Of course, this PR has no influence on the actual transformation (the domain contains the same variables).

Includes
  • Code changes
  • Tests
  • Documentation

@markotoplak markotoplak added the dask Related (discovered in or needed) to the Dask adaptation label Nov 14, 2022
@markotoplak markotoplak marked this pull request as draft November 14, 2022 14:06
@codecov
Copy link

codecov bot commented Nov 14, 2022

Codecov Report

Merging #6202 (d212f84) into master (7d18600) will decrease coverage by 0.00%.
The diff coverage is 96.66%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6202      +/-   ##
==========================================
- Coverage   86.66%   86.66%   -0.01%     
==========================================
  Files         316      316              
  Lines       68021    68027       +6     
==========================================
+ Hits        58951    58955       +4     
- Misses       9070     9072       +2     

@markotoplak markotoplak force-pushed the normalize-basic-stats branch 3 times, most recently from 99cd396 to dcebac3 Compare November 17, 2022 12:24
@markotoplak markotoplak changed the title [ENH] Faster normalization [NOMERGE][ENH] Faster normalization Nov 17, 2022
@markotoplak markotoplak marked this pull request as ready for review November 17, 2022 13:56
@markotoplak markotoplak changed the title [NOMERGE][ENH] Faster normalization [ENH] Faster normalization Nov 18, 2022
@pavlin-policar
Copy link
Collaborator

Looks good to me.

@pavlin-policar pavlin-policar merged commit 9f83570 into biolab:master Nov 21, 2022
@markotoplak markotoplak deleted the normalize-basic-stats branch November 21, 2022 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Related (discovered in or needed) to the Dask adaptation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Normalization is inefficient in memory and time
2 participants