Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Avoid slowdowns by caching Domain.__eq__ #6764

Merged
merged 5 commits into from
Aug 23, 2024

Conversation

markotoplak
Copy link
Member

@markotoplak markotoplak commented Mar 19, 2024

Issue

Per Quasars/orange-spectroscopy#713 domain comparison can be so slow that it seems that Orange crashed. They are slow especially when multiple domain transformations are chained and if there are many attributes.

A demo to reproduce the issue follows. This __eq__ implementation follows how we implemented __eq__ for PCA and PLS, but there, fortunately, because scikit-learn does not properly implement __eq__s we do not see this issue.

What follows is the simplest SharedComputeValue with __eq__ that ensures that the input domain is the same (as all our domain transformations).

import time
import numpy as np

from Orange.data import Domain, Table
from Orange.data.util import SharedComputeValue


class SelectColumn(SharedComputeValue):

    def __init__(self, feature, commonfn):
        super().__init__(commonfn)
        self.feature = feature

    def compute(self, data, common):
        return common[:, self.feature]

    def __eq__(self, other):
        return super().__eq__(other) and self.feature == other.feature

    def __hash__(self):
        return hash((super().__hash__(), self.feature))


class IncrementAll:

    def __init__(self, domain):
        self.domain = domain

    def __call__(self, data):
        if data.domain != self.domain:
            data = data.transform(self.domain)
        return data.X + 1

    def __eq__(self, other):
        return type(self) is type(other) and self.domain == other.domain

    def __hash__(self):
        return hash((type(self), self.domain))


def increment(data):
    common = IncrementAll(data.domain)
    atts = [a.copy(compute_value=SelectColumn(i, common))
            for i, a in enumerate(data.domain.attributes)]
    domain = Domain(atts, data.domain.class_vars,
                                data.domain.metas)
    return data.transform(domain

d = Table.from_numpy(None, X=np.random.random((10, 100)))
d1 = d2 = d

for i in range(5):
    t = time.time()
    d1 = increment(d1)
    d2 = increment(d2)
    assert d1.domain == d2.domain
    print("depth:", i+1, "  time:", time.time() - t)

On this branch, I see:

depth: 1   time: 0.09512972831726074
depth: 2   time: 0.006474971771240234
depth: 3   time: 0.006785392761230469
depth: 4   time: 0.006263017654418945
depth: 5   time: 0.006262540817260742

While on master I get (each increase in depth makes it 100 times slower):

depth: 1   time: 0.09968352317810059
depth: 2   time: 0.04008889198303223
depth: 3   time: 3.431462526321411
depth: 4   time: 356.5397837162018
depth: 5   I STOPPED IT
Description of changes

I implemented caching in Domain.__eq__. We deem domains read-only and already cache __hash__, so why not this one too?

The first commit refactors id-based caching used in table.py for use elsewhere.

Includes
  • Code changes
  • Tests
  • Documentation

Copy link

codecov bot commented Jul 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.27%. Comparing base (09df730) to head (eee8618).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6764   +/-   ##
=======================================
  Coverage   88.26%   88.27%           
=======================================
  Files         326      326           
  Lines       71142    71178   +36     
=======================================
+ Hits        62797    62830   +33     
- Misses       8345     8348    +3     

@markotoplak markotoplak changed the title Cache __eq__ in Domain [FIX] Avoid slowdowns by caching Domain.__eq__ Jul 26, 2024
@markotoplak markotoplak force-pushed the domain-cache-eq branch 2 times, most recently from 975fa5e to b87741e Compare July 26, 2024 12:44
@markotoplak markotoplak marked this pull request as ready for review July 26, 2024 12:48
@markotoplak
Copy link
Member Author

This is now ready for review.

@markotoplak
Copy link
Member Author

This PR is ready for review. :)

@janezd janezd merged commit b981d76 into biolab:master Aug 23, 2024
24 of 31 checks passed
@janezd
Copy link
Contributor

janezd commented Aug 23, 2024

Your modifications look OK (I like the dict with max 10 items!), and tests pass, so I merged the PR.

But I had my fingers tightly crossed: with a change so deep down, there may be problems we haven't thought about. On the other hand, if something was wrong in such important part of the code, it would for sure trip some test (beyond those where the test itself was incorrect).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants