Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Only deepcopy the .attributes for the outermost Table transformation #6849

Merged
merged 5 commits into from
Jul 26, 2024

Conversation

markotoplak
Copy link
Member

@markotoplak markotoplak commented Jul 8, 2024

Issue

#6189 introduced deep-copying of the dictionary Table.attributes at table transformations. Because table transformations often happen withing table transformation, this can lead to wasting computation and memory. An example of the bug that can happen is Quasars/orange-spectroscopy/pull/726.

For illustration, here is a small example:

import numpy as np
import Orange
from Orange.projection import PCA
from mem_use import MemUse

mem = MemUse()
mem.start()

data = Orange.data.Table(np.random.random((1000, 1000)))
data.attributes["big"] = [ i for i in range(100000) ]
mem.wait_gc("loaded data")

p = PCA(n_components=10)(data)

mem.wait_gc("pca")
mem.draw()

The memory use/time graphs below (first master, then this branch) show big differences (~30 seconds vs 1 second) that all stem from the wasteful use of deepcopy:

master
this-branch

Description of changes

The code now only copies .attributes only for the outermost transformations.

And the thing that should be discussed. This PR The previous version of the PR would introduce problems if deeper internal transformations wanted to read table.attributes. I intentionally left it so just to see if there are any problems. I could add support for them by just referencing that dict for non internal transformations, but if any were naughty and actually modified table.attributes, we'd have problems.

I discussed above with myself and noticed that there are no mechanisms for compute_values to set output table.attributes and that even before this PR, if any compute_value was particularly naughty, there were problems. The following test fails on master.

def test_attributes_with_naughty_compute_values(self):
    self.table.attributes["A"] = "Original"
    orig = copy.deepcopy(self.table.attributes)

    def changes_attributes(data: Table):
        self.table.attributes["A"] = "Changed"
        return data.get_column(data.domain.attributes[0])

    ndom = Domain([ContinuousVariable("naughty", compute_value=changes_attributes)])
    new_table = self.table.from_table(ndom, self.table)

    self.assertEqual(orig, new_table.attributes)

The last commit therefore ensures that .attributes are always set, but yes, as before, they should not be modified within the compute_value.

Includes
  • Code changes
  • Tests
  • Documentation

Copy link

codecov bot commented Jul 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.26%. Comparing base (910d625) to head (4460fbb).
Report is 75 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6849   +/-   ##
=======================================
  Coverage   88.26%   88.26%           
=======================================
  Files         326      326           
  Lines       71146    71151    +5     
=======================================
+ Hits        62796    62801    +5     
  Misses       8350     8350           

@JakaKokosar JakaKokosar merged commit 09df730 into biolab:master Jul 26, 2024
20 of 31 checks passed
@markotoplak markotoplak deleted the deepcopy-outermost branch July 30, 2024 11:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants