
Cell array of arrays of doubles cannot be fetched in python, only in matlab #1098

Open
renanmcosta opened this issue Jul 6, 2023 · 8 comments
renanmcosta commented Jul 6, 2023

Bug Report

Description

Fetching fails in Python when each entry for a given attribute (defined in MATLAB) is a cell array whose elements are arrays of doubles. Fetching in MATLAB works as expected.

Reproducibility

Windows, Python 3.9.13, DataJoint 0.13.8

Steps:

  1. Define and populate a table in MATLAB containing an attribute such as:
     `epoch_pos_range=null : blob # list of y position ranges corresponding to n epochs in epoch_list, (e.g., {[y_on y_off],[y_on y_off]} for epoch_list {'epoch1','epoch2'})`
  2. Fetch in matlab (works as intended)
  3. Attempt to fetch in python (throws a reshaping error for the array)

Error stack:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 VM['opto'].OptoSession.fetch('epoch_pos_range')

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
    227             attributes = [a for a in attrs if not is_key(a)]
    228             ret = self._expression.proj(*attributes)
--> 229             ret = ret.fetch(
    230                 offset=offset,
    231                 limit=limit,

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\fetch.py in __call__(self, offset, limit, order_by, format, as_dict, squeeze, download_path, *attrs)
    287                 for name in heading:
    288                     # unpack blobs and externals
--> 289                     ret[name] = list(map(partial(get, heading[name]), ret[name]))
    290                 if format == "frame":
    291                     ret = pandas.DataFrame(ret).set_index(heading.primary_key)

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\fetch.py in _get(connection, attr, data, squeeze, download_path)
    108         if attr.uuid
    109         else (
--> 110             blob.unpack(
    111                 extern.get(uuid.UUID(bytes=data)) if attr.is_external else data,
    112                 squeeze=squeeze,

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in unpack(blob, squeeze)
    603         return blob
    604     if blob is not None:
--> 605         return Blob(squeeze=squeeze).unpack(blob)

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in unpack(self, blob)
    127         blob_format = self.read_zero_terminated_string()
    128         if blob_format in ("mYm", "dj0"):
--> 129             return self.read_blob(n_bytes=len(self._blob) - self._pos)
    130
    131     def read_blob(self, n_bytes=None):

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in read_blob(self, n_bytes)
    161                 % data_structure_code
    162             )
--> 163         v = call()
    164         if n_bytes is not None and self._pos - start != n_bytes:
    165             raise DataJointError("Blob length check failed! Invalid blob")

c:\Users\admin\.conda\envs\sandbox\lib\site-packages\datajoint\blob.py in read_cell_array(self)
    493         return (
    494             self.squeeze(
--> 495                 np.array(result).reshape(shape, order="F"), convert_to_scalar=False
    496             )
    497         ).view(MatCell)

ValueError: cannot reshape array of size 4 into shape (1,2)
```
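The failure mode in the last frame can be reproduced with plain numpy. A minimal sketch (the shapes are assumed, modeled on the `{[y_on y_off],[y_on y_off]}` example from the attribute comment): `np.array` stacks the two cell entries into one rectangular array, so the element count no longer matches the cell array's own shape.

```python
import numpy as np

# Assumed data: a 1x2 MATLAB cell holding two [y_on y_off] double arrays.
result = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
shape = (1, 2)  # the cell array's own shape, as read from the blob header

merged = np.array(result)  # numpy stacks the sub-arrays: shape (2, 2), size 4
try:
    merged.reshape(shape, order="F")  # size 4 cannot fit into shape (1, 2)
except ValueError as err:
    print(err)
```

This is exactly the `cannot reshape array of size 4 into shape (1,2)` seen above.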
renanmcosta added the bug label Jul 6, 2023
kabilar (Contributor) commented Jul 8, 2023

Thanks for the report, @renanmcosta. Typically the MATLAB cell array gets properly packed and unpacked. We have not encountered the error that you reported. We will investigate further and get back to you.

dimitri-yatsenko self-assigned this Jul 17, 2023
renanmcosta (Author) commented Jul 25, 2023

For now I've managed to fetch with the temporary fix below. I don't think it's very robust, but I'm copying it here in case it's informative.

```python
def read_cell_array(self):
    """deserialize MATLAB cell array"""
    n_dims = self.read_value()
    shape = self.read_value(count=n_dims)
    n_elem = int(np.prod(shape))
    result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
    if n_elem != len(np.ravel(result, order="F")):
        # not all elements are scalars; this shouldn't work for ragged arrays
        shape = (-1,) + tuple(shape[1:n_dims])
    return (
        self.squeeze(
            np.array(result).reshape(shape, order="F"), convert_to_scalar=False
        )
    ).view(MatCell)
```
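On a toy version of the data, the `-1` reshape in this fix avoids the error, but at the cost of flattening each cell entry into the leading dimension; this is also why a 1×2 cell of 10×5370×10 arrays can come back as a (537000, 2) array, as reported later in this thread. A sketch with assumed shapes:

```python
import numpy as np

# Assumed data: a 1x2 cell of 2x3 arrays, as the temp fix sees it.
result = [np.ones((2, 3)), 2 * np.ones((2, 3))]
shape = (1, 2)                                # original cell-array shape

merged = np.array(result)                     # shape (2, 2, 3), size 12
fixed_shape = (-1,) + tuple(shape[1:])        # becomes (-1, 2)
out = merged.reshape(fixed_shape, order="F")  # shape (6, 2): cell structure is lost
```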

Paschas commented Mar 8, 2024

Greetings,

I have just encountered the same problem, and the temp fix seems to work (thanks a lot, @renanmcosta).

The temporary fix returns an array, but with shape (537000, 2). In MATLAB it is a 1×2 cell array: {10×5370×10 single} {10×5370×10 single}.

type(temp_fixed) --> datajoint.blob.MatCell

Am I able to retrieve the original dimensions, or is this a robustness problem of the temporary fix?

Thanks in advance

dimitri-yatsenko (Member) commented
Hi @Paschas, could you update us on this? We are looking to resolve this.

renanmcosta (Author) commented
> Am I able to retrieve the original dimensions, or is this a robustness problem of the temporary fix?

The temp fix is responsible for the shape differences there. Lately I have been using a simpler fix, which shouldn't collapse any dimensions. This one should always work, though it can occasionally lead to awkward array nesting.

```python
import datajoint as dj
import numpy as np


def fix_cell_array_fetch():
    """Fixes bug that prevents cell arrays from being fetched in Python in certain
    cases. Replaces the cell-array unpacking method in the datajoint module with a
    working version.
    """

    class Blob(dj.blob.Blob):
        def read_cell_array(self):
            """deserialize MATLAB cell array"""
            n_dims = self.read_value()
            shape = self.read_value(count=n_dims)
            n_elem = int(np.prod(shape))
            result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
            return (
                self.squeeze(np.array(result, dtype="object"), convert_to_scalar=False)
            ).view(dj.blob.MatCell)

    dj.blob.Blob = Blob
```
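For illustration, the object-dtype conversion at the heart of this fix keeps ragged cell contents intact instead of stacking them into one rectangular array. A sketch with assumed data:

```python
import numpy as np

# Ragged cell contents that numpy cannot stack into a single rectangular array.
result = [np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])]

# With dtype="object", each cell keeps its own shape inside a 1-D container.
arr = np.array(result, dtype="object")  # arr.shape == (2,)
```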

dimitri-yatsenko (Member) commented
Let's see if we can incorporate this in this coming release.

Paschas commented Sep 19, 2024

Greetings @dimitri-yatsenko & @renanmcosta

Without @renanmcosta's fixes I used to get two types of errors:

```
in Blob.read_cell_array(self)
    [493] n_elem = int(np.prod(shape))
    [494] result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
    [495] return (
    [496]     self.squeeze(
    [497]         # np.array(result).reshape(shape, order="F"), convert_to_scalar=False
    [498]         # np.array(result).reshape(shape, order="C"), convert_to_scalar=False
--> [499]         np.array(result).reshape(shape, order="A"), convert_to_scalar=False
    [500]
    [501] )
    [502] .view(MatCell)

ValueError: cannot reshape array of size 2560 into shape (1,10)
```

or

```
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (4,) + inhomogeneous part.
```

The `fix_cell_array_fetch()` approach is working, but I would be cautious (thanks again, @renanmcosta).

On a different but similar occasion, the arrays had the correct shape but the data were shuffled; eventually I changed the following:

```python
# line 243 of blob.py
def read_array(self):
    ...
    # changed nothing here
    ...
    return self.squeeze(data.reshape(shape, order="C"))  # it was order="F"
```
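For context on that change: `order="F"` (column-major, MATLAB's native layout) and `order="C"` (row-major, numpy's default) interpret the same flat values differently, which is why a wrong order shows up as shuffled data rather than an error. A small sketch:

```python
import numpy as np

data = np.arange(6)                          # the flat values as stored in the blob
col_major = data.reshape((2, 3), order="F")  # [[0, 2, 4], [1, 3, 5]] -- MATLAB layout
row_major = data.reshape((2, 3), order="C")  # [[0, 1, 2], [3, 4, 5]] -- numpy default
```

Since MATLAB writes blobs column-major, `"F"` is normally the correct choice; changing blob.py globally to `"C"` may un-shuffle one dataset while scrambling standard MATLAB-written blobs, so treat this edit with caution.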

renanmcosta (Author) commented
We just found a new case where the latest approach I posted above still raises a ValueError, e.g.:

```
ValueError: could not broadcast input array from shape (3,) into shape (1,)
```

It happens when the first dimension of each entry is the same, and appears to be a limitation of numpy (discussion). Ultimately, the problem is that MATLAB cell arrays and numpy arrays are intended as different kinds of objects, and as a result MATLAB cell arrays can be ragged in ways that numpy is unwilling to support. Here's my current solution, which should hopefully retain the structure of each entry:

```python
import datajoint as dj
import numpy as np


class fixed_Blob(dj.blob.Blob):
    def read_cell_array(self):
        """deserialize MATLAB cell array"""
        n_dims = self.read_value()
        shape = self.read_value(count=n_dims)
        n_elem = int(np.prod(shape))
        result = [self.read_blob(n_bytes=self.read_value()) for _ in range(n_elem)]
        arr = np.empty(n_elem, dtype="object")
        arr[:] = result
        return (self.squeeze(arr, convert_to_scalar=False)).view(dj.blob.MatCell)
```
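The pre-allocation trick matters because `np.array(result, dtype="object")` can still try to broadcast entries whose leading dimensions match; assigning into an empty object array sidesteps that. A sketch with assumed shapes:

```python
import numpy as np

# Assumed entries sharing a leading dimension, which np.array(..., dtype="object")
# may try (and fail) to merge into one array.
result = [np.zeros((1, 3)), np.zeros((1, 2))]

arr = np.empty(len(result), dtype="object")
arr[:] = result  # element-wise assignment; each entry keeps its original shape
```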
