[python/r/c++] Revisit shape for component arrays #2407

Open
johnkerl opened this issue Apr 8, 2024 · 1 comment

johnkerl commented Apr 8, 2024

PRs

PRs in process:

PRs to be created:

  • RFC for simplified domain on DataFrame create
  • Audit on docstrings, vignettes, etc.
  • Example notebooks & other doc material -- be sure to link from the TileDB-SOMA 1.15 release notes

Merged PRs:

Closed/abandoned PRs:

Issues which are related but non-blocking:

See also: [sc-51048].

Problem to be solved

Users want to know the shape of an array, in the SciPy sense:

  • Reads and writes are bounds-checked against the shape
  • This retains its value regardless of which values of a sparse array are or are not actually occupied
  • Users can resize.
    • Some users need the ability to grow their datasets later, using either tiledbsoma.io's append mode, or subsequent writes using the tiledbsoma API.
    • Note that the cellxgene census doesn't need this: each week's published census has fixed shape, and any updates will happen in new storage, on a new week.

Using TileDB-SOMA up until the present:

  • The TileDB domain is immutable after array creation
    • This does bounds-checking for reads and writes, which is good
    • To leverage this to function as a shape, users would need to set the domain at array-creation time. However, users would then lose the ability to grow their datasets later.
  • There is a non_empty_domain accessor
    • This only indicates min/max coordinates at which data exists. Consider an X array for 100 cells and 200 genes. If non-zero expression counts exist only for cell join IDs 2-17, then the non_empty_domain will indicate (2,17) along soma_dim_0.
    • Consider an obsm["X_pca"] within the same experiment. This may be 100 cells by 50 PCA components: we need a place to store the number 50.
    • Therefore users cannot leverage this to function as a shape accessor.
  • We have offered a used_shape accessor since TileDB-SOMA 1.5.
    • This functions as a shape accessor, in the SciPy sense, but it is not multi-writer safe.
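A minimal sketch of the gap between non_empty_domain and a true shape, using the 100-cell by 200-gene X example above. This is plain Python rather than the tiledbsoma API: the helper name non_empty_domain_2d and the sample coordinates are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Optional, Tuple

Bounds = Tuple[int, int]


def non_empty_domain_2d(
    coords: Iterable[Tuple[int, int]],
) -> Optional[Tuple[Bounds, Bounds]]:
    """Return ((min0, max0), (min1, max1)) over occupied cells, or None if empty."""
    coords = list(coords)
    if not coords:
        return None
    dim0 = [i for i, _ in coords]
    dim1 = [j for _, j in coords]
    return ((min(dim0), max(dim0)), (min(dim1), max(dim1)))


# The shape the user has in mind: 100 cells x 200 genes.
intended_shape = (100, 200)

# Hypothetical occupied coordinates: non-zero counts only for cell join IDs 2..17.
occupied = [(2, 5), (9, 120), (17, 199)]

ned = non_empty_domain_2d(occupied)

# Guessing a shape from the non-empty domain undercounts along soma_dim_0.
guessed_shape = tuple(hi + 1 for _, hi in ned)
```

Here ned is ((2, 17), (5, 199)), so the guessed shape is (18, 200) rather than the intended (100, 200): the number 100 is not recoverable from the occupied cells alone.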

New feature for TileDB-SOMA 1.15:

  • Arrays will have a shape
  • Reads and writes are bounds-checked against the shape
  • This retains its value regardless of which values of a sparse array are or are not actually occupied
  • Users can resize
  • The used_shape accessor will be deprecated in TileDB-SOMA 1.13, and slated for removal in TileDB-SOMA 1.14.

Compatibility:

This will now require users to do an explicit resize before appending/growing TileDB-SOMA Experiments. Guidance in the form of example notebooks will be provided.

Tracking

See also: [sc-41074] and [sc-51048].

Scheduling

Support arrives in TileDB Core 2.25. Deprecations for TileDB-SOMA will be released with 1.13. Full support within TileDB-SOMA will be released in 1.14.

Details

SOMA API mods as we've discussed in a Google doc are as follows.

SOMADataFrame

  • create: Retain the domain argument
    • Issue:
      • Core has a (lo, hi) tuple per dim, e.g. (0,99) or (10,19)
      • SOMA has count per dim, with 0 implicit: e.g. 100 or 20
      • For SparseNDArray and DenseNDArray core can have (lo, hi) and SOMA can have count
      • For DataFrame there can be multiple dims --- default is a single soma_joinid
      • That could be treated either in (lo, hi) fashion or count fashion
      • However, additional dims (e.g. cell_type) can be of any type, including strings, floats, etc., where there is no implicit lo=0
      • Therefore we need to keep the current SOMA API wherein DataFrame takes a domain argument (in (lo, hi) fashion) and not a shape argument (in count fashion)
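The count-versus-(lo, hi) distinction above can be sketched as follows. The helper shape_to_domain and the float-valued dimension are hypothetical illustrations, not part of the tiledbsoma or core APIs.

```python
from typing import Sequence, Tuple


def shape_to_domain(shape: Sequence[int]) -> Tuple[Tuple[int, int], ...]:
    """Map a count-style shape to core-style (lo, hi) pairs, with lo implicitly 0."""
    if any(n < 1 for n in shape):
        raise ValueError("each count must be at least 1")
    return tuple((0, n - 1) for n in shape)


# NDArray case: integer dims only, so a count per dim is enough.
ndarray_domain = shape_to_domain((100, 20))

# DataFrame case: soma_joinid could be written either way, but an extra
# non-integer dim (hypothetical float-valued example here) has no implicit
# lo=0, so only the (lo, hi) form can describe it.
dataframe_domain = (
    (0, 99),      # soma_joinid
    (-2.5, 2.5),  # a float-valued index column
)
```

shape_to_domain((100, 20)) gives ((0, 99), (0, 19)). No count-style equivalent exists for the second dataframe_domain slot, which is the reason DataFrame keeps the domain argument.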

SparseNDArray and DenseNDArray

  • create
    • Have an optional shape argument which is of type Tuple[Int,...] where each element is the cell count of the corresponding dimension
      • If unsupplied, or if supplied but None in any slot: use the minimum 0 in each slot – nothing larger makes sense since we will not support downsize
    • User guidance should make clear that it will not be possible to create an ‘old’ style array with the ‘new style’ API. (See also the upgrade logic below.)

All three of SOMADataFrame, SparseNDArray, DenseNDArray

  • write
    • For new arrays, created with the new shape feature:
      • Core will bounds-check that coordinates provided at write time are within the current shape
      • Core will raise tiledb.cc.TileDBError to TileDB-SOMA, which will catch it and raise IndexError on the Python side, with R-standard behavior on the R side
    • For old arrays created before this feature:
      • Core will not bounds-check that coordinates provided at write time are within the current shape
  • Existing used_shape accessor
    • TileDB-SOMA will deprecate this over a release cycle.
    • For new arrays: raise NotImplementedError
    • For old arrays: return what’s currently returned, with a deprecation warning.
    • Mechanism for determining old vs. new: array.schema.version (the core storage version).
  • Existing shape accessor
    • For new arrays:
      • Have this return the new shape as proposed by core, no longer returning the TileDB domain.
    • For old arrays created before this feature:
      • Return the TileDB domain as now.
  • Existing non_empty_domain accessor
    • Same behavior for old and new arrays (unaffected by this proposal).
    • Keep this accessor supported, but with user notes that it’s generally not useful as a shape
    • This should return None (or R equivalent) when there is a schema but no data have been written.
  • New maxshape accessor
    • Maps the core-level (lo, hi) accessor for domain to count-style accessor hi+1. E.g. if the core domain is either (0,99) or (50,99) then TileDB-SOMA maxshape will say 100.
    • Same behavior for old and new arrays.
    • Let users query for what the TileDB domain is, with user notes that it’s the maximum that users can reshape to.
    • Issac suggests: maybe domain or maxshape (see h5py).
  • New resize mutator
    • Note: reshape means something else in the community (numpy, zarr, h5py), e.g. a 5x20 (total 100 cells) being reinterpreted as 4x25 (still 100 cells). The standard name for changing cell-count is resize.
    • For old arrays created before this feature: raise NotImplementedError.
    • For new arrays:
      • Will raise ValueError if the new shape is smaller on any dim than currently in storage
      • Regardless of whether any data have been written whatsoever
      • Will raise ValueError if the new shape exceeds the TileDB domain from create time, which will serve TileDB-SOMA in a role of “max possible shape the user can reshape to”
      • Otherwise, any calls to write from this point will bounds-check writes within this new shape
      • We don’t expect resize to be multi-writer safe with regard to write; user notes must be clear on this point
  • New tiledbsoma_upgrade_shape method for SparseNDArray and DenseNDArray
    • This will leverage array.schema.version to see if an upgrade is needed
    • Leverage core support for storage-version updates
    • This will take a shape argument as in create
    • For arrays created with “just-right” size: this will succeed
    • For arrays created with “room-for-growth” / “two billion-ish” size: this will succeed
    • If the user passes a shape which exceeds the current TileDB domain: this will fail
  • New tiledbsoma_upgrade_domain method for DataFrame
    • Same as for SparseNDArray/DenseNDArray except it will take a domain at the SOMA-API level just as DataFrame's create method
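To make the proposed new-array semantics concrete, here is a toy in-memory model of shape, maxshape, bounds-checked write, and resize. It is a sketch under stated assumptions, not the tiledbsoma implementation: the class name ShapedArraySketch is hypothetical, and 2**31 - 1 merely stands in for the "two billion-ish" domain.

```python
from typing import Dict, Sequence, Tuple


class ShapedArraySketch:
    """Toy model of the proposed semantics; not the tiledbsoma API."""

    def __init__(self, shape: Sequence[int], maxshape: Sequence[int]) -> None:
        if len(shape) != len(maxshape):
            raise ValueError("shape and maxshape must have the same rank")
        if any(s > m for s, m in zip(shape, maxshape)):
            raise ValueError("shape exceeds maxshape")
        self._shape = tuple(shape)
        # Fixed at create time, playing the role of the immutable TileDB domain.
        self._maxshape = tuple(maxshape)
        self._cells: Dict[Tuple[int, ...], float] = {}

    @property
    def shape(self) -> Tuple[int, ...]:
        return self._shape

    @property
    def maxshape(self) -> Tuple[int, ...]:
        return self._maxshape

    def write(self, coord: Sequence[int], value: float) -> None:
        """Store one cell, bounds-checked against the current shape."""
        if len(coord) != len(self._shape) or any(
            not 0 <= c < s for c, s in zip(coord, self._shape)
        ):
            raise IndexError(f"{tuple(coord)} is outside shape {self._shape}")
        self._cells[tuple(coord)] = value

    def resize(self, new_shape: Sequence[int]) -> None:
        """Grow the shape; shrinking or exceeding maxshape raises ValueError."""
        if len(new_shape) != len(self._shape):
            raise ValueError("resize cannot change the number of dims")
        if any(n < s for n, s in zip(new_shape, self._shape)):
            raise ValueError("downsizing is not supported")
        if any(n > m for n, m in zip(new_shape, self._maxshape)):
            raise ValueError("new shape exceeds maxshape")
        self._shape = tuple(new_shape)


big = 2**31 - 1  # assumed stand-in for the "two billion-ish" domain
arr = ShapedArraySketch(shape=(100, 200), maxshape=(big, big))
arr.write((99, 199), 1.0)   # inside the current shape
arr.resize((150, 200))      # explicit growth before appending
arr.write((100, 0), 2.0)    # in bounds only after the resize
```

Writing to (100, 0) before the resize would raise IndexError, and resize((120, 200)) after the call above would raise ValueError because it shrinks dim 0, whether or not any data has been written there.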

tiledbsoma.io

  • The user-facing API has no shape arguments and thus won’t need changing.
  • Internally to tiledbsoma.io, we’ll still ask the tiledbsoma API for the “big domain” (2 billionish)
  • Append mode:
    • Will need a new resize method at the Experiment level
    • Users will need to:
      • Register as now
      • Call the experiment-level resize
        • Could be exp.resize(...), or (better) this could be tiledbsoma.io.reshape_experiment
    • In either case: this method will take the new obs and var counts as inputs:
      • exp.obs.reshape to new obs count
      • exp.ms[name].var.reshape to new var count
      • exp.ms[name].X[name].reshape to new obs count x var count
      • exp.ms[name].obsm[name].reshape to new obs count x same width
      • exp.ms[name].obsp[name].reshape to new obs count x obs count
      • exp.ms[name].varm[name].reshape to new var count x same width
      • exp.ms[name].varp[name].reshape to new var count x var count
    • Do the individual append-mode writes as now
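The per-component bookkeeping for an experiment-level resize can be sketched as a pure planning function. This is an illustration only: plan_experiment_resize, its parameters, and the string keys are hypothetical, and nothing here touches storage or calls tiledbsoma.

```python
from typing import Dict, Iterable, Mapping, Tuple


def plan_experiment_resize(
    n_obs: int,
    n_var: int,
    ms_name: str,
    x_names: Iterable[str],
    obsm_widths: Mapping[str, int],
    varm_widths: Mapping[str, int],
    obsp_names: Iterable[str] = (),
    varp_names: Iterable[str] = (),
) -> Dict[str, Tuple[int, ...]]:
    """Return the target shape for each component of one measurement."""
    ms = f"ms[{ms_name}]"
    plan: Dict[str, Tuple[int, ...]] = {
        "obs": (n_obs,),
        f"{ms}.var": (n_var,),
    }
    for name in x_names:
        plan[f"{ms}.X[{name}]"] = (n_obs, n_var)
    for name, width in obsm_widths.items():
        plan[f"{ms}.obsm[{name}]"] = (n_obs, width)  # width is unchanged
    for name in obsp_names:
        plan[f"{ms}.obsp[{name}]"] = (n_obs, n_obs)
    for name, width in varm_widths.items():
        plan[f"{ms}.varm[{name}]"] = (n_var, width)  # width is unchanged
    for name in varp_names:
        plan[f"{ms}.varp[{name}]"] = (n_var, n_var)
    return plan


# Growing a hypothetical experiment from 100 to 150 cells, still 200 genes.
plan = plan_experiment_resize(
    n_obs=150,
    n_var=200,
    ms_name="RNA",
    x_names=["data"],
    obsm_widths={"X_pca": 50},
    varm_widths={"PCs": 50},
    obsp_names=["distances"],
)
```

Each entry in plan would then be applied to the matching array before the append-mode writes, so that obs, var, X, obsm, obsp, varm, and varp stay mutually consistent.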
johnkerl self-assigned this Apr 8, 2024
johnkerl changed the title from "[python/r/c++] Revisit shape for sparse arrays" to "[python/r/c++] Revisit shape for sparse arrays [long-term tracker]" Apr 8, 2024
johnkerl changed the title from "[python/r/c++] Revisit shape for sparse arrays [long-term tracker]" to "[python/r/c++] Revisit shape for sparse arrays" May 15, 2024
@johnkerl
Member Author

#2785 is a quick-and-dirty concept-prover -- its sole function is to flush out any API misunderstandings we might have, in prep for 2.25.0 core release.
