File-Augmented Schema #1151

Open
ethho opened this issue Feb 16, 2024 · 2 comments

@ethho
Contributor

ethho commented Feb 16, 2024

Feature Request

Problem Statement

While the database provides data structure, efficient queries, and transaction support, files are still preferred for storing large objects such as images, numerical arrays, and movies. Users like to have direct read-only access to the files without mediation by the database. Storing large objects in MySQL tables also degrades query performance.
DataJoint has previously implemented several approaches to address some aspects of this problem:

  1. Storing file paths as varchar strings, with the user responsible for file management.
  2. The attach and attach@store datatypes to store files, preserving the filename but not the folder structure.
  3. The blob@store datatype for storing serialized data structures in external files.
  4. The filepath@store datatype to allow organizing files and folders under the user's control.
  5. The AdaptedType datatype that allows defining custom logic to apply for reading and writing.

In particular, the SpyGlass pipeline from Loren Frank's lab relied on the filepath and AdaptedType features to implement NWB file management.
None of these methods simultaneously address the following desiderata:

  1. A logical, consistent file folder structure that's prescribed by DataJoint, based on the schema design and primary key values
  2. Keeping files in their original form and extension so that they can be read and used outside DataJoint. Files should be accessible for reading without datajoint-python or DB access, and files should maintain their native file extensions and MIME types (as opposed to being serialized into another format).
  3. Files are copied into their location and referenced in a single step as part of the insert and fetch operations.
  4. Files are deleted when the table entries referencing them are deleted
  5. Data consistency through transaction processing: inserts and deletes are executed as atomic transactions that roll back on failure, and concurrent transactions do not lead to inconsistencies.

We need a solution for file management that simultaneously addresses all of these desiderata.
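
To make desiderata 3 and 5 concrete, here is a minimal sketch of an insert that copies a file into the store and records it in the database as one step, removing the copy if the database insert fails. It is not DataJoint code: sqlite3 stands in for MySQL, and the `recording` table, the `insert_with_file` function, and the path layout are all hypothetical. It also does not handle concurrent writers, which a real solution would need to address.

```python
import shutil
import sqlite3
from pathlib import Path


def insert_with_file(conn: sqlite3.Connection, store_root: Path,
                     key: int, source: Path) -> Path:
    """Copy `source` into the store and insert a row that references it.

    Hypothetical helper: if the database insert fails, the copied file is
    removed so that the store and the table stay consistent.
    """
    # Keep the original extension so the file stays readable outside the DB.
    target = store_root / "recording" / f"{key}{source.suffix}"
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        raise FileExistsError(target)
    shutil.copy2(source, target)
    try:
        # The connection context manager commits on success and
        # rolls back if the block raises.
        with conn:
            conn.execute(
                "INSERT INTO recording (recording_id, path) VALUES (?, ?)",
                (key, str(target)),
            )
    except Exception:
        # Undo the file copy when the row could not be inserted.
        target.unlink(missing_ok=True)
        raise
    return target
```

The check for an existing target followed by a copy is not atomic, so two concurrent inserts of the same key could still race; this sketch only shows the rollback pairing between the file and the row.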

@ethho ethho added enhancement features needs-discussion Issues requiring further development review and verify impact. labels Feb 16, 2024
@ethho ethho added this to the DataJoint 1.0 milestone Feb 16, 2024
@ethho
Contributor Author

ethho commented Feb 16, 2024

Initial Work: Consistent File Folder Structure

The first step is to design a consistent file folder structure that's prescribed by DataJoint, based on the schema design and primary key values. For the final solution, we need to consider the following:

  1. The file folder structure should be logical and consistent with the schema design and primary key values.
  2. The file folder structure should be designed to optimize the file system for efficient file access and management, including file search, file retrieval, and file deletion.

To this end, we have considered several classes of algorithms for generating file paths from primary keys:

1. Flat Key Space

  • Assign to each file a UUID: uuid = hash(schema + table + md5(contents))
    • The ID is unique across schemas and tables.
    • For example: the UUID could be Asdfkjb1234
  • Store files at a path that contains this UUID. Adopt one of the following strategies:
    1. Store all files (across schemas and tables) in a single directory on the external filesystem.
    2. Organize files hierarchically by their UUID, like Asdf/kjb/1234.mp4
  • A user could use a prescribed algorithm (e.g. generate_uuid_from_pkey()) to generate this unique key reproducibly, determine the file path, and fetch the file, without needing to query the database.
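
As a rough illustration of the flat key space, the sketch below derives a deterministic ID from the schema name, table name, and an md5 of the file contents, then splits it into a short directory hierarchy. The function names, the fixed namespace, and the 2/2 shard widths are assumptions made for this example; UUIDv5 (SHA-1 based) stands in for the unspecified `hash`. Because the formula above hashes the contents, this version takes the contents as input rather than only the primary key.

```python
import hashlib
import uuid
from pathlib import PurePosixPath

# Hypothetical fixed namespace; any UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "datajoint-file-store-example")


def generate_file_uuid(schema: str, table: str, contents: bytes) -> uuid.UUID:
    """uuid = hash(schema + table + md5(contents)), using UUIDv5 as the hash."""
    content_digest = hashlib.md5(contents).hexdigest()
    return uuid.uuid5(NAMESPACE, f"{schema}/{table}/{content_digest}")


def sharded_path(file_id: uuid.UUID, extension: str,
                 widths: tuple = (2, 2)) -> PurePosixPath:
    """Split the hex ID into leading directory shards plus a file name."""
    hex_id = file_id.hex
    parts = []
    pos = 0
    for width in widths:
        parts.append(hex_id[pos:pos + width])
        pos += width
    # The remainder of the ID becomes the file name, with its native extension.
    return PurePosixPath(*parts, hex_id[pos:] + extension)
```

Calling `sharded_path(generate_file_uuid("lab", "Recording", data), ".mp4")` gives a relative path with two directory levels, which keeps any single directory from accumulating every file in the store.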

Pros

  • The file folder structure is consistent with the schema design and primary key values.

Cons

  • This design gives users little to go on when browsing the store. For example, someone inspecting the folders with ls cannot tell which schema, table, or entry a file belongs to.
  • In order to determine the UUID and the expected file path, a user would need to use a helper function like generate_uuid_from_pkey(). Assuming that this function is packaged in datajoint-python, the user would need to have datajoint-python installed to determine the file path.

2. Hierarchical Key Space

  • Store files hierarchically depending on their schema name, table name, and primary key.
  • For example, file paths could be structured like /<schema-name>/<table-name>/<primary key attr 1>/<primary key attr 2>/< ... >/<last primary key attr>.mp4
  • As a variant, we could include a hash of the file contents in the file path, creating a file path structured like /<schema-name>/<table-name>/<primary key attr 1>/<primary key attr 2>/< ... >/<last primary key attr>-<md5(file contents)>.mp4
  • What is the best hashing algorithm to use here?
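    • More research into md5 vs. other hashing algorithms is necessary.
    • rsync has used md5 for whole-file checksums since version 3.0, although newer releases can negotiate other algorithms.
    • md5 is not collision-resistant against deliberate attacks, but it is fast, and accidental collisions are extremely unlikely.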
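
The hierarchical layout can be sketched as a pure function of the schema name, table name, and primary key, so a user can compute the path without querying the database. This is only an illustration: the function name and the percent-encoding of unsafe characters are assumptions, and the primary key is passed as a dict whose insertion order is taken to be the primary key attribute order.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import quote


def hierarchical_path(schema: str, table: str, primary_key: dict,
                      extension: str,
                      contents: Optional[bytes] = None) -> PurePosixPath:
    """Build /<schema>/<table>/<pk attr 1>/.../<last pk attr>[-<md5>]<ext>."""
    # Percent-encode each value so that "/" and other unsafe characters
    # cannot introduce extra path levels.
    segments = [quote(str(value), safe="") for value in primary_key.values()]
    if not segments:
        raise ValueError("primary key must have at least one attribute")
    if any(segment in (".", "..") for segment in segments):
        raise ValueError("primary key values '.' and '..' are not supported")
    *directories, leaf = segments
    if contents is not None:
        # Optional variant: append a hash of the file contents to the name.
        leaf = f"{leaf}-{hashlib.md5(contents).hexdigest()}"
    return PurePosixPath(
        "/", quote(schema, safe=""), quote(table, safe=""),
        *directories, leaf + extension,
    )
```

For a key such as `{"subject_id": 7, "session": "2024-02-16"}` in a table `Recording` of schema `lab`, this yields `/lab/Recording/7/2024-02-16.mp4`. Note that the variant with the content hash can no longer be computed from the primary key alone, since it requires the file contents.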

cc: @dimitri-yatsenko

@dimitri-yatsenko dimitri-yatsenko self-assigned this Sep 13, 2024
@dimitri-yatsenko
Member

This will be recategorized as "File-Augmented Schema"

@dimitri-yatsenko dimitri-yatsenko changed the title from "File Management System" to "File-Augmented Schema" Sep 13, 2024