Add HERD gallery to pynwb with streaming #1781

Draft

mavaylon1 wants to merge 1 commit into dev from herd_gallery

Collaborator

mavaylon1 commented Oct 7, 2023

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

Now that HERD is in pynwb, it should have its own gallery. It should also include an example that handles streaming a dandiset in order to add metadata annotations.

Add small description of the dandiset
Replace the HerdManager with NWBFile
Update the examples to use pynwb objects to make it more practical

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Run the doc.

Checklist

Did you update CHANGELOG.md with your changes?
Have you checked our Contributing document?
Have you ensured the PR clearly describes the problem and the solution?
Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
Have you checked to ensure that there aren't other open Pull Requests for the same change?
Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.


Add HERD gallery to pynwb with streaming

52e7a28

mavaylon1 self-assigned this

mavaylon1 added category: enhancement priority: medium topic: docs labels

codecov bot commented Oct 7, 2023

Codecov Report

Merging #1781 (52e7a28) into dev (2aceed0) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev    #1781   +/-   ##
=======================================
  Coverage   91.75%   91.75%           
=======================================
  Files          27       27           
  Lines        2619     2619           
  Branches      684      684           
=======================================
  Hits         2403     2403           
  Misses        141      141           
  Partials       75       75

Flag	Coverage Δ
integration	`71.09% <ø> (ø)`
unit	`83.46% <ø> (ø)`

mavaylon1 mentioned this pull request

Update streaming.py #1780

Merged

6 tasks

oruebel reviewed

docs/gallery/general/plot_resources.py

Comment on lines +387 to +396

+                              # Add HERD for Experimenter
+                              entity = herd.get_entity(entity_id='0000-0001-6782-3819')
+                              entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
+                              herd.add_ref(file=read_file,
+                                           container=read_file,
+                                           attribute="experimenter",
+                                           key=read_file.experimenter[0],
+                                           entity_id = '0000-0001-6782-3819',
+                                           entity_uri = entity_uri
+                                           )

Contributor

oruebel Oct 8, 2023

Suggested change

-                              # Add HERD for Experimenter
-                              entity = herd.get_entity(entity_id='0000-0001-6782-3819')
-                              entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
-                              herd.add_ref(file=read_file,
-                                           container=read_file,
-                                           attribute="experimenter",
-                                           key=read_file.experimenter[0],
-                                           entity_id = '0000-0001-6782-3819',
-                                           entity_uri = entity_uri
-                                           )
+                              # Add HERD for Experimenter
+                              entity = herd.get_entity(entity_id='0000-0001-6782-3819')
+                              entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
+                              key = read_file.experimenter[0]
+                              try:
+                                  # The experimenter name is unique across files in this DANDIset so reuse the key if possible
+                                  key = herd.get_key(key)
+                              except ValueError:
+                                  pass   # If experimenter is not yet in HERD, then add a new key
+                              herd.add_ref(file=read_file,
+                                           container=read_file,
+                                           attribute="experimenter",
+                                           key=key,
+                                           entity_id = '0000-0001-6782-3819',
+                                           entity_uri = entity_uri
+                                           )

Contributor

oruebel Oct 8, 2023

In this way we can show that reusing keys across multiple files is allowed.
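
The reuse pattern in the suggestion above can be tried out without HDMF installed. The sketch below uses a hypothetical `ToyKeyRegistry` stand-in (it is not part of HDMF or PyNWB), and the file names and experimenter name are placeholders. Its `get_key` is assumed to raise `ValueError` for an unknown key name, which is the behavior the suggested `try`/`except` relies on.

```python
class ToyKeyRegistry:
    """Hypothetical stand-in for a key table; not the HDMF HERD API."""

    def __init__(self):
        self._keys = []   # key records, in creation order
        self.refs = []    # (file name, key record) pairs

    def get_key(self, name):
        # Assumed behavior: raise ValueError when no key has this name.
        for key in self._keys:
            if key["name"] == name:
                return key
        raise ValueError(f"key {name!r} not found")

    def add_ref(self, file_name, key):
        # A plain string creates a new key record; a record is reused as-is.
        if isinstance(key, str):
            key = {"name": key}
            self._keys.append(key)
        self.refs.append((file_name, key))


registry = ToyKeyRegistry()
for file_name in ["sub-01.nwb", "sub-02.nwb", "sub-03.nwb"]:  # placeholder names
    key = "Experimenter Name"  # placeholder value
    try:
        key = registry.get_key(key)  # reuse the existing key if present
    except ValueError:
        pass                         # first occurrence: add_ref creates the key
    registry.add_ref(file_name, key)

# Count distinct key records referenced across all files.
unique_keys = {id(key) for _, key in registry.refs}
print(len(registry.refs), "references share", len(unique_keys), "key record(s)")
```

Without the `try`/`except`, this toy registry would create one key record per file, which is the duplication the suggestion avoids.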

oruebel reviewed

docs/gallery/general/plot_resources.py

+                          with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
+                              read_file = io.read()
+                              # ADD HERD for Subject species
+                              entity = herd.get_entity(entity_id='NCBI_TAXON:10090')

Contributor

oruebel Oct 8, 2023

If we are already fetching the entity, then it would be nice if add_ref accepted the entity as input, rather than having to specify entity_id and entity_uri each time.
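
One way to picture this request is a thin wrapper that unpacks an entity before delegating to `add_ref`. The sketch below is hypothetical throughout: `ToyEntity`, `ToyHerd`, and `add_ref_with_entity` are illustrative names that do not exist in HDMF or PyNWB, the URI is a placeholder, and the toy `add_ref` only models the `key`, `entity_id`, and `entity_uri` arguments that appear in the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyEntity:
    """Hypothetical entity record with the two fields used in the diff."""
    entity_id: str
    entity_uri: str


class ToyHerd:
    """Hypothetical in-memory stand-in; not the HDMF HERD class."""

    def __init__(self):
        self.entities = {}  # entity_id -> ToyEntity
        self.refs = []      # (key, entity_id) pairs

    def get_entity(self, entity_id: str) -> Optional[ToyEntity]:
        return self.entities.get(entity_id)

    def add_ref(self, key: str, entity_id: str, entity_uri: Optional[str] = None) -> None:
        # Follows the diff's convention: the URI is only passed for a new entity.
        if entity_id not in self.entities:
            if entity_uri is None:
                raise ValueError("entity_uri is required for a new entity")
            self.entities[entity_id] = ToyEntity(entity_id, entity_uri)
        self.refs.append((key, entity_id))


def add_ref_with_entity(herd: ToyHerd, key: str, entity: ToyEntity) -> None:
    """Hypothetical convenience wrapper: pass the entity instead of id + uri."""
    known = herd.get_entity(entity.entity_id) is not None
    herd.add_ref(
        key=key,
        entity_id=entity.entity_id,
        entity_uri=None if known else entity.entity_uri,
    )


herd = ToyHerd()
mouse = ToyEntity("NCBI_TAXON:10090", "https://example.org/taxon/10090")  # placeholder URI
add_ref_with_entity(herd, key="Mus musculus", entity=mouse)
add_ref_with_entity(herd, key="mouse", entity=mouse)
print(len(herd.entities), "entity,", len(herd.refs), "references")
```

The caller no longer repeats the `None if entity is not None else ...` check at every call site; the wrapper owns that decision.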

oruebel reviewed

docs/gallery/general/plot_resources.py

+                                           key=read_file.experimenter[0],
+                                           entity_id = '0000-0001-6782-3819',
+                                           entity_uri = entity_uri
+                                           )

Contributor

oruebel Oct 8, 2023

Suggested change

-                                           )
+                                           )
+              # Visualize the HERD file
+              herd.to_dataframe()
+              # Visualize the individual tables. Here we can see that only 2 unique entities were created.
+              # We can also see that for experimenter we only created a single key that is reused across files.
+              #
+              # **Files**
+              herd.files.to_dataframe()
+              # **Objects**
+              herd.objects.to_dataframe()
+              # **Keys**
+              herd.keys.to_dataframe()
+              # **Object_Keys**
+              herd.object_keys.to_dataframe()
+              # **Entities**
+              herd.entities.to_dataframe()
+              # **Entities_Keys**
+              display(herd.entity_keys.to_dataframe())

oruebel reviewed

docs/gallery/general/plot_resources.py

+              ##################################################
+              # Streaming an entire Dandiset for HERD
+              # ---------------------------------

Contributor

oruebel Oct 8, 2023

Suggested change

-              # ---------------------------------
+              # --------------------------------------
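
The underline length matters because reStructuredText expects a section underline to be at least as long as its title text. A small check like the following could catch a short underline before the docs are built. `underline_is_long_enough` is a hypothetical helper, and it assumes the `# `-prefixed comment lines seen in the diff.

```python
def underline_is_long_enough(title_line: str, underline_line: str) -> bool:
    """Return True if an RST underline covers its title.

    Both arguments are comment lines from a gallery script, e.g. "# Title".
    """
    title = title_line.lstrip("#").strip()
    underline = underline_line.lstrip("#").strip()
    # An underline is a run of one repeated punctuation character.
    if not underline or len(set(underline)) != 1:
        return False
    return len(underline) >= len(title)


title = "# Streaming an entire Dandiset for HERD"
short_ok = underline_is_long_enough(title, "# " + "-" * 30)  # shorter than the title
long_ok = underline_is_long_enough(title, "# " + "-" * 40)   # covers the title
print(short_ok, long_ok)
```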

oruebel reviewed

docs/gallery/general/plot_resources.py

Comment on lines +335 to +336

		er_read = HERD.from_zip(path='./HERD.zip')
		os.remove('./HERD.zip')

Contributor

oruebel Oct 8, 2023

Suggested change

-              er_read = HERD.from_zip(path='./HERD.zip')
-              os.remove('./HERD.zip')
+              er_read = HERD.from_zip(path='HERD.zip')
+              os.remove('HERD.zip')

oruebel reviewed

docs/gallery/general/plot_resources.py

Comment on lines +220 to +225

+              er.files.to_dataframe()
+              er.objects.to_dataframe()
+              er.entities.to_dataframe()
+              er.keys.to_dataframe()
+              er.object_keys.to_dataframe()
+              er.entity_keys.to_dataframe()

Contributor

oruebel Oct 8, 2023

This will only show the last table (here entity_keys) in the output. You will either need to call display(er.files.to_dataframe()) for each of the DataFrames or separate them into individual cells.
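
A minimal sketch of the first option, assuming a Jupyter-style environment where only the final expression of a cell is rendered. The `ToyTable` class and its contents are hypothetical stand-ins for the HERD tables; in the tutorial the objects would be `er.files`, `er.objects`, and so on, and `to_dataframe()` would return a pandas DataFrame rather than a list. `display` falls back to `print` when IPython is not installed.

```python
try:
    from IPython.display import display  # renders rich output in notebooks
except ImportError:
    display = print  # plain-text fallback outside IPython


class ToyTable:
    """Hypothetical stand-in exposing the to_dataframe() method used above."""

    def __init__(self, rows):
        self._rows = rows

    def to_dataframe(self):
        return list(self._rows)  # the real tables return a pandas DataFrame


# Placeholder contents; the names mirror three of the HERD tables.
tables = {
    "files": ToyTable([{"file_object_id": "placeholder-id"}]),
    "keys": ToyTable([{"key": "placeholder-key"}]),
    "entities": ToyTable([{"entity_id": "placeholder-entity"}]),
}

shown = []
for name, table in tables.items():
    print(f"{name}:")
    display(table.to_dataframe())  # explicit call, so every table is shown
    shown.append(name)
```

Looping over the tables keeps the example short and guarantees none of them is silently dropped from the rendered output.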

oruebel reviewed

docs/gallery/general/plot_resources.py

Comment on lines +185 to +208

+              ###############################################################################
+              # Using the add_ref method without the file parameter.
+              # ------------------------------------------------------
+              # Even though :py:class:`~pynwb.resources.File` is required to create/add a new reference,
+              # the user can omit the file parameter if the :py:class:`~pynwb.resources.Object` has a file
+              # in its parent hierarchy.
+              col1 = VectorData(
+                  name='Species_Data',
+                  description='species from NCBI and Ensemble',
+                  data=['Homo sapiens', 'Ursus arctos horribilis'],
+              )
+              # Create a DynamicTable with this column and set the table parent to the file object created earlier
+              species = DynamicTable(name='species', description='My species', columns=[col1])
+              species.parent = file
+              er.add_ref(
+                  container=species,
+                  attribute='Species_Data',
+                  key='Ursus arctos horribilis',
+                  entity_id='NCBI_TAXON:116960',
+                  entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id'
+              )

Contributor

oruebel Oct 8, 2023

We shouldn't use the example with a species column in the PyNWB tutorial because it will be confusing for NWB users, since there is already a Species field in NWB.

oruebel reviewed

docs/gallery/general/plot_resources.py

Comment on lines +11 to +86

+              Introduction
+              -------------
+              The :py:class:`~pynwb.resources.HERD` class provides a way
+              to organize and map user terms from their data (keys) to multiple entities
+              from the external resources. A typical use case for external resources is to link data
+              stored in datasets or attributes to ontologies. For example, you may have a
+              dataset ``country`` storing locations. Using
+              :py:class:`~pynwb.resources.HERD` allows us to link the
+              country names stored in the dataset to an ontology of all countries, enabling
+              more rigid standardization of the data and facilitating data query and
+              introspection.
+              From a user's perspective, one can think of the
+              :py:class:`~pynwb.resources.HERD` as a simple table, in which each
+              row associates a particular ``key`` stored in a particular ``object`` (i.e., Attribute
+              or Dataset in a file) with a particular ``entity`` (i.e., a term of an online
+              resource). That is, ``(object, key)`` refer to parts inside a
+              file and ``entity`` refers to an external resource outside the file, and
+              :py:class:`~pynwb.resources.HERD` allows us to link the two. To
+              reduce data redundancy and improve data integrity,
+              :py:class:`~pynwb.resources.HERD` stores this data internally in a
+              collection of interlinked tables.
+              * :py:class:`~pynwb.resources.KeyTable` where each row describes a
+                :py:class:`~pynwb.resources.Key`
+              * :py:class:`~pynwb.resources.FileTable` where each row describes a
+                :py:class:`~pynwb.resources.File`
+              * :py:class:`~pynwb.resources.EntityTable` where each row describes an
+                :py:class:`~pynwb.resources.Entity`
+              * :py:class:`~pynwb.resources.EntityKeyTable` where each row describes an
+                :py:class:`~pynwb.resources.EntityKey`
+              * :py:class:`~pynwb.resources.ObjectTable` where each row describes an
+                :py:class:`~pynwb.resources.Object`
+              * :py:class:`~pynwb.resources.ObjectKeyTable` where each row describes an
+                :py:class:`~pynwb.resources.ObjectKey` pair identifying which keys
+                are used by which objects.
+              The :py:class:`~pynwb.resources.HERD` class then provides
+              convenience functions to simplify interaction with these tables, allowing users
+              to treat :py:class:`~pynwb.resources.HERD` as a single large table as
+              much as possible.
+              Rules to HERD
+              ---------------------------
+              When using the :py:class:`~pynwb.resources.HERD` class, there
+              are rules to how users store information in the interlinked tables.
+              1. Multiple :py:class:`~pynwb.resources.Key` objects can have the same name.
+                 They are disambiguated by the :py:class:`~pynwb.resources.Object` associated
+                 with each, meaning we may have keys with the same name in different objects, but for a particular object
+                 all keys must be unique.
+              2. In order to query specific records, the :py:class:`~pynwb.resources.HERD` class
+                 uses '(file, object_id, relative_path, field, key)' as the unique identifier.
+              3. :py:class:`~pynwb.resources.Object` can have multiple :py:class:`~pynwb.resources.Key`
+                 objects.
+              4. Multiple :py:class:`~pynwb.resources.Object` objects can use the same :py:class:`~pynwb.resources.Key`.
+              5. Do not use the private methods to add into the :py:class:`~pynwb.resources.KeyTable`,
+                 :py:class:`~pynwb.resources.FileTable`, :py:class:`~pynwb.resources.EntityTable`,
+                 :py:class:`~pynwb.resources.ObjectTable`, :py:class:`~pynwb.resources.ObjectKeyTable`,
+                 :py:class:`~pynwb.resources.EntityKeyTable` individually.
+              6. URIs are optional, but highly recommended. If not known, an empty string may be used.
+              7. An entity ID should be the unique string identifying the entity in the given resource.
+                 This may or may not include a string representing the resource and a colon.
+                 Use the format provided by the resource. For example, Identifiers.org uses the ID ``ncbigene:22353``
+                 but the NCBI Gene uses the ID ``22353`` for the same term.
+              8. In a majority of cases, :py:class:`~pynwb.resources.Object` objects will have an empty string
+                 for 'field'. The :py:class:`~pynwb.resources.HERD` class supports compound data_types.
+                 In that case, 'field' would be the field of the compound data_type that has an external reference.
+              9. In some cases, the attribute that needs an external reference is not an object with a 'data_type'.
+                 The user must then use the nearest object that has a data type to be used as the parent object. When
+                 adding an external resource for an object with a data type, users should not provide an attribute.
+                 When adding an external resource for an attribute of an object, users need to provide
+                 the name of the attribute.
+              10. The user must provide a :py:class:`~pynwb.resources.File` or an :py:class:`~pynwb.resources.Object` that
+                  has :py:class:`~pynwb.resources.File` along the parent hierarchy.
+              """
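
The identifier tuple quoted in the rules can be illustrated with a toy model. The sketch below is not how HERD stores data internally (the docstring describes interlinked tables); `RefId` and the `refs` mapping are hypothetical names, and the file and object identifiers are placeholders. It only shows why the same key name in two different objects yields two distinct records.

```python
from collections import defaultdict
from typing import NamedTuple


class RefId(NamedTuple):
    """Toy version of '(file, object_id, relative_path, field, key)'."""
    file: str
    object_id: str
    relative_path: str
    field: str
    key: str


# Toy store: each identifier maps to the set of entity IDs linked to it.
refs = defaultdict(set)

# Same key name used by two different objects in the same file.
a = RefId("file-1", "obj-A", "", "", "Homo sapiens")
b = RefId("file-1", "obj-B", "", "", "Homo sapiens")

refs[a].add("NCBI_TAXON:9606")
refs[b].add("NCBI_TAXON:9606")
refs[a].add("NCBI_TAXON:9606")  # repeating the same link adds nothing new

print(len(refs), "records;", "identifiers equal:", a == b)
```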

Contributor

oruebel Oct 8, 2023

I think this is too much low-level detail. For this information, we should point to the tutorial in HDMF. In the PyNWB tutorial we should focus on NWB-specific examples and walk through the typical use with NWB.

As such, I would suggest a structure that focuses more directly on the different use cases of NWB users:

- Annotate data while creating a new NWBFile:
  - Show an example where Species is wrapped with a TermSetWrapper
  - Show how wrapping a whole VectorData column with TermSetWrapper works
  - Show how wrapping with TermSetWrapper allows one to: a) validate data and b) automatically create a HERD file at the end
  - Show an example where add_ref is being used to add a custom reference (e.g., for experimenter)
- Show how annotating an existing NWBFile works (here I think we can use the DANDIset example)
- Show how reading and querying a HERD file works

Collaborator Author

mavaylon1 Oct 8, 2023

Yes, it's part of the PR TODO items to use PyNWB examples and not HDMF ones.

Contributor

rly commented Sep 19, 2024

@mavaylon1 reminder to update this when you have a chance

rly mentioned this pull request

Add ontology support #15

Closed

stephprince mentioned this pull request

[Documentation]: Add HERD Tutorial #1984

Open

3 tasks

rly mentioned this pull request

Add ontology support NeurodataWithoutBorders/nwb-schema#1

Closed

Labels

category: enhancement priority: medium topic: docs