Add HERD gallery to pynwb with streaming #1781

Draft: mavaylon1 wants to merge 1 commit into base: dev

mavaylon1 (Collaborator) commented Oct 7, 2023

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

Now that HERD is in pynwb, it should have its own gallery. It should also include an example that handles streaming a dandiset in order to add metadata annotations.

  • Add small description of the dandiset
  • Replace the HerdManager with NWBFile
  • Update the examples to use pynwb objects to make it more practical

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Run the doc.

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

mavaylon1 self-assigned this Oct 7, 2023
mavaylon1 added labels Oct 7, 2023:

  • category: enhancement (improvements of code or code behavior)
  • priority: medium (non-critical problem and/or affecting only a small set of NWB users)
  • topic: docs (issues related to documentation)
codecov bot commented Oct 7, 2023

Codecov Report

Merging #1781 (52e7a28) into dev (2aceed0) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev    #1781   +/-   ##
=======================================
  Coverage   91.75%   91.75%           
=======================================
  Files          27       27           
  Lines        2619     2619           
  Branches      684      684           
=======================================
  Hits         2403     2403           
  Misses        141      141           
  Partials       75       75           
Flag          Coverage Δ
integration   71.09% <ø> (ø)
unit          83.46% <ø> (ø)

mavaylon1 mentioned this pull request Oct 7, 2023 (6 tasks)
Comment on lines +387 to +396
# Add HERD for Experimenter
entity = herd.get_entity(entity_id='0000-0001-6782-3819')
entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
herd.add_ref(file=read_file,
             container=read_file,
             attribute="experimenter",
             key=read_file.experimenter[0],
             entity_id = '0000-0001-6782-3819',
             entity_uri = entity_uri
             )
Contributor

Suggested change
- # Add HERD for Experimenter
- entity = herd.get_entity(entity_id='0000-0001-6782-3819')
- entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
- herd.add_ref(file=read_file,
-              container=read_file,
-              attribute="experimenter",
-              key=read_file.experimenter[0],
-              entity_id = '0000-0001-6782-3819',
-              entity_uri = entity_uri
-              )
+ # Add HERD for Experimenter
+ entity = herd.get_entity(entity_id='0000-0001-6782-3819')
+ entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
+ key = read_file.experimenter[0]
+ try:
+     # The experimenter name is unique across files in this DANDIset so reuse the key if possible
+     key = herd.get_key(key)
+ except ValueError:
+     pass  # If experimenter is not yet in HERD, then add a new key
+ herd.add_ref(file=read_file,
+              container=read_file,
+              attribute="experimenter",
+              key=key,
+              entity_id = '0000-0001-6782-3819',
+              entity_uri = entity_uri
+              )

Contributor

In this way we can show that reusing keys across multiple files is allowed.
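
The get-or-create pattern in the suggestion above can be sketched in isolation as follows. This is only an illustration: `ToyHerd`, `add_experimenter`, the file names, and the experimenter name are hypothetical stand-ins, not the HDMF `HERD` API. The one assumption carried over from the suggestion is that `get_key` raises `ValueError` when the key is not yet known.

```python
class ToyHerd:
    """Hypothetical stand-in for HERD: one list of key rows, one list of references."""

    def __init__(self):
        self.key_rows = []  # one entry per row in the (toy) key table
        self.refs = []      # (file name, key row index) pairs

    def get_key(self, key_name):
        # Assumption: an unknown key raises ValueError, as the suggested try/except implies.
        for index, name in enumerate(self.key_rows):
            if name == key_name:
                return index
        raise ValueError(f"key {key_name!r} not found")

    def add_ref(self, file, key):
        # A string creates a new key row; an index reuses an existing row.
        if isinstance(key, str):
            self.key_rows.append(key)
            key = len(self.key_rows) - 1
        self.refs.append((file, key))


def add_experimenter(herd, file, name):
    """Reference `name` from `file`, reusing the key row if one already exists."""
    key = name
    try:
        key = herd.get_key(name)
    except ValueError:
        pass  # first occurrence: add_ref will create the key row
    herd.add_ref(file=file, key=key)


herd = ToyHerd()
for file_name in ("file_a.nwb", "file_b.nwb"):  # hypothetical file names
    add_experimenter(herd, file_name, "Experimenter A")  # hypothetical name

print(herd.key_rows)  # a single key row shared by both files
print(herd.refs)
```

Without the `try`/`except`, the toy model would append a second key row for the second file, which is the duplication the suggestion avoids.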

with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
    read_file = io.read()
    # ADD HERD for Subject species
    entity = herd.get_entity(entity_id='NCBI_TAXON:10090')
Contributor

If we are already fetching the entity, then it would be nice if add_ref accepted the entity as input, rather than having to specify entity_id and entity_uri each time.
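
Until `add_ref` supports something like this, a caller-side helper could hide the repeated lookup. The sketch below is not the requested API change and does not use HDMF: `ToyHerd` and `add_ref_for_entity` are hypothetical, and the only behavior assumed from the PR code is that `get_entity` returns `None` for an unknown entity and that the URI should be passed only when the entity is new.

```python
class ToyHerd:
    """Hypothetical stand-in for HERD that only tracks entities and add_ref calls."""

    def __init__(self):
        self.entities = {}  # entity_id -> entity_uri
        self.calls = []     # (key, entity_id, entity_uri) as passed to add_ref

    def get_entity(self, entity_id):
        # Assumption: returns None for an unknown entity, as the PR code implies.
        return self.entities.get(entity_id)

    def add_ref(self, *, key, entity_id, entity_uri=None):
        if entity_id not in self.entities:
            self.entities[entity_id] = entity_uri
        self.calls.append((key, entity_id, entity_uri))


def add_ref_for_entity(herd, *, entity_id, entity_uri, **add_ref_kwargs):
    """Hypothetical helper: pass the URI only the first time an entity is seen."""
    already_known = herd.get_entity(entity_id=entity_id) is not None
    herd.add_ref(
        entity_id=entity_id,
        entity_uri=None if already_known else entity_uri,
        **add_ref_kwargs,
    )


ORCID = "0000-0001-6782-3819"
herd = ToyHerd()
for key in ("first file", "second file"):  # placeholder keys
    add_ref_for_entity(herd, key=key, entity_id=ORCID,
                       entity_uri="https://orcid.org/" + ORCID)

print(herd.calls)  # the URI is forwarded on the first call only
```

The call site then states the entity ID and URI once, and the `None if entity is not None else ...` expression no longer has to be repeated for every reference.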

             key=read_file.experimenter[0],
             entity_id = '0000-0001-6782-3819',
             entity_uri = entity_uri
             )
Contributor

Suggested change
- )
+ )
+ # Visualize the HERD file
+ herd.to_dataframe()
+ # Visualize the individual tables. Here we can see that only 2 unique entities were created.
+ # We can also see that for experimenter we only created a single key that is reused across files.
+ #
+ # **Files**
+ herd.files.to_dataframe()
+ # **Objects**
+ herd.objects.to_dataframe()
+ # **Keys**
+ herd.keys.to_dataframe()
+ # **Object_Keys**
+ herd.object_keys.to_dataframe()
+ # **Entities**
+ herd.entities.to_dataframe()
+ # **Entities_Keys**
+ display(herd.entity_keys.to_dataframe())


##################################################
# Streaming an entire Dandiset for HERD
# ---------------------------------
Contributor

Suggested change
- # ---------------------------------
+ # --------------------------------------

Comment on lines +335 to +336
er_read = HERD.from_zip(path='./HERD.zip')
os.remove('./HERD.zip')
Contributor

Suggested change
- er_read = HERD.from_zip(path='./HERD.zip')
- os.remove('./HERD.zip')
+ er_read = HERD.from_zip(path='HERD.zip')
+ os.remove('HERD.zip')

Comment on lines +220 to +225
er.files.to_dataframe()
er.objects.to_dataframe()
er.entities.to_dataframe()
er.keys.to_dataframe()
er.object_keys.to_dataframe()
er.entity_keys.to_dataframe()
Contributor

This will only show the last table (here entity_keys) in the output. You will need to either call display(er.files.to_dataframe()) for each of the DataFrames or separate them into individual cells.
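
As an illustration of why only the last table appears, here is a simplified model of notebook-style output capture: every statement runs, but only the value of the final expression is handed back for display. `run_cell` is a hypothetical helper written for this sketch, not the actual Sphinx-Gallery or Jupyter implementation; plain lists stand in for the DataFrames, and `display` is replaced by a function that records what it was given.

```python
import ast


def run_cell(source, namespace):
    """Execute `source`; return the value of its final expression, or None."""
    tree = ast.parse(source)
    last_value = None
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Split off the trailing expression so its value can be captured.
        last_expr = ast.Expression(tree.body.pop().value)
        exec(compile(tree, "<cell>", "exec"), namespace)
        last_value = eval(compile(last_expr, "<cell>", "eval"), namespace)
    else:
        exec(compile(tree, "<cell>", "exec"), namespace)
    return last_value


shown = []
namespace = {
    "tables": {"files": ["file row"], "entity_keys": ["entity-key row"]},
    "display": shown.append,  # stand-in for an explicit display call
}

# Bare expressions: the first table is evaluated and then discarded.
implicit = run_cell("tables['files']\ntables['entity_keys']", namespace)

# Explicit display calls: both tables are emitted.
explicit = run_cell(
    "display(tables['files'])\ndisplay(tables['entity_keys'])", namespace
)

print(implicit)
print(shown)
```

Under this model, either remedy from the comment works: explicit `display` calls emit every table, and splitting the code into separate cells makes each table the final expression of its own cell.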

Comment on lines +185 to +208
###############################################################################
# Using the add_ref method without the file parameter.
# ------------------------------------------------------
# Even though :py:class:`~pynwb.resources.File` is required to create/add a new reference,
# the user can omit the file parameter if the :py:class:`~pynwb.resources.Object` has a file
# in its parent hierarchy.

col1 = VectorData(
    name='Species_Data',
    description='species from NCBI and Ensembl',
    data=['Homo sapiens', 'Ursus arctos horribilis'],
)

# Create a DynamicTable with this column and set the table parent to the file object created earlier
species = DynamicTable(name='species', description='My species', columns=[col1])
species.parent = file

er.add_ref(
    container=species,
    attribute='Species_Data',
    key='Ursus arctos horribilis',
    entity_id='NCBI_TAXON:116960',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id'
)
Contributor

We shouldn't use the species-column example in the PyNWB tutorial: NWB already has a Species field, so the example will confuse NWB users.

Comment on lines +11 to +86
Introduction
-------------
The :py:class:`~pynwb.resources.HERD` class provides a way
to organize and map user terms from their data (keys) to multiple entities
from the external resources. A typical use case for external resources is to link data
stored in datasets or attributes to ontologies. For example, you may have a
dataset ``country`` storing locations. Using
:py:class:`~pynwb.resources.HERD` allows us to link the
country names stored in the dataset to an ontology of all countries, enabling
more rigid standardization of the data and facilitating data query and
introspection.

From a user's perspective, one can think of the
:py:class:`~pynwb.resources.HERD` as a simple table, in which each
row associates a particular ``key`` stored in a particular ``object`` (i.e., Attribute
or Dataset in a file) with a particular ``entity`` (i.e., a term of an online
resource). That is, ``(object, key)`` refer to parts inside a
file and ``entity`` refers to an external resource outside the file, and
:py:class:`~pynwb.resources.HERD` allows us to link the two. To
reduce data redundancy and improve data integrity,
:py:class:`~pynwb.resources.HERD` stores this data internally in a
collection of interlinked tables.

* :py:class:`~pynwb.resources.KeyTable` where each row describes a
  :py:class:`~pynwb.resources.Key`
* :py:class:`~pynwb.resources.FileTable` where each row describes a
  :py:class:`~pynwb.resources.File`
* :py:class:`~pynwb.resources.EntityTable` where each row describes an
  :py:class:`~pynwb.resources.Entity`
* :py:class:`~pynwb.resources.EntityKeyTable` where each row describes an
  :py:class:`~pynwb.resources.EntityKey`
* :py:class:`~pynwb.resources.ObjectTable` where each row describes an
  :py:class:`~pynwb.resources.Object`
* :py:class:`~pynwb.resources.ObjectKeyTable` where each row describes an
  :py:class:`~pynwb.resources.ObjectKey` pair identifying which keys
  are used by which objects.

The :py:class:`~pynwb.resources.HERD` class then provides
convenience functions to simplify interaction with these tables, allowing users
to treat :py:class:`~pynwb.resources.HERD` as a single large table as
much as possible.

Rules to HERD
---------------------------
When using the :py:class:`~pynwb.resources.HERD` class, there
are rules to how users store information in the interlinked tables.

1. Multiple :py:class:`~pynwb.resources.Key` objects can have the same name.
   They are disambiguated by the :py:class:`~pynwb.resources.Object` associated
   with each, meaning we may have keys with the same name in different objects, but for a particular object
   all keys must be unique.
2. In order to query specific records, the :py:class:`~pynwb.resources.HERD` class
   uses '(file, object_id, relative_path, field, key)' as the unique identifier.
3. :py:class:`~pynwb.resources.Object` can have multiple :py:class:`~pynwb.resources.Key`
   objects.
4. Multiple :py:class:`~pynwb.resources.Object` objects can use the same :py:class:`~pynwb.resources.Key`.
5. Do not use the private methods to add into the :py:class:`~pynwb.resources.KeyTable`,
   :py:class:`~pynwb.resources.FileTable`, :py:class:`~pynwb.resources.EntityTable`,
   :py:class:`~pynwb.resources.ObjectTable`, :py:class:`~pynwb.resources.ObjectKeyTable`,
   :py:class:`~pynwb.resources.EntityKeyTable` individually.
6. URIs are optional, but highly recommended. If not known, an empty string may be used.
7. An entity ID should be the unique string identifying the entity in the given resource.
   This may or may not include a string representing the resource and a colon.
   Use the format provided by the resource. For example, Identifiers.org uses the ID ``ncbigene:22353``
   but the NCBI Gene uses the ID ``22353`` for the same term.
8. In a majority of cases, :py:class:`~pynwb.resources.Object` objects will have an empty string
   for 'field'. The :py:class:`~pynwb.resources.HERD` class supports compound data_types.
   In that case, 'field' would be the field of the compound data_type that has an external reference.
9. In some cases, the attribute that needs an external reference is not an object with a 'data_type'.
   The user must then use the nearest object that has a data type to be used as the parent object. When
   adding an external resource for an object with a data type, users should not provide an attribute.
   When adding an external resource for an attribute of an object, users need to provide
   the name of the attribute.
10. The user must provide a :py:class:`~pynwb.resources.File` or an :py:class:`~pynwb.resources.Object` that
    has :py:class:`~pynwb.resources.File` along the parent hierarchy.
"""
Contributor

I think this is too much low-level detail. For this information we should point to the tutorial in HDMF. In the PyNWB tutorial we should focus on NWB-specific examples and walk through the typical use with NWB.

As such, I would suggest a structure that focuses more directly on the different use cases of NWB users:

  1. Annotate data while creating a new NWBFile:
    • Show an example where Species is wrapped with a TermSetWrapper
    • Show how wrapping a whole VectorData column with TermSetWrapper works
    • Show how wrapping with TermSetWrapper allows one to: a) validate data and b) automatically create a HERD file at the end
    • Show an example where add_ref is used to add a custom reference (e.g., for experimenter)
  2. Show how annotating an existing NWBFile works (here I think we can use the DANDIset example)
  3. Show how reading and querying a HERD file works

Collaborator Author

Yes, using PyNWB examples instead of HDMF ones is part of the PR's TODO items.

rly (Contributor) commented Sep 19, 2024

@mavaylon1 reminder to update this when you have a chance

Labels: category: enhancement, priority: medium, topic: docs
Projects
None yet
3 participants