Add HERD gallery to pynwb with streaming #1781
base: dev
Codecov Report
@@           Coverage Diff           @@
##              dev    #1781   +/-   ##
=======================================
  Coverage   91.75%   91.75%
=======================================
  Files          27       27
  Lines        2619     2619
  Branches      684      684
=======================================
  Hits         2403     2403
  Misses        141      141
  Partials       75       75
# Add HERD for Experimenter
entity = herd.get_entity(entity_id='0000-0001-6782-3819')
entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
herd.add_ref(file=read_file,
             container=read_file,
             attribute="experimenter",
             key=read_file.experimenter[0],
             entity_id='0000-0001-6782-3819',
             entity_uri=entity_uri
             )
Suggested change:
# Add HERD for Experimenter
entity = herd.get_entity(entity_id='0000-0001-6782-3819')
entity_uri = None if entity is not None else 'https://orcid.org/0000-0001-6782-3819'
key = read_file.experimenter[0]
try:
    # The experimenter name is unique across files in this DANDIset so reuse the key if possible
    key = herd.get_key(key)
except ValueError:
    pass  # If experimenter is not yet in HERD, then add a new key
herd.add_ref(file=read_file,
             container=read_file,
             attribute="experimenter",
             key=key,
             entity_id='0000-0001-6782-3819',
             entity_uri=entity_uri
             )
In this way we can show that reusing keys across multiple files is allowed.
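To make the reuse pattern concrete, here is a minimal sketch of the "look up the key first, fall back to creating it" logic. `MiniHerd` and `add_experimenter` are hypothetical stand-ins written for this comment, not the HDMF/PyNWB classes; the only behavior borrowed from the suggestion above is that `get_key` raises `ValueError` for an unknown key. The file names and experimenter name are placeholders.

```python
# Stand-in for illustration only; this is NOT the HDMF/PyNWB HERD class.
class MiniHerd:
    """Tiny in-memory model of a key table shared across files."""

    def __init__(self):
        self.keys = []   # one entry per key created in this toy model
        self.refs = []   # (file_name, key_index) pairs

    def get_key(self, name):
        # Mirrors the behavior relied on above: unknown keys raise ValueError.
        if name not in self.keys:
            raise ValueError(f"key {name!r} not found")
        return self.keys.index(name)

    def add_ref(self, file_name, key):
        if isinstance(key, str):   # a plain string means "create a new key"
            self.keys.append(key)
            key = len(self.keys) - 1
        self.refs.append((file_name, key))


def add_experimenter(herd, file_name, experimenter):
    key = experimenter
    try:
        key = herd.get_key(experimenter)   # reuse the existing key if present
    except ValueError:
        pass                               # first occurrence: add_ref creates the key
    herd.add_ref(file_name, key)


herd = MiniHerd()
for name in ["sub-01.nwb", "sub-02.nwb", "sub-03.nwb"]:   # placeholder file names
    add_experimenter(herd, name, "Experimenter Name")     # placeholder name
```

After the loop the toy key table holds a single key that all three references point to, which is the outcome the tutorial should make visible in `herd.keys.to_dataframe()`.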
with pynwb.NWBHDF5IO(file=file, load_namespaces=True) as io:
    read_file = io.read()
    # ADD HERD for Subject species
    entity = herd.get_entity(entity_id='NCBI_TAXON:10090')
If we are already fetching the entity, then it would be nice if add_ref accepted the entity as input, rather than having to specify entity_id and entity_uri each time.
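Until something like that exists, a small wrapper could hide the repeated lookup. The sketch below is only an illustration: `add_ref_for_entity` is a hypothetical helper, and `RecordingHerd` is a stand-in that records calls so the example runs without HDMF. It assumes only the two calls already used in this PR, `get_entity(entity_id=...)` and `add_ref(..., entity_id=..., entity_uri=...)`.

```python
def add_ref_for_entity(herd, entity_id, entity_uri, **add_ref_kwargs):
    """Hypothetical helper: pass the URI only when the entity is not yet known."""
    existing = herd.get_entity(entity_id=entity_id)
    uri = None if existing is not None else entity_uri
    return herd.add_ref(entity_id=entity_id, entity_uri=uri, **add_ref_kwargs)


class RecordingHerd:
    """Stand-in that records add_ref calls; not the real HERD class."""

    def __init__(self):
        self.entities = {}
        self.calls = []

    def get_entity(self, entity_id):
        return self.entities.get(entity_id)

    def add_ref(self, **kwargs):
        self.calls.append(kwargs)
        # Remember the entity the first time it is added.
        self.entities.setdefault(kwargs["entity_id"], kwargs["entity_uri"])


herd = RecordingHerd()
for file_name in ["a.nwb", "b.nwb"]:   # placeholder file names
    add_ref_for_entity(
        herd,
        entity_id="0000-0001-6782-3819",
        entity_uri="https://orcid.org/0000-0001-6782-3819",
        file=file_name,
        attribute="experimenter",
    )
```

The first call forwards the URI and the second forwards `None`, matching the `entity_uri = None if entity is not None else ...` line in the diff.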
             key=read_file.experimenter[0],
             entity_id='0000-0001-6782-3819',
             entity_uri=entity_uri
             )
Suggested change:
)
# Visualize the HERD file
herd.to_dataframe()
# Visualize the individual tables. Here we can see that only 2 unique entities were created.
# We can also see that for experimenter we only created a single key that is reused across files.
#
# **Files**
herd.files.to_dataframe()
# **Objects**
herd.objects.to_dataframe()
# **Keys**
herd.keys.to_dataframe()
# **Object_Keys**
herd.object_keys.to_dataframe()
# **Entities**
herd.entities.to_dataframe()
# **Entities_Keys**
display(herd.entity_keys.to_dataframe())

##################################################
# Streaming an entire Dandiset for HERD
# ---------------------------------
Suggested change:
# --------------------------------------
er_read = HERD.from_zip(path='./HERD.zip')
os.remove('./HERD.zip')
Suggested change:
er_read = HERD.from_zip(path='HERD.zip')
os.remove('HERD.zip')
er.files.to_dataframe()
er.objects.to_dataframe()
er.entities.to_dataframe()
er.keys.to_dataframe()
er.object_keys.to_dataframe()
er.entity_keys.to_dataframe()
This will only show the last table (here entity_keys) in the output. You will need to either call display(er.files.to_dataframe()) for each of the DataFrames or separate them into individual cells.
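One way to avoid repeating the call six times is a small loop. This is a sketch, not a required change: `show_all_tables` is a hypothetical helper, and `FakeTable` exists only so the example runs without HDMF. The attribute names are the ones used in the diff above. In a notebook or gallery cell one would pass `show=display`.

```python
from types import SimpleNamespace

TABLE_NAMES = ("files", "objects", "entities", "keys", "object_keys", "entity_keys")


def show_all_tables(herd, show=print):
    """Show every table explicitly instead of relying on the last expression."""
    for name in TABLE_NAMES:
        show(f"**{name}**")                       # label for the table
        show(getattr(herd, name).to_dataframe())  # the table itself


class FakeTable:
    """Stand-in with a to_dataframe() method; not an HDMF table."""

    def __init__(self, name):
        self.name = name

    def to_dataframe(self):
        return f"<{self.name} frame>"


fake_herd = SimpleNamespace(**{name: FakeTable(name) for name in TABLE_NAMES})
shown = []
show_all_tables(fake_herd, show=shown.append)
```

Each table produces two outputs (label and frame), so nothing depends on which expression happens to be last in the cell.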
###############################################################################
# Using the add_ref method without the file parameter.
# ------------------------------------------------------
# Even though :py:class:`~pynwb.resources.File` is required to create/add a new reference,
# the user can omit the file parameter if the :py:class:`~pynwb.resources.Object` has a file
# in its parent hierarchy.

col1 = VectorData(
    name='Species_Data',
    description='species from NCBI and Ensembl',
    data=['Homo sapiens', 'Ursus arctos horribilis'],
)

# Create a DynamicTable with this column and set the table parent to the file object created earlier
species = DynamicTable(name='species', description='My species', columns=[col1])
species.parent = file

er.add_ref(
    container=species,
    attribute='Species_Data',
    key='Ursus arctos horribilis',
    entity_id='NCBI_TAXON:116960',
    entity_uri='https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id'
)
We shouldn't use the example with a species column in the PyNWB tutorial. It will confuse NWB users, since NWB already has a Species field.
Introduction
-------------
The :py:class:`~pynwb.resources.HERD` class provides a way
to organize and map user terms from their data (keys) to multiple entities
from external resources. A typical use case for external resources is to link data
stored in datasets or attributes to ontologies. For example, you may have a
dataset ``country`` storing locations. Using
:py:class:`~pynwb.resources.HERD` allows us to link the
country names stored in the dataset to an ontology of all countries, enabling
more rigid standardization of the data and facilitating data query and
introspection.

From a user's perspective, one can think of the
:py:class:`~pynwb.resources.HERD` as a simple table, in which each
row associates a particular ``key`` stored in a particular ``object`` (i.e., Attribute
or Dataset in a file) with a particular ``entity`` (i.e., a term of an online
resource). That is, ``(object, key)`` refer to parts inside a
file and ``entity`` refers to an external resource outside the file, and
:py:class:`~pynwb.resources.HERD` allows us to link the two. To
reduce data redundancy and improve data integrity,
:py:class:`~pynwb.resources.HERD` stores this data internally in a
collection of interlinked tables.

* :py:class:`~pynwb.resources.KeyTable` where each row describes a
  :py:class:`~pynwb.resources.Key`
* :py:class:`~pynwb.resources.FileTable` where each row describes a
  :py:class:`~pynwb.resources.File`
* :py:class:`~pynwb.resources.EntityTable` where each row describes an
  :py:class:`~pynwb.resources.Entity`
* :py:class:`~pynwb.resources.EntityKeyTable` where each row describes an
  :py:class:`~pynwb.resources.EntityKey`
* :py:class:`~pynwb.resources.ObjectTable` where each row describes an
  :py:class:`~pynwb.resources.Object`
* :py:class:`~pynwb.resources.ObjectKeyTable` where each row describes an
  :py:class:`~pynwb.resources.ObjectKey` pair identifying which keys
  are used by which objects.

The :py:class:`~pynwb.resources.HERD` class then provides
convenience functions to simplify interaction with these tables, allowing users
to treat :py:class:`~pynwb.resources.HERD` as a single large table as
much as possible.

Rules to HERD
---------------------------
When using the :py:class:`~pynwb.resources.HERD` class, there
are rules to how users store information in the interlinked tables.

1. Multiple :py:class:`~pynwb.resources.Key` objects can have the same name.
   They are disambiguated by the :py:class:`~pynwb.resources.Object` associated
   with each, meaning we may have keys with the same name in different objects, but for a particular object
   all keys must be unique.
2. In order to query specific records, the :py:class:`~pynwb.resources.HERD` class
   uses '(file, object_id, relative_path, field, key)' as the unique identifier.
3. :py:class:`~pynwb.resources.Object` can have multiple :py:class:`~pynwb.resources.Key`
   objects.
4. Multiple :py:class:`~pynwb.resources.Object` objects can use the same :py:class:`~pynwb.resources.Key`.
5. Do not use the private methods to add into the :py:class:`~pynwb.resources.KeyTable`,
   :py:class:`~pynwb.resources.FileTable`, :py:class:`~pynwb.resources.EntityTable`,
   :py:class:`~pynwb.resources.ObjectTable`, :py:class:`~pynwb.resources.ObjectKeyTable`,
   :py:class:`~pynwb.resources.EntityKeyTable` individually.
6. URIs are optional, but highly recommended. If not known, an empty string may be used.
7. An entity ID should be the unique string identifying the entity in the given resource.
   This may or may not include a string representing the resource and a colon.
   Use the format provided by the resource. For example, Identifiers.org uses the ID ``ncbigene:22353``
   but the NCBI Gene uses the ID ``22353`` for the same term.
8. In a majority of cases, :py:class:`~pynwb.resources.Object` objects will have an empty string
   for 'field'. The :py:class:`~pynwb.resources.HERD` class supports compound data_types.
   In that case, 'field' would be the field of the compound data_type that has an external reference.
9. In some cases, the attribute that needs an external reference is not an object with a 'data_type'.
   The user must then use the nearest object that has a data type to be used as the parent object. When
   adding an external resource for an object with a data type, users should not provide an attribute.
   When adding an external resource for an attribute of an object, users need to provide
   the name of the attribute.
10. The user must provide a :py:class:`~pynwb.resources.File` or an :py:class:`~pynwb.resources.Object` that
    has :py:class:`~pynwb.resources.File` along the parent hierarchy.
"""
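For readers of this thread, the "single large table" idea in the docstring can be pictured as a join over the interlinked tables. The sketch below is a conceptual model only: plain Python lists stand in for the tables, the index-pair layout of `object_keys` and `entity_keys` is an assumption made for illustration, and every value except the ORCID identifier quoted in this PR is a placeholder. It is not how HDMF implements `to_dataframe`.

```python
# Conceptual model only: plain Python lists stand in for the interlinked
# tables. All values are illustrative placeholders, not from a real file.
files = ["file-A", "file-B"]
objects = [
    {"file": 0, "object_id": "obj-1", "relative_path": "", "field": ""},
    {"file": 1, "object_id": "obj-2", "relative_path": "", "field": ""},
]
keys = ["Experimenter Name"]
entities = [
    {"entity_id": "0000-0001-6782-3819",
     "entity_uri": "https://orcid.org/0000-0001-6782-3819"},
]
object_keys = [(0, 0), (1, 0)]   # (index into objects, index into keys)
entity_keys = [(0, 0)]           # (index into entities, index into keys)


def flat_rows():
    """Join the toy tables into one row per (object, key, entity)."""
    rows = []
    for obj_idx, key_idx in object_keys:
        obj = objects[obj_idx]
        for ent_idx, ent_key_idx in entity_keys:
            if ent_key_idx != key_idx:
                continue   # this entity is linked to a different key
            rows.append({
                "file": files[obj["file"]],
                "object_id": obj["object_id"],
                "relative_path": obj["relative_path"],
                "field": obj["field"],
                "key": keys[key_idx],
                **entities[ent_idx],
            })
    return rows


rows = flat_rows()
# Rule 2: (file, object_id, relative_path, field, key) identifies a record.
identifiers = {
    (r["file"], r["object_id"], r["relative_path"], r["field"], r["key"])
    for r in rows
}
```

Two objects in two files share one key (rule 4) and one entity, yet each flat row still has a distinct identifier tuple.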
I think this is too much low-level detail. For this information we should point to the tutorial in HDMF. In the PyNWB tutorial we should focus on NWB-specific examples and walk through the typical use with NWB.
As such, I would suggest a structure that focuses more directly on the different use cases of NWB users:
- Annotate data while creating a new NWBFile:
  - Show an example where Species is wrapped with a TermSetWrapper
  - Show how wrapping a whole VectorData column with TermSetWrapper works
  - Show how wrapping with TermSetWrapper allows one to a) validate data and b) automatically create a HERD file at the end
  - Show an example where add_ref is used to add a custom reference (e.g., for experimenter)
- Show how annotating an existing NWBFile works (here I think we can use the DANDIset example)
- Show how reading and querying a HERD file works
Yes, using PyNWB examples rather than HDMF ones is part of the PR's TODO items.
@mavaylon1 reminder to update this when you have a chance
Motivation
What was the reasoning behind this change? Please explain the changes briefly.
Now that HERD is in pynwb, it should have its own gallery. It should also include an example that handles streaming a dandiset in order to add metadata annotations.
- HerdManager with NWBFile
How to test the behavior?
Run the doc.
Checklist
flake8 from the source directory.