ENH: add schema for "projection" entry in start document #130

Open

tacaswell
Contributor

@tacaswell tacaswell commented Dec 10, 2019

Provide a semantic mapping of the keys in the collected data to a known set of keys drawn from an externally owned vocabulary.

@danielballan
Member

This PR could use a description or a link to some meeting notes if there are any.

@danielballan
Member

On a Pilot call, @dylanmcreynolds nudged us to move forward with this. I am personally happy with it, but given the importance of this structure and the cost of any future changes, I think we should have at least one more meeting to pick apart the structure and the names and to consider alternatives.

Member

@stuartcampbell stuartcampbell left a comment

I would like to see how this maps onto a more developed schema such as the NeXus SAS definition (https://manual.nexusformat.org/classes/applications/NXsas.html).

"type": "object",
"properties" : {
"stream": {"type": "string"},
"location": {"enum" : ["event", "configuration"]},
Member

What if it's in the stop document? Or elsewhere in the EventDescriptor, such as the source or shape? Would a dotted object representation be simpler and more comprehensive?

Contributor Author

Being in the start document is a far more compelling argument.

I'm skeptical of embracing the dot-ness (we have previously agreed that dot access on dicts is not great), and the code to munge that back into something we can actually use will be annoying. On the other hand, we can write the function once and stuff it in databroker.
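
For illustration, the "write the function once" idea could look roughly like the sketch below. It is only a sketch under stated assumptions: the name `resolve_dotted` is hypothetical (nothing like it is confirmed to exist in databroker), and the descriptor is a heavily trimmed stand-in for a real EventDescriptor.

```python
from collections.abc import Mapping
from typing import Any


def resolve_dotted(doc: Mapping, path: str) -> Any:
    """Follow a dot-separated path through nested mappings.

    Hypothetical helper, not an existing databroker function.
    """
    node: Any = doc
    for key in path.split("."):
        if not isinstance(node, Mapping) or key not in node:
            raise KeyError(f"{path!r} could not be resolved at {key!r}")
        node = node[key]
    return node


# Assumption: a trimmed-down descriptor with made-up field names.
descriptor = {
    "data_keys": {"det": {"source": "PV:DET", "shape": [1024, 1024]}},
    "configuration": {"det": {"data": {"det_exposure": 0.1}}},
}

shape = resolve_dotted(descriptor, "data_keys.det.shape")
exposure = resolve_dotted(descriptor, "configuration.det.data.det_exposure")
```

One design caveat with this approach: a key that itself contains a dot cannot be addressed, so a real version would need an escaping rule or a list-of-keys form for the path.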

"required" : ["stream", "location", "field"],
"additionalProperties": false
},
"technique": {
Member

The term "technique" might be too limiting. This provides a generic mechanism for mapping any externally-defined metadata schema to the contents of the documents to come. Those schemas might be broken up by experimental technique, by downstream analysis process (applicable to more than one technique such as "scattering" and "diffraction"), by institution, by domain, etc.

Contributor Author

"data_remapping", "datamap", "DataMap", "Application", "application_definitions"?

I think we have a bunch of helper functions like list_applications(h) and iter_applications(h, name=None) that yield dicts (?) full of {base types or xarrays}?
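
A minimal sketch of what such helpers might look like follows. Everything here is an assumption made for illustration: the `"applications"` key, the entry layout, and the field names are hypothetical, the helpers take a bare start document rather than a header `h`, and they yield the raw mapping entries rather than resolved data.

```python
from typing import Any, Dict, Iterator, List, Optional

# Assumption: the start document carries a list of entries under a key such
# as "applications". Neither the key nor the entry layout is settled.
start_doc: Dict[str, Any] = {
    "uid": "example-uid",
    "applications": [
        {"name": "NXsas", "fields": {"wavelength": "beam_wavelength"}},
        {"name": "NXtomo", "fields": {"rotation_angle": "sample_theta"}},
        {"name": "NXsas", "fields": {"wavelength": "mono_wavelength"}},
    ],
}


def list_applications(start: Dict[str, Any]) -> List[str]:
    """Return the distinct application names, in order of first appearance."""
    names: List[str] = []
    for entry in start.get("applications", []):
        if entry["name"] not in names:
            names.append(entry["name"])
    return names


def iter_applications(
    start: Dict[str, Any], name: Optional[str] = None
) -> Iterator[Dict[str, Any]]:
    """Yield every entry, or only the entries matching ``name`` if given."""
    for entry in start.get("applications", []):
        if name is None or entry["name"] == name:
            yield entry


names = list_applications(start_doc)
sas_entries = list(iter_applications(start_doc, name="NXsas"))
```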

@danielballan
Member

👍 for @stuartcampbell's request to use this on a fully-worked example or three before we commit to it.

@danielballan
Member

@tacaswell on Slack:

move "techniques" out of start document and into [its] own document class

@danielballan
Member

I am in favor of proceeding on this but doing so in a separate experimental document type that can evolve quickly and make breaking changes as needed. I think @tacaswell suggested this in passing on Slack, as I tersely documented at the time, above.

@danielballan
Member

danielballan commented Jun 2, 2020

There was a call on this subject today.

Things there seemed to be strong consensus on:

  • There will be a list of dicts, with each dict describing how to project the content of the documents onto some externally-defined schema. (In some situations there may be multiple ways to map the documents' contents onto the same schema, so the same schema name might appear more than once. That is why this is a list and not a dict keyed on schema name.) At top level, the dicts will describe the schema itself: a name ("NXtomo"), a version, an external URL to a schema definition, and finally a dict embodying the layout (e.g. a NeXus definition).
  • Within this layout, the "leaves" will be either pointers into a document (with the document name) or a literal value (with "literal" or something like that).
  • This list of dicts will be included in the document stream, either within the Run Start document itself or in a temporary experimental document type (see below).
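
To make the consensus points above concrete, here is a rough sketch of one projection entry and a function that resolves its leaves. The key spellings (`"layout"`, `"literal"`, `"document"`, `"field"`), the version string, the placeholder URL, and the flat stand-in documents are all assumptions for illustration; the spelling of these keys is exactly what the list below leaves open.

```python
from typing import Any, Dict

# Assumption: one possible spelling of a projection entry. Only the ideas
# (name, version, url, layout with pointer or literal leaves) come from the
# call notes; every key name and value here is hypothetical.
projections = [
    {
        "name": "NXtomo",
        "version": "0.0-example",
        "url": "https://example.org/schemas/NXtomo",
        "layout": {
            "entry": {
                "title": {"literal": "example scan"},
                "sample": {"name": {"document": "start", "field": "sample_name"}},
                "data": {"rotation_angle": {"document": "event", "field": "theta"}},
            }
        },
    }
]

# Flat stand-ins for real documents, keyed by document name.
documents = {
    "start": {"sample_name": "dummy sample"},
    "event": {"theta": 12.5},
}


def resolve_layout(layout: Dict[str, Any], docs: Dict[str, Any]) -> Dict[str, Any]:
    """Replace each leaf with a literal value or a value looked up in a document."""
    resolved: Dict[str, Any] = {}
    for key, node in layout.items():
        if "literal" in node:
            resolved[key] = node["literal"]
        elif "document" in node:
            resolved[key] = docs[node["document"]][node["field"]]
        else:
            # Not a leaf: recurse into the nested group.
            resolved[key] = resolve_layout(node, docs)
    return resolved


resolved = resolve_layout(projections[0]["layout"], documents)
```

Note that this leaf detection would misfire if a group in the external schema were itself named "literal" or "document", which is one reason the spelling of the pointers deserves the further discussion listed below.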

Things that might need more thought or discussion to build strong consensus:

  • What to call the key: "techniques", "projections", or something else
  • How the top-level keys (the name, version, url, etc.) are spelled and which are required
  • How to structure and spell the pointers into documents
  • Whether to add the notion of "experimental document type" to event-model, bluesky, and databroker so as to prototype this idea there, or just add this to the Run Start document from the get-go.

Things that need investigation:

  • How to encode an XML attribute in JSON. @stuartcampbell reported that there are ~5 different standards for this. At some level we'd be happy to just pick one, but it's worth doing some due diligence regarding whether any of them have clear technical advantages or seem to have higher adoption than the others.
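
As a reference point for that investigation, the sketch below shows one candidate convention: attribute keys prefixed with "@" and element text stored under "#text", the style used by tools such as xmltodict. This is not a recommendation of that convention over the others, and the helper `to_element` and the sample field are hypothetical. It uses only the standard library.

```python
import xml.etree.ElementTree as ET
from typing import Any


def to_element(tag: str, node: Any) -> ET.Element:
    """Build an XML element from a dict using "@attr" and "#text" keys."""
    elem = ET.Element(tag)
    if not isinstance(node, dict):
        # A bare value becomes the element's text.
        elem.text = str(node)
        return elem
    for key, value in node.items():
        if key.startswith("@"):
            elem.set(key[1:], str(value))  # "@units" -> units="..."
        elif key == "#text":
            elem.text = str(value)
        else:
            elem.append(to_element(key, value))  # nested child element
    return elem


# Hypothetical field: a value with two attributes attached.
field = {"@units": "degree", "@type": "NX_FLOAT", "#text": 12.5}
element = to_element("rotation_angle", field)
xml_text = ET.tostring(element, encoding="unicode")
```

The cost of this convention is that "@" and "#text" become reserved in key names, which is the kind of trade-off worth comparing across the candidate standards.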

@prjemian
Contributor

prjemian commented Jun 3, 2020

First off, can someone please edit the top box here and describe clearly the intent of this PR, as previously requested? Without this focus stated, the discussion is not focused.

@dylanmcreynolds
Contributor

dylanmcreynolds commented Jun 3, 2020

I have heard that NeXus has deprecated its XML backends. This raises the question of how much XML output you plan to support. Namespacing is great, but there seems to be a huge debate on how, or whether, to represent it in JSON. It seems like you could punt on some complexity if you don't intend to output XML.

@tacaswell changed the title from `ENH: add schema for technique entry in start document` to `ENH: add schema for "projection" entry in start document` on Jun 4, 2020

@prjemian
Contributor

prjemian commented Jun 4, 2020

That's right. HDF5 is now the only supported on-disk format for NeXus data files. The decision to drop the XML backend was made sometime between 2012 and August 2014. (Can't find the specific decision in the notes yet.)
