xarray support of Datasets packages #80

Open
JohnMrziglod opened this issue Jan 2, 2018 · 0 comments

[from @gerritholl]

Personally, I would like to see something fully built around xarray.DataArray and xarray.Dataset (it is unfortunate that our class is also called Dataset). I wasn't aware of xarray when I first started the typhon.datasets package. The typhon.spareice.datasets approach uses the ArrayGroup class, which appears very similar to xarray.Dataset, and the Array class, which appears very similar to xarray.DataArray, yet neither is built around xarray.

I started to build spareice.datasets on xarray, but I ran into some trouble, so I wrote my own Array and ArrayGroup implementations. Maybe you have some suggestions on how to solve these problems. Here are my thoughts about xarray:

  • xarray.Dataset does not have native group support as we know it from netCDF4, and there is no development on that topic at the moment (afaik from this thread: Dataset groups pydata/xarray#1092). This is a huge drawback for me, since I use group support extensively, especially in the collocations tool. It makes it easier to combine data coming from different datasets. This was the main reason for implementing ArrayGroup, which allows grouping via unix-like paths.
  • I really like xarray for working with labeled data, but elementwise computations quickly become messy because of the label alignment it inherits from the underlying pandas (https://pandas.pydata.org/pandas-docs/stable/dsintro.html#vectorized-operations-and-label-alignment-with-series). Hence I preferred plain numpy arrays, which are very straightforward to handle.
  • Extending xarray should be done via accessors: the developers discourage subclassing xarray directly and suggest writing accessors (http://xarray.pydata.org/en/stable/internals.html#extending-xarray), which follow the composition-over-inheritance principle (well, in general that is a good thing :-) ). But it held me back from digging deeper into extending xarray.
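
To make the second and third points concrete, here is a minimal sketch, not code from typhon. It shows how xarray aligns two arrays on their labels before adding them, and how an accessor could offer a purely positional addition instead. The accessor name `positional` and its `add` method are hypothetical names invented for this illustration.

```python
import numpy as np
import xarray as xr


# Hypothetical accessor name "positional"; not part of xarray or typhon.
@xr.register_dataarray_accessor("positional")
class PositionalAccessor:
    """Elementwise arithmetic by position, ignoring coordinate labels."""

    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def add(self, other):
        # Work on the raw numpy values so that no label alignment happens;
        # the result keeps the coordinates of the left operand.
        return self._obj.copy(data=self._obj.values + np.asarray(other))


a = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0, 1, 2]})
b = xr.DataArray([10.0, 20.0, 30.0], dims="x", coords={"x": [1, 2, 3]})

# Default arithmetic aligns on the labels of "x" (inner join), so only the
# labels present in both arrays (1 and 2) survive.
aligned = a + b

# The accessor adds element by element, regardless of the labels.
by_position = a.positional.add(b)
```

With these inputs, `aligned` has only two elements (labels 1 and 2), while `by_position` keeps all three elements and the coordinates of `a`. Whether hiding alignment behind an accessor is a good idea is a design question; the sketch only shows that it is possible without subclassing.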

Nevertheless, xarray has great dask support, which makes it preferable for big data applications, and it seems to have a bright future. So I totally agree with you that future specific dataset implementations should support xarray objects. I am therefore trying to make ArrayGroup compatible with xarray objects. But in general, I think the actual Dataset base class should be independent of its file content, so it should not care about xarray objects, ArrayGroups or whatever. This makes it more powerful and also usable for datasets of other data types (e.g. text-based files or images).
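
A rough sketch of what "independent of its file content" could mean in practice: the base class only finds files and delegates reading to a pluggable callable. The class name `FileSet`, its methods and the two reader functions are assumptions made up for this example; they are not the typhon API.

```python
import json
import tempfile
from pathlib import Path


class FileSet:
    """Hypothetical base class: finds files and delegates reading."""

    def __init__(self, directory, pattern, reader):
        self.directory = Path(directory)
        self.pattern = pattern
        self.reader = reader  # any callable: path -> object

    def find(self):
        # Return all matching files in a stable order.
        return sorted(self.directory.glob(self.pattern))

    def read_all(self):
        # The base class never inspects what the reader returns.
        return [self.reader(path) for path in self.find()]


def read_text(path):
    return Path(path).read_text()


def read_json(path):
    return json.loads(Path(path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello")
    Path(tmp, "b.json").write_text('{"value": 1}')

    texts = FileSet(tmp, "*.txt", read_text).read_all()
    records = FileSet(tmp, "*.json", read_json).read_all()
```

A reader that returns xarray objects (for example one wrapping `xarray.open_dataset`) or an ArrayGroup could be plugged in the same way, without the base class knowing about either.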

What do you think?

@JohnMrziglod added the "discussion" label (Conversation about feature ideas) on Jan 2, 2018