
Use case: UCLA-ROMS fluid HPC model output #217

Open
TomNicholas opened this issue Aug 7, 2024 · 3 comments
Labels
usage example: Real world use case examples

Comments

@TomNicholas
Member

TomNicholas commented Aug 7, 2024

At [C]Worthy, in addition to the CESM data, we are also working with an oceanographic model called UCLA-ROMS. I want to use VirtualiZarr on the netCDF files written out by the ROMS code to create a single virtual Zarr store pointing at the output of an entire model run.

This is a pretty typical example of a fluid HPC code: it's spatially parallelized (in latitude and longitude) and writes out one netCDF file per node. It also writes a new set of netCDF files every timestep (or every fixed number of timesteps). So it's a 3D concatenation problem[^1].
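To make the 3D concatenation concrete, here is a minimal sketch of the file layout: one file per node per output time, which maps naturally onto a 3D chunk grid of (time, lat-rank, lon-rank). The filename pattern and decomposition shape below are invented for illustration, not what ROMS actually writes.

```python
# Hypothetical sketch: each per-node, per-timestep netCDF file occupies one
# position in a 3D chunk grid (time, lat-rank, lon-rank).

n_steps, n_lat, n_lon = 3, 2, 4  # 3 output times on a 2x4 node grid (assumed)

# filename -> chunk-grid index; the naming scheme here is made up
files = {
    f"roms_out.{t:03d}.{j}.{i}.nc": (t, j, i)
    for t in range((n_steps))
    for j in range(n_lat)
    for i in range(n_lon)
}

# one file per node per timestep -> 3 * 2 * 4 = 24 entries in the chunk grid
assert len(files) == n_steps * n_lat * n_lon
```

Virtualizing the run then amounts to concatenating byte-range references along all three of these axes at once.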

There are a few subtleties of this particular dataset that are relevant for VirtualiZarr.

  1. Staggered grids

Staggered grids are not a problem for xarray or for Zarr, but they were a problem for kerchunk, and were one of the motivations for making this package.
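A sketch of why staggering matters: ROMS uses an Arakawa C-grid, so velocity variables live on cell faces and their arrays are one element shorter than the cell-centre ("rho") arrays along one axis. The dimension sizes below are invented; the shape relationships are the point.

```python
# On an Arakawa C-grid, u and v sit on cell faces, so no single array shape
# fits all variables; each variable needs its own shape (as Zarr allows,
# one array per variable), rather than one shape per dataset.

eta_rho, xi_rho = 6, 8  # cell-centre grid shape (assumed for illustration)

shapes = {
    "temp": (eta_rho, xi_rho),      # rho points (cell centres)
    "u":    (eta_rho, xi_rho - 1),  # u points (east-west faces)
    "v":    (eta_rho - 1, xi_rho),  # v points (north-south faces)
}

# three variables, three distinct shapes
assert len(set(shapes.values())) == 3
```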

  2. Nesting

ROMS (Regional Ocean Modelling System) does local high-resolution modelling of just part of the Earth's surface, but because of the need for boundary conditions you often have to run multiple simulations whose spatial grids are successively "nested" inside one another (until the finest grid covers the real region of interest at the desired resolution).

So a nested ROMS simulation actually comprises multiple simulations, each of which we could choose to think of as a different netCDF group / Zarr group. Then, once we have #84, we could support opening the entire set of nested results as an xarray.DataTree object using just xr.open_datatree (see https://github.com/xarray-contrib/datatree).
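As a sketch of what that group structure might look like, written as plain paths (the group names are hypothetical; DataTree just needs every group's parent to exist):

```python
# Hypothetical group layout for a doubly-nested ROMS run. Each nesting
# level becomes its own group, which is what would let xr.open_datatree
# load the whole set as one xarray.DataTree.

groups = [
    "/",              # root: run-level metadata
    "/parent",        # coarse outer domain
    "/parent/child",  # finer grid nested inside the parent domain
]

# every group's parent path must also be present, as a tree requires
for g in groups:
    parent = g.rsplit("/", 1)[0] or "/"
    assert parent in groups
```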

  3. Variable-length chunks (maybe)

This is the genuinely hard one. Currently ROMS decomposes its spatial domain along one dimension (lat and/or lon) into a pattern with lengths like [X+2, X, ..., X, X+2]. This ultimately comes from the fact that you need 2 boundary conditions to solve 2nd-order PDEs, and these boundary cells are simply tacked onto an integer decomposition across nodes.
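A sketch of how that pattern arises (the exact way ROMS distributes the remainder is an assumption here; the end-padding is the point):

```python
# Assumed sketch: split an interior of n cells across p nodes, then tack
# the 2 boundary cells onto each end node, giving the [X+2, X, ..., X, X+2]
# pattern described above.

def roms_like_decomposition(n_interior, p):
    base, rem = divmod(n_interior, p)
    lengths = [base + (1 if k < rem else 0) for k in range(p)]
    lengths[0] += 2    # boundary cells on the western/southern edge
    lengths[-1] += 2   # boundary cells on the eastern/northern edge
    return lengths

roms_like_decomposition(40, 4)  # -> [12, 10, 10, 12]
```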

The problem is that Zarr's current model only allows the chunk lengths along a dimension to be of the form [X, X, ..., X, Y], where Y <= X. So ROMS doesn't fit into Zarr's data model! (And the deviation is pretty much the smallest one imaginable that still breaks it 😞)
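The mismatch is easy to state as a predicate. This checker encodes Zarr's regular-chunk rule along one dimension: every chunk has the same length except possibly the last, which may be shorter.

```python
# Does a list of chunk lengths along one dimension fit Zarr's regular
# chunk-grid model ([X, X, ..., X, Y] with Y <= X)?

def fits_zarr_chunk_model(lengths):
    if len(lengths) <= 1:
        return True
    x = lengths[0]
    return all(l == x for l in lengths[:-1]) and lengths[-1] <= x

assert fits_zarr_chunk_model([10, 10, 10, 7])       # ragged last chunk: fine
assert not fits_zarr_chunk_model([12, 10, 10, 12])  # ROMS-style pattern fails
```

The first oversized chunk at the start is already enough to fail the check, regardless of what the last chunk looks like.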

There are two ways to solve this: change Zarr, or change ROMS.

Changing ROMS is conceptually simple - if ROMS instead uses a different default domain decomposition, one which always produces [X, X, ..., X, Y] along both lat and lon, then we're golden. That might happen, but it also might not.

Changing Zarr is possible, but much more involved. It has been suggested, and worked on (the "variable-length chunks ZEP"), but that effort has stalled. We might get funding to do this in the next year, but if we don't need it to get ROMS data virtualized then we should avoid this can of worms.

Footnotes

[^1]: It's basically an identical structure to the plasma fluid HPC code I worked with during my PhD. FYI @bendudson @johnomotani, this could be useful for you!

@TomNicholas added the usage example label Aug 7, 2024
@mdsumner
Contributor

mdsumner commented Aug 7, 2024

Very interested in this, will follow along with local experts and try out some products 👌

@TomNicholas
Member Author

What's your interest @mdsumner ?

@mdsumner
Contributor

mdsumner commented Aug 7, 2024

I help with accessibility to these formats, is all. Knowing how far this virtualization scheme can go, and where the complex edges are, is key to my work.

I'm also obsessed with redundancy in coordinates (and meshes generally); staggered grids are one of those very interesting cases along the spectrum of compact vs. materialized. Also, some models are born in actual regular map projections but stored and worked with as materialized lon/lat arrays, which is another interesting case of intermediate forms here.
