Manipulation of coordinates does not materialize to kerchunk refs #281

Open
jbusecke opened this issue Oct 29, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@jbusecke
Contributor

@norlandrhagen and I just came across what we believe is a bug when I manually set variables as coordinates on a virtual dataset.

To reproduce, I take a single CMIP6 output file and virtualize it:

from virtualizarr import open_virtual_dataset

url = 's3://esgf-world/CMIP6/CMIP/CCCma/CanESM5/historical/r10i1p1f1/Omon/uo/gn/v20190429/uo_Omon_CanESM5_historical_r10i1p1f1_gn_185001-186012.nc'

vds = open_virtual_dataset(url, indexes={}, reader_options={'storage_options':{'anon':True}})
vds
[image: screenshot of the vds repr]

This works great, but some coordinates are declared as data variables (maybe this is related to #189?). Either way, if I try to correct this on the virtualized dataset, everything seems fine:

vds_modified = vds.set_coords(['latitude'])
vds_modified
[image: screenshot of the vds_modified repr]

I expected these modifications to be saved when I materialize and reload the dataset:

import xarray as xr

# Write the virtual dataset out as kerchunk references in parquet format
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened

But somehow a different variable comes back as a coordinate. Note that 'longitude' is now a coordinate all of a sudden...

[image: screenshot of the ds_reopened repr]

Note that this is my attempt to simplify a more complex multi-file situation, in which we set all variables != 'uo' as coordinates and the round-tripped xarray dataset did not reflect this at all. I am pretty confused about what is going on above, but I hope that investigating this curious issue will clear up the bug entirely.

@jbusecke
Contributor Author

jbusecke commented Oct 29, 2024

My suspicion is that some logic acts on variables that have identical dimensions; longitude and latitude do.

Testing the same as above but modifying another variable:

vds_modified = vds.set_coords(['vertices_longitude'])
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened
[image: screenshot of the ds_reopened repr]

and one more time

vds_modified = vds.set_coords(['lev_bnds'])
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened

Both give

[image: screenshot of the ds_reopened repr]

which is the same output as above. It is also the same output if I do not modify the coordinates at all!

vds_modified = vds
vds_modified.virtualize.to_kerchunk(
    'testing.parquet', format="parquet"
)
ds_reopened = xr.open_dataset(
    'testing.parquet',
    engine='kerchunk',
    backend_kwargs={
        'storage_options':{"remote_options":{'anon':True}}
    }
)
ds_reopened

So I think this might be a combination of #189 and a broken correspondence between the data_variables/coordinates order of the virtual dataset in memory and the ref on disk (or the way xarray is reading that back in).
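To make the comparisons above less dependent on eyeballing reprs, a small helper along these lines could report which coordinate names differ before and after the roundtrip. This is only a sketch: `coord_diff` is a hypothetical name, and the two stand-in objects at the bottom are made-up placeholders (not the real output from the file above) so the snippet runs without any data access. It relies only on the object exposing a `.coords` mapping whose keys are the coordinate names, as xarray datasets do.

```python
from types import SimpleNamespace


def coord_diff(before, after):
    """Return coordinate names lost and gained between two dataset-like objects.

    Both arguments only need a ``.coords`` mapping keyed by coordinate name.
    """
    before_names = set(before.coords)
    after_names = set(after.coords)
    return {
        "lost": sorted(before_names - after_names),
        "gained": sorted(after_names - before_names),
    }


# Hypothetical stand-ins for vds_modified and ds_reopened (illustrative only).
before = SimpleNamespace(coords={"time": None, "latitude": None})
after = SimpleNamespace(coords={"time": None, "longitude": None})

result = coord_diff(before, after)
print(result)
```

With the real objects, the call would be `coord_diff(vds_modified, ds_reopened)`; an empty "lost" and "gained" list would mean the coordinate assignment survived the roundtrip.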

@TomNicholas
Member

Coordinates don't exist in Zarr's data model, so when Xarray opens a zarr store (or a kerchunk references representation of one), my understanding of how it determines which zarr arrays should be set as coordinates is that it

  1. makes any 1D variable with the same name as its only dimension into a coordinate,
  2. looks at a 'coordinates' attribute in the metadata, which is deleted upon opening and re-added when saving using .to_zarr,
  3. lets CF decoding state that additional variables should be set as coordinates.

(would be great if you could confirm this @dcherian)
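As a rough illustration of rules (1) and (2) above, here is a pure-Python sketch of that inference. It is not Xarray's actual implementation: `infer_coord_names` is a hypothetical helper, the variable table is invented toy data, and reading a space-separated 'coordinates' attribute from each variable's attrs is an assumption about where that metadata lives. Rule (3), CF decoding, is left out entirely.

```python
def infer_coord_names(variables):
    """Guess which variables a reader would promote to coordinates.

    ``variables`` maps name -> {"dims": tuple of dim names, "attrs": dict}.
    """
    coords = set()
    for name, var in variables.items():
        # Rule 1: a 1D variable named after its only dimension.
        if var["dims"] == (name,):
            coords.add(name)
        # Rule 2 (assumed layout): a space-separated 'coordinates' attribute
        # naming other variables in the same store.
        listed = var.get("attrs", {}).get("coordinates", "")
        coords.update(n for n in listed.split() if n in variables)
    return coords


# Toy metadata, invented for illustration; not taken from the CMIP6 file.
variables = {
    "time": {"dims": ("time",), "attrs": {}},
    "latitude": {"dims": ("j", "i"), "attrs": {}},
    "longitude": {"dims": ("j", "i"), "attrs": {}},
    "uo": {
        "dims": ("time", "lev", "j", "i"),
        "attrs": {"coordinates": "latitude longitude"},
    },
}

coord_names = infer_coord_names(variables)
print(sorted(coord_names))
```

Under this model, a variable set as a coordinate in memory only comes back as one if the written metadata names it in a 'coordinates' attribute (or it satisfies rule 1), which is why a writer that does not update that attribute would lose the change.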

I believe right now VirtualiZarr handles (1) correctly, has a bug in (2) (#189), and doesn't even attempt (3) yet.

Ayush's PR solves only (2), but it was never finished because it lacks tests.

I tried to solve both (2) and (3) together in my PR by calling the same logic that Xarray uses when it does CF decoding. This is a bit of a rabbit hole though, and it would probably be better to just fix one thing at a time.

It would be great if one of you could pick up Ayush's (small!) PR and see if that solves your issue.

@jbusecke
Contributor Author

This is not a burning priority for the meeting as far as I can tell right now.

Def struggling to get stuff sorted for the ESGF meeting next week, but please ping me after if there is still a need!
