
dtype not consistent for concat() #215

Closed
mdsumner opened this issue Aug 7, 2024 · 4 comments
@mdsumner
Contributor

mdsumner commented Aug 7, 2024

With these two NetCDF files, 'time' changed from double to float at some point in the sequence; I've chosen two files that straddle that modification date.

(Also, the main variable names changed order, but that doesn't affect anything here.)

from virtualizarr import open_virtual_dataset
import xarray

base = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
app = "#mode=bytes"

nc = ["202002/oisst-avhrr-v02r01.20200227.nc",
       "202002/oisst-avhrr-v02r01.20200228.nc"
]

## don't open as fs
u = [f'{base}/{file}{app}' for file in nc]

## fine
vd = [
    open_virtual_dataset(filepath, indexes={})
    for filepath in u
]
## but fails at concat because the dtype of time has changed
xarray.concat(vd, dim='time', coords='minimal', compat='override')

## ... 
## ValueError: dtypes not all consistent


## both work fine when opened the "normal way" with xarray
## (urls generated with "#mode=bytes" for remote open, but with fsspec here as well,
## else I couldn't get my comparison to work)
import fsspec
fs = fsspec.filesystem("https")

ufs = [fs.open(file) for file in u]
xarray.open_mfdataset(ufs)

xarray.concat([xarray.open_dataset(ufs[0]), xarray.open_dataset(ufs[1])], dim="time")

I just wonder whether it's my responsibility to check those types, or whether VirtualiZarr should have a virtualization of xarray's type standardization here (eek). Frankly, I think the files should stay consistent and be updated, which I'll pursue with the provider, but this might be a worthwhile situation to catch.

(apologies if this is noise, still finding my way around)
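For reference, a quick sketch of the kind of check I mean (using small in-memory xarray datasets rather than the OISST files): collect each variable's dtype across the datasets and flag any that disagree before attempting the concat. The `inconsistent_dtypes` helper name is mine, not anything in VirtualiZarr.

```python
import numpy as np
import xarray as xr

# Two toy datasets whose 'time' variable dtypes disagree,
# mimicking the double -> float change across the two files.
# ('time' is kept as a plain data variable here to avoid index casting.)
ds1 = xr.Dataset({"time": ("t", np.array([0.0], dtype="float64"))})
ds2 = xr.Dataset({"time": ("t", np.array([1.0], dtype="float32"))})

def inconsistent_dtypes(datasets):
    """Return {variable name: set of dtypes} for variables whose
    dtype differs between datasets."""
    seen = {}
    for ds in datasets:
        for name, var in ds.variables.items():
            seen.setdefault(name, set()).add(var.dtype)
    return {name: dtypes for name, dtypes in seen.items() if len(dtypes) > 1}

print(inconsistent_dtypes([ds1, ds2]))
```

Running this on the virtual datasets before `xarray.concat` would name the offending variables instead of the bare "dtypes not all consistent" error.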

@TomNicholas
Member

TomNicholas commented Aug 7, 2024

Thanks for raising this @mdsumner .

I just wonder if it's my responsibility to check those types

Currently it is your responsibility. VirtualiZarr is effectively complaining because you're trying to make something that represents one Zarr Array but with two different dtypes inside it, which is a violation of Zarr's data model, where one array has only one dtype.

I think the simplest way would be for you to actually load the offending variables as "loadable_variables" rather than keep them as virtual ManifestArray objects. This will make those variables behave exactly the same way they do in xarray, at the cost of writing them out as actual binary data into your final zarr store / kerchunk reference files. Often you need to do this for time variables in particular anyway due to other reasons - see https://virtualizarr.readthedocs.io/en/latest/usage.html#loading-variables. If this solution works for you I might note it in that part of the docs.
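To illustrate why loading helps, here is a toy sketch with plain in-memory data (not VirtualiZarr objects, and with a made-up `sst` variable): once variables are ordinary numpy arrays, xarray's concat falls back on NumPy's type promotion, whereas a virtual ManifestArray must commit to a single dtype up front.

```python
import numpy as np
import xarray as xr

# Two in-memory datasets whose 'sst' dtypes disagree (float32 vs float64),
# standing in for the two files. Both are plain numpy-backed, i.e. what you
# get after loading variables rather than keeping them virtual.
ds1 = xr.Dataset(
    {"sst": ("time", np.array([14.1], dtype="float32"))},
    coords={"time": [0]},
)
ds2 = xr.Dataset(
    {"sst": ("time", np.array([14.3], dtype="float64"))},
    coords={"time": [1]},
)

# NumPy's promotion rules absorb the mismatch: float32 + float64 -> float64.
combined = xr.concat([ds1, ds2], dim="time")
print(combined["sst"].dtype)  # float64
```

This is exactly the behaviour `open_dataset` / `open_mfdataset` gave you above, and what `loadable_variables` restores for the chosen variables.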

@mdsumner
Contributor Author

mdsumner commented Aug 7, 2024

Ah ok, thanks! I'll report to the provider too.

Degenerate rectilinear coords strike again. It makes sense to load the trivial coords inline. And, I realise, we could drop z, which is useless here, and possibly avoid the churn of investigating lon,lat from every file ... I'll explore 👌

@mdsumner
Contributor Author

mdsumner commented Aug 7, 2024

awesome, that works - thanks for the guidance! I'm becoming more comfy in the python space and will be exploring the docs a lot more.

from virtualizarr import open_virtual_dataset
import xarray

base = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
app = "#mode=bytes"

nc = ["202002/oisst-avhrr-v02r01.20200227.nc",
       "202002/oisst-avhrr-v02r01.20200228.nc"
]

## don't open as fs
u = [f'{base}/{file}{app}' for file in nc]

## fine
vd = [
    open_virtual_dataset(filepath, indexes={}, loadable_variables=['lon', 'lat', 'time', 'zlev'])
    for filepath in u
]
## succeeds at concat: although the dtype of time has changed, we loaded the coord vars
## that are (logically) constant across all files rather than storing them as reference ranges
xarray.concat(vd, dim='time', coords='minimal', compat='override')

@TomNicholas
Member

I'm becoming more comfy in the python space
❤️

will be exploring the docs a lot more.

Great! This package is still very young, so let us know any way you think it can be improved.

awesome, that works

Great. I'm going to close this issue, and add a comment to #5 to indicate that a solution to that issue would have also allowed you to concatenate two arrays with different dtypes in general. (The solution I suggested here would be a bad idea for actually-large arrays.)
