
dtype not consistent for concat() #215

Closed
mdsumner opened this issue Aug 7, 2024 · 4 comments
@mdsumner
Contributor

mdsumner commented Aug 7, 2024

With these two NetCDF files, 'time' changed from double to float at some point in the sequence; I've chosen two files that straddle that modification date.

(Also, the main variable names changed order, but that doesn't affect anything here.)

from virtualizarr import open_virtual_dataset
import xarray

base = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
app = "#mode=bytes"

nc = ["202002/oisst-avhrr-v02r01.20200227.nc",
       "202002/oisst-avhrr-v02r01.20200228.nc"
]

## don't open as fs
u = [f'{base}/{file}{app}' for file in nc]

## fine
vd = [
    open_virtual_dataset(filepath, indexes={})
    for filepath in u
]
## but fails at concat because the dtype of time has changed
xarray.concat(vd, dim='time', coords='minimal', compat='override')

## ... 
## ValueError: dtypes not all consistent


## both work fine when opened the "normal way" with xarray
## (urls generated with "#mode=bytes" for remote open, but with fsspec here as well,
## else I couldn't get my comparison to work)
import fsspec
fs = fsspec.filesystem("https")

ufs = [fs.open(file) for file in u]
xarray.open_mfdataset(ufs)

xarray.concat([xarray.open_dataset(ufs[0]), xarray.open_dataset(ufs[1])], dim="time")

I just wonder whether it's my responsibility to check those types, or whether VirtualiZarr should have a virtualization of xarray's type standardization here (eek). Frankly, I think the files should stay consistent and be updated, which I'll pursue with the provider, but this might be a worthwhile situation to catch.

(apologies if this is noise, still finding my way around)
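For reference, a quick sketch of the kind of check I mean (using small in-memory xarray datasets rather than the OISST files): collect each variable's dtype across the datasets and flag any that disagree before attempting the concat. The `inconsistent_dtypes` helper name is mine, not anything in VirtualiZarr.

```python
import numpy as np
import xarray as xr

# Two toy datasets whose 'time' variable dtypes disagree,
# mimicking the double -> float change across the two files.
# ('time' is kept as a plain data variable here to avoid index casting.)
ds1 = xr.Dataset({"time": ("t", np.array([0.0], dtype="float64"))})
ds2 = xr.Dataset({"time": ("t", np.array([1.0], dtype="float32"))})

def inconsistent_dtypes(datasets):
    """Return {variable name: set of dtypes} for variables whose
    dtype differs between datasets."""
    seen = {}
    for ds in datasets:
        for name, var in ds.variables.items():
            seen.setdefault(name, set()).add(var.dtype)
    return {name: dtypes for name, dtypes in seen.items() if len(dtypes) > 1}

print(inconsistent_dtypes([ds1, ds2]))
```

Running this on the virtual datasets before `xarray.concat` would name the offending variables instead of the bare "dtypes not all consistent" error.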

@TomNicholas
Member

TomNicholas commented Aug 7, 2024

Thanks for raising this @mdsumner .

I just wonder if it's my responsibility to check those types

Currently it is your responsibility. VirtualiZarr is effectively complaining because you're trying to make something that represents one Zarr Array but with two different dtypes inside it, which is a violation of Zarr's data model, where one array has only one dtype.

I think the simplest way would be for you to actually load the offending variables as "loadable_variables" rather than keep them as virtual ManifestArray objects. This will make those variables behave exactly the same way they do in xarray, at the cost of writing them out as actual binary data into your final zarr store / kerchunk reference files. Often you need to do this for time variables in particular anyway due to other reasons - see https://virtualizarr.readthedocs.io/en/latest/usage.html#loading-variables. If this solution works for you I might note it in that part of the docs.
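To illustrate why loading helps, here is a toy sketch with plain in-memory data (not VirtualiZarr objects, and with a made-up `sst` variable): once variables are ordinary numpy arrays, xarray's concat falls back on NumPy's type promotion, whereas a virtual ManifestArray must commit to a single dtype up front.

```python
import numpy as np
import xarray as xr

# Two in-memory datasets whose 'sst' dtypes disagree (float32 vs float64),
# standing in for the two files. Both are plain numpy-backed, i.e. what you
# get after loading variables rather than keeping them virtual.
ds1 = xr.Dataset(
    {"sst": ("time", np.array([14.1], dtype="float32"))},
    coords={"time": [0]},
)
ds2 = xr.Dataset(
    {"sst": ("time", np.array([14.3], dtype="float64"))},
    coords={"time": [1]},
)

# NumPy's promotion rules absorb the mismatch: float32 + float64 -> float64.
combined = xr.concat([ds1, ds2], dim="time")
print(combined["sst"].dtype)  # float64
```

This is exactly the behaviour `open_dataset` / `open_mfdataset` gave you above, and what `loadable_variables` restores for the chosen variables.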

@mdsumner
Contributor Author

mdsumner commented Aug 7, 2024

Ah ok, thanks! I'll report to the provider too.

Degenerate rectilinear coords strike again. It makes sense to load the trivial coords inline. And, I realise, we could drop z, which is useless here, and possibly avoid the churn of investigating lon,lat from every file ... I'll explore 👌

@mdsumner
Contributor Author

mdsumner commented Aug 7, 2024

awesome, that works - thanks for the guidance! I'm becoming more comfy in the python space and will be exploring the docs a lot more.

from virtualizarr import open_virtual_dataset
import xarray

base = "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
app = "#mode=bytes"

nc = ["202002/oisst-avhrr-v02r01.20200227.nc",
       "202002/oisst-avhrr-v02r01.20200228.nc"
]

## don't open as fs
u = [f'{base}/{file}{app}' for file in nc]

## fine
vd = [
    open_virtual_dataset(filepath, indexes={}, loadable_variables=['lon', 'lat', 'time', 'zlev'])
    for filepath in u
]
## succeeds at concat: although the dtype of time has changed, we loaded the coord vars
## that are (logically) constant across all files rather than storing them as reference ranges
xarray.concat(vd, dim='time', coords='minimal', compat='override')

@TomNicholas
Member

I'm becoming more comfy in the python space
❤️

will be exploring the docs a lot more.

Great! This package is still very young, so let us know any way you think it can be improved.

awesome, that works

Great. I'm going to close this issue, and add a comment to #5 to indicate that a solution to that issue would have also allowed you to concatenate two arrays with different dtypes in general. (The solution I suggested here would be a bad idea for actually-large arrays.)
