Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

open_virtual_dataset fails to open tiffs #291

Open
elyall opened this issue Nov 8, 2024 · 20 comments · May be fixed by #295
Open

open_virtual_dataset fails to open tiffs #291

elyall opened this issue Nov 8, 2024 · 20 comments · May be fixed by #295
Labels
bug Something isn't working Kerchunk Relating to the kerchunk library / specification itself readers

Comments

@elyall
Copy link

elyall commented Nov 8, 2024

open_virtual_dataset says it supports TIFF files however it currently fails and is lacking test coverage. The issue starts here where the kerchunk reference to a zarr array is treated as an HDF reference that has one or more groups.

Code to reproduce
from pathlib import Path

import numpy as np
from PIL import Image
from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType


def write_random_tiff(file_path):
    array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    img = Image.fromarray(array)
    img.save(file_path)


file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)

open_virtual_dataset(str(file_path), filetype=FileType.tiff)

Throws:

File virtualizarr/translators/kerchunk.py:46, in extract_group(vds_refs, group)

ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

virtualizarr = 1.1.0

@TomNicholas TomNicholas added bug Something isn't working Kerchunk Relating to the kerchunk library / specification itself readers labels Nov 8, 2024
@TomNicholas
Copy link
Member

TomNicholas commented Nov 8, 2024

Ah sorry about this @elyall !

We do have some rudimentary tests for tiff but as the tiff reader is just a thin wrapper around the kerchunk tiff reader I (naively) expected this to "just work". I agree that we should remedy that with better test coverage.

Thanks for the reproducible example, I'll have a look at what might be going on now.

@TomNicholas
Copy link
Member

So I've looked into this, and I'm a bit confused by what kerchunk returns, and am ignorant of exactly what information is stored in a tiff file. The kerchunk tiff reader basically returns this information about your example tiff file:

{'refs': {'.zattrs': '{"_ARRAY_DIMENSIONS":["Y","X"]}', '.zarray': '{\n "chunks": [\n  128,\n  128\n ],\n "compressor": null,\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  128,\n  128\n ],\n "zarr_format": 2\n}', '0.0': ['/private/var/folders/zt/b84f41811p5_vygj1gjp6qx40000gn/T/pytest-of-tom/pytest-38/test_random_tiff0/rand.tiff', 122, 16384]}}

VirtualiZarr's tiff reader then attempts to parse this and wrap it up into a nicer xarray.Dataset containing virtualizarr.ManifestArray objects. That's the same way that several of VirtualiZarr's other readers work (and are much better tested).

What confuses me here is how am I supposed to know the name of the array you created? Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention? Examples I find online of using rioxarray and rasterio are for opening GeoTIFF files - do they have more information in them about variable names? This is related to the failure you're seeing, because the virtualizarr parsing logic is expecting kerchunk to return something that has information about group names and array names.

Also is saying that the array dimensions are called "X" and "Y" correct? Or did kerchunk just make that up?

@dcherian
Copy link

dcherian commented Nov 9, 2024

Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention?

They do not. there is just a single array, and sometimes overviews.

@TomNicholas TomNicholas mentioned this issue Nov 9, 2024
5 tasks
@TomNicholas
Copy link
Member

TomNicholas commented Nov 9, 2024

I've opened #292 to add your example as a test case and start trying to fix it @elyall.


Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention?

They do not. there is just a single array, and sometimes overviews.

Interesting... So if it's just one unnamed array, then "open_dataset" doesn't even make sense - the tiff reader should really only support open_dataarray, where we return a single unnamed xr.DataArray.

EDIT: My suggestion seems consistent with what open_rasterio does, returning a single DataArray - https://knowledge.dea.ga.gov.au/notebooks/How_to_guides/Opening_GeoTIFFs_NetCDFs/#Opening-a-single-GeoTIFF

@elyall
Copy link
Author

elyall commented Nov 9, 2024

Yes a dataarray is a better fit. I just need the ManifestArray object.

@TomNicholas
Copy link
Member

Cool. I'll be able to continue this next week, but you're welcome to work on it instead of you're interested.

@mdsumner
Copy link
Contributor

mdsumner commented Nov 9, 2024

There can be multiple arrays, including overviews. In fact before GeoTIFF had explicit overviews the multiple arrays (subdatasets in GDAL parlance) were used for overviews in some software (maybe only one tho ...).

I'm interested to explore this so hoping I can help a little 🙏

@elyall
Copy link
Author

elyall commented Nov 9, 2024

@mdsumner I'm not familiar with GeoTIFF but I'm guessing it's similar to other TIFF specs I'm aware of: a multi-page TIFF with specific metadata.

Readers like tifffile can read individual pages of multi-page TIFFs so each page probably should be treated as individual chunks wrapped in their own ManifestArray. And the default keys could be the page indices. There are lots of TIFF metadata specs out there and a subset probably have something that translates well into keys, but handling all of those cases would be difficult. It would be easiest to just allow the user to inject keys as an argument during creation.

@mdsumner
Copy link
Contributor

mdsumner commented Nov 9, 2024

Does multi-page mean (multi)subdataset analogously to multi-variable in NetCDF?

I'm not familiar with the tiff spec, mostly only with GeoTIFF usage details and the abstractions

"Pages" could be like bands, or layers, like actual pages in a pdf where the shape doesn't change, nor the pixel type. Sadly format idioms leak into these abstractions, so please indulge my questions I expect you'll have the same needs for clarification 🙏

@TomNicholas
Copy link
Member

TomNicholas commented Nov 9, 2024

Thanks both of you.

Readers like tifffile can read individual pages of multi-page TIFFs so each page probably should be treated as individual chunks wrapped in their own ManifestArray.

Does multi-page mean (multi)subdataset analogously to multi-variable in NetCDF?

This was going to be my question too. In Xarray terminology what we need to know is whether each page of a multi-page TIFF maps to

  1. Different slices (potentially chunked) of a single Variable/DataArray (that's what your example implies @elyall - maybe this is all we need),
  2. Different (named) variables in the same Dataset (i.e. different arrays that can have different shapes but any common dimensions have consistent shape),
  3. Something even more messy, which might only be representable as groups of a xarray.DataTree, or a list of unnamed DataArrays.

I'm keen to try and get this working, but it would be much appreciated if someone who actually works with (Geo)TIFF files regularly was interesting in helping maintain VirtualiZarr's TIFF reader. The rest of the package is designed to be as modular as possible to minimize the effort required, and we already have some developers who have helpfully contributed and taken ownership over other specific readers (e.g. DMR++). cc'ing @scottyhq and @sharkinsspatial as people who may be interested, but any lurkers feel free to pipe up 🪈

@elyall
Copy link
Author

elyall commented Nov 9, 2024

I've only used option 1 where all pages are slices in one array and therefore have matching dimensions, but that is not always the case.

Here's a great overview of internal TIFF structure. The Pyramid TIFF is an example of arrays of different sizes (though same dimensions) that would require separate keys (in this case a DataTree would work, example here).

As we can see it's complicated since from the underlying structure there are a lot of different TIFF specs that require their own data loaders. All IFD and SubIFD ImageBytes should be individual ManifestArrays. The one consistent thing is they are sequential so as a start the keys/schema could be input as an argument. From there individual use cases could be handled as desired.

I'm happy to help and potentially maintain a limited scope reader (TIFF spec itself shouldn't change much) but not all of the many edge cases.

@mdsumner
Copy link
Contributor

I'm getting the same error still as the OP ??

import virtualizarr
virtualizarr.__version__
'1.1.1.dev12+g9e7d430'

from pathlib import Path

import numpy as np
from PIL import Image
from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType


def write_random_tiff(file_path):
    array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    img = Image.fromarray(array)
    img.save(file_path)


file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)

open_virtual_dataset(str(file_path), filetype=FileType.tiff)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/backend.py", line 190, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/readers/tiff.py", line 48, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py", line 46, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

I'm really keen to explore this, but having issues getting a working environment. will try again later :)

@elyall
Copy link
Author

elyall commented Nov 10, 2024

I got a little ahead of myself as kerchunk's tiff_to_zarr should already solve most of this problem. The heavy lifting is actually occurring in tifffile, specifically in tifffile.ZarrTiffStore. This if statement determines whether the output is a single array (DataArray) or multiple arrays (Dataset/DataTree). If multiple arrays they are sequentially indexed as you can see here. However tiff_to_zarr then plays with some metadata and potentially changes the keys (I'll need to dig into it more).

It would be great if we had an example TIFF file that has multiple arrays that someone understands well to cover the Dataset route. I found this sample GeoTIFF but it appears to be a single array. I could generate a multiscale OME-TIFF file but the keys are sequential and the data fits better in a DataTree than a Dataset.

@mdsumner
Copy link
Contributor

there's a multi tiff here:

https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif

gdalinfo /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Driver: GTiff/GeoTIFF
Files: /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Size is 20, 20
Image Structure Metadata:
  INTERLEAVE=BAND
Subdatasets:
  SUBDATASET_1_NAME=GTIFF_DIR:1:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
  SUBDATASET_1_DESC=Page 1 (20P x 20L x 1B)
  SUBDATASET_2_NAME=GTIFF_DIR:2:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
  SUBDATASET_2_DESC=Page 2 (20P x 20L x 1B)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,   20.0)
Upper Right (   20.0,    0.0)
Lower Right (   20.0,   20.0)
Center      (   10.0,   10.0)
Band 1 Block=20x20 Type=Byte, ColorInterp=Gray

@TomNicholas
Copy link
Member

Oh tifffile allows you to open tiffs as Zarr arrays?! Let's just use that! Then we can skip the kerchunk step entirely. Our tiff reader could re-use code from the (WIP) Zarr reader in #271.

@TomNicholas
Copy link
Member

TomNicholas commented Nov 12, 2024

Okay I tried an using an approach like

from tifffile import imread

store = imread(filepath, aszarr=True)

# TODO exception handling for TIFF files with multiple arrays
za = zarr.open_array(store=store, mode="r")

but ended up down a rabbit hole because a) tifffile doesn't appear to support zarr-python v3 yet and b) determining the actual byte ranges is still tricky - in kerchunk.tiff.tiff_to_zarr I think it happens in the store.write_fsspec step.

EDIT: I posted where I got to in #295 in case it's ever of any use.

I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.

@elyall
Copy link
Author

elyall commented Nov 13, 2024

a) tifffile doesn't appear to support zarr-python v3 yet

It does not at the moment.

and b) determining the actual byte ranges is still tricky - in kerchunk.tiff.tiff_to_zarr I think it happens in the store.write_fsspec step.

That's correct, and write_fsspec is a method of tifffile.ZarrTiffStore. It's designed to write to a file, but kerchunk.tiff.tiff_to_zarr writes it to an in-memory text buffer and then converts it back to json.

there's a multi tiff here:

@mdsumner That file has two arrays with the same shape and dimensions. tifffile.ZarrTiffStore.write_fsspec for better or worse treats them as a single array with two chunks, one for each image, which fits better in a xarray.DataArray. The example we need is one where a TIFF has multiple arrays of different shapes, dimensions, or coords (and probably keys for each array that hopefully tifffile can parse), which would need to be wrapped in a xarray.Dataset or xarray.DataTree. OR we need an example where tifffile.ZarrTiffStore.write_fsspec understands a TIFF spec and places references to matching arrays (same shape and dimensions) under different zarr groups.

I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.

Sounds good. Right now I can help most with developing tests but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.

@mdsumner
Copy link
Contributor

I think that's fine, multiple chunks, I think that's how they're intended to be interpreted, not as different arrays.(It's very old style)

@TomNicholas
Copy link
Member

Thanks both. I put my attempt to bypass kerchunk up in #295. I've also submitted #296 to change the documentation to no longer falsely advertise that we can read tiff files when we cannot yet.

I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.

Sounds good. Right now I can help most with developing tests but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.

It really shouldn't be particularly difficult to finish #292 - it mostly requires rejigging the virtualizarr.translators.kerchunk module to handle parsing the dict that kerchunk.tiff_to_zarr returns.

@elyall
Copy link
Author

elyall commented Nov 14, 2024

it mostly requires rejigging the virtualizarr.translators.kerchunk module to handle parsing the dict that kerchunk.tiff_to_zarr returns

I've done that here. The only real change was to have virtual_vars_from_kerchunk_refs accept arr_refs as input instead of refs. Now a single array as is produced from kerchunk.tiff_to_zarr can be passed in.

The question is, do you want to create a dataarray_from_kerchunk_refs equivalent to dataset_from_kerchunk_refs or do you want to add an optional key argument to dataset_from_kerchunk_refs that assigns the array to a key and then wraps it in a Dataset? Or another alternative I haven't thought of.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Kerchunk Relating to the kerchunk library / specification itself readers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants