Entrypoint for storing metadata that scales with number of chunks #305

TomNicholas opened this issue Aug 10, 2024 · 4 comments

@TomNicholas
Member

I think we need a single general solution for how to store metadata which scales with the number of chunks in an array.

Context

Zarr aims to be arbitrarily scalable via the assumption that, in the model of zarr.json metadata + chunks of bytes, it doesn't matter how many chunks there are in a given zarr array: the metadata for that array will stay at a constant, small, manageable size.

This assumption is broken in multiple proposed zarr enhancements that I am aware of:

  1. Chunk manifests ("Manifest storage transformer" #287)
  2. Chunk-level summary statistics (ZEP0005: https://github.com/zarr-developers/zeps/blob/main/draft/ZEP0005.md)
  3. Variable-length chunking ("Feature request - variable chunking" #138)
  4. Martin's "context" idea (#287 (comment))

In all of these cases there is some type of metadata that we want to include in the store whose size grows with the number of chunks. In (1) the proposal is to store a path, offset, and byte range for each chunk. In (2) it's to store a set of scalars per chunk (e.g. mean, median, mode, min, max). In (3) it's to store a 1D series of the lengths of each chunk along each dimension, which therefore scales with the number of chunks along one dimension rather than the total number of chunks. The "context" idea in (4) was, I believe, to allow certain zarr-array-level metadata, particularly related to encoding, to be definable on a per-chunk basis. My understanding is that that idea failed to gain traction, but it's still another example of wanting to save per-chunk information in the store. I imagine there might be more ideas with this property in the future.
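
To make the shapes of these concrete, here is a purely illustrative sketch of the kind of per-chunk data each proposal implies, for a tiny 2x2 chunk grid (the key layout and field names below are assumptions for discussion, not anything defined by the proposals linked above):

```python
# Purely illustrative; key layout and field names are assumptions, not
# anything specified by the proposals above.

# (1) Chunk manifest: one (path, offset, length) entry per chunk, so the
#     number of entries grows with the total number of chunks.
manifest = {
    "0.0": {"path": "s3://bucket/file1.nc", "offset": 100, "length": 1000},
    "0.1": {"path": "s3://bucket/file1.nc", "offset": 1100, "length": 1000},
    "1.0": {"path": "s3://bucket/file2.nc", "offset": 100, "length": 1000},
    "1.1": {"path": "s3://bucket/file2.nc", "offset": 1100, "length": 1000},
}

# (2) Chunk-level summary statistics: a few scalars per chunk, again growing
#     with the total number of chunks.
chunk_stats = {
    "0.0": {"min": 0.1, "max": 4.2, "mean": 1.3},
    "0.1": {"min": 0.0, "max": 3.9, "mean": 1.1},
    "1.0": {"min": 0.2, "max": 4.0, "mean": 1.5},
    "1.1": {"min": 0.3, "max": 4.4, "mean": 1.6},
}

# (3) Variable-length chunking: one list of chunk lengths per dimension,
#     growing with the number of chunks *along* each dimension.
chunk_lengths = {"dim_0": [10, 12], "dim_1": [10, 10]}
```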

Problem of scale

We need each of these types of metadata to also be able to become arbitrarily large. For example, I personally want to use the chunk manifest idea to create a "virtual" zarr store with arrays which each have ~500k chunks, which if stored as json would imply requiring ~0.5MB per array for the chunk manifest alone (and I have ~20 arrays like that, so 10MB of metadata already).

All the use cases above take different approaches to this same problem. In the manifest storage transformer proposal (1) there is basically a section in zarr.json which tells the reader that to get a certain piece of metadata it has to look in a separate file (the manifest.json). We then discussed whether that should actually be a parquet file or even another zarr array (with shape = chunk grid shape). (2) similarly proposes solving this using additional zarr arrays in a special _accumulation_group in the store. (3) doesn't address the problem at all, as the chunk sizes are just an array in the json metadata file (though it does mention parquet, as used by kerchunk, as a way to solve this problem).

This problem has been identified as separable and some specific solutions proposed, e.g. a zarr v3 extension for an external attributes file (#229 (comment), @rabernat) and support for non-JSON metadata and attributes (#37), but those comments don't really identify the common thread of metadata which scales with the number of chunks.

General entrypoint for metadata which scales with # of chunks

The variable-length chunking case seems particularly important: with the chunk manifest you only need to know what's in the manifest when you actually want to read bytes of data from the store, but with variable-length chunks you might want to know those lengths even when you merely list the contents of the store.

So whilst the suggested implementation for the chunk manifest within the v3 spec is to use a storage transformer, I wonder whether that approach would actually work for the other use cases above, and whether we should instead have some dedicated mechanism to use every time we want any metadata field which scales with the number of chunks. It would have a common syntax within the zarr.json file, but then we could either make a choice about the format in which to store the chunk-scaling metadata (e.g. parquet or zarr) or try to make that flexible too (allowing for pointing to a database or whatever).
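
As a straw man, such a field in zarr.json might look something like the following (shown as a Python dict; the "chunk_scaled_metadata" key and everything under it are made up for illustration, not part of any existing spec or proposal):

```python
# Straw-man sketch only: the "chunk_scaled_metadata" key and its contents are
# invented for discussion, not part of the zarr v3 spec or any proposal.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    # ... the usual zarr.json fields (shape, data_type, chunk_grid, codecs, ...)
    "chunk_scaled_metadata": [
        # each entry redirects one kind of chunk-scaling metadata to a
        # separate, possibly non-JSON, object in (or outside) the store
        {"kind": "chunk_manifest", "format": "parquet", "location": "manifest.parquet"},
        {"kind": "chunk_statistics", "format": "zarr", "location": "stats/"},
    ],
}
```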

(In)compatibility with the v3 Spec

I spoke to @d-v-b about this at length and he seemed to think that there is no easy way to do this kind of arbitrary re-direction of metadata within the confines of the current v3 spec. My understanding of his argument is that in v3 right now all the top-level metadata that might be needed at store listing time must be in a single self-contained zarr.json file per array. If there is some way to get around this within v3 I would be happy to be proven wrong here though!

Note this never came up in v2 as none of the above features were present.

Looking forward

We in the geosciences community really, really want chunk manifests in zarr, because the vast majority of our data is HDF5/netCDF4, and once we start treating zarr as an over-arching "Super-Format" we will also have a pretty strong reason to want variable-length chunks. If we cannot do this at scale within v3, then in the worst case we may end up going outside of v3 to get these features, which I think is an argument for seeing whether we can squeeze something into the spec to support this at the 11th hour.

Thoughts?

cc @jhamman @manzt @martindurant @jbms @joshmoore @AimeeB

@d-v-b
Contributor

d-v-b commented Aug 12, 2024

thanks for raising this issue @TomNicholas! This is very related to the discussion here: #72

to elaborate on this point:

I spoke to @d-v-b about this at length and he seemed to think that there is no easy way to do this kind of arbitrary re-direction of metadata within the confines of the current v3 spec. My understanding of his argument is that in v3 right now all the top-level metadata that might be needed at store listing time must be in a single self-contained zarr.json file per array. If there is some way to get around this within v3 I would be happy to be proven wrong here though!

After thinking a bit more about it, I actually see two ways to make this happen in zarr v3. The first way uses the attributes field of zarr.json, the second side-steps it.

  1. We define a special JSON value that encodes where / how the "big metadata" is stored, and use this in the attributes key of zarr.json. E.g., {"zarr_attributes": {"location": "zarr.sqlite", "format": "sqlite"}}... basically a redirect for the attributes field. Zarr consumers who understand this value can follow the redirection; zarr consumers who do not will simply see a JSON object. This approach completely subsumes the attributes key of zarr.json, so there is a risk that an implementation that doesn't understand the redirection permits mutation of the attributes field and destroys the redirect. But we do preserve the key space of zarr.json, which is nice.

  2. We add a new key to zarr.json that specifies the location / format of the big metadata store. With this approach we could keep the zarr.json:attributes key intact, and there is minimal risk that a non-complying Zarr implementation accidentally overwrites the value under the new key. We would still have to decide how the namespaces of the big metadata store and the attributes key relate to each other. And I don't know how zarr implementations today handle extra keys in zarr.json (or what the spec says about it).

I see pros and cons for either of these approaches, but I would love to see someone take either one and implement it.
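
A minimal sketch of how a consumer might handle option 1 (the "zarr_attributes" key name and the loader helper below are hypothetical, not part of any spec); option 2 would look essentially the same, except the redirect would live under a new top-level key of zarr.json rather than inside attributes:

```python
# Hypothetical sketch of option 1: following an attributes redirect.
# The "zarr_attributes" value and load_external_attributes helper are
# invented for illustration; nothing here is part of the v3 spec.
import json

def resolve_attributes(zarr_json_bytes: bytes) -> dict:
    meta = json.loads(zarr_json_bytes)
    attrs = meta.get("attributes", {})
    redirect = attrs.get("zarr_attributes")
    if isinstance(redirect, dict) and "location" in redirect:
        # A consumer that understands the redirect follows it to the big metadata...
        return load_external_attributes(redirect["location"], redirect["format"])
    # ...while one that doesn't just sees an ordinary (small) JSON object.
    return attrs

def load_external_attributes(location: str, fmt: str) -> dict:
    # Placeholder for reading the "big metadata" from e.g. sqlite or parquet.
    raise NotImplementedError(f"loading {fmt} attributes from {location}")
```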

@rabernat
Contributor

rabernat commented Aug 12, 2024

My latest thinking on both this issue and chunk manifests involves going back to a concept that I have previously been critical of: delegating more responsibility and functionality to the Store. Currently Stores can only tell us basically three pieces of information about chunks: "does the chunk exist?", "how big is it?", and "what are its bytes?". These are the kinds of information that filesystems can readily provide. But there's no reason that a more specialized store couldn't hold more information about each chunk.

Of course, this just defers the problem of how to store this metadata to the Store. But that might be okay. That storage spec can evolve separately from the Zarr spec itself. If we go this route, the main thing we would have to define at the Zarr level is the interface for getting and setting chunk-level metadata, rather than the exact format.
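
For concreteness, a rough sketch of what such a get/set interface could look like (class and method names are illustrative only, not an actual or proposed zarr-python API):

```python
# Illustrative only: not an actual or proposed zarr-python API.
from abc import ABC, abstractmethod
from collections.abc import Mapping
from typing import Any

class ChunkMetadataStore(ABC):
    """A Store that can hold per-chunk metadata alongside chunk bytes."""

    @abstractmethod
    async def get_chunk_metadata(
        self, array_path: str, chunk_coords: tuple[int, ...]
    ) -> Mapping[str, Any]:
        """Return metadata (e.g. a manifest entry or statistics) for one chunk."""

    @abstractmethod
    async def set_chunk_metadata(
        self, array_path: str, chunk_coords: tuple[int, ...], metadata: Mapping[str, Any]
    ) -> None:
        """Record metadata for one chunk; how it is persisted is up to the store."""
```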

At Earthmover, we are working on a new, open-source and open-spec Zarr store that supports transactional updates, chunk manifests, and chunk-level metadata. I believe that our design addresses all of the scalability concerns identified above. We aim to have something to share soon. Sorry for being vague; our intention is to release this in a fully baked form.

@martindurant
Member

I'll add a couple of asides:

  • var-chunks does not in general scale with the size of the data, but more slowly, because you have multiple dimensions. Arrays with millions of chunks may require at most hundreds of chunk-size values, which will easily fit in JSON.
  • per-chunk codec parameters were only one part of what contexts were about (you didn't say otherwise, just clarifying). We do have contexts in some of v2 already; they just don't pass anything useful.

In general, storing other per-chunk information in a sidecar binary file (sqlite, parquet, whatever) is fine, and is what kerchunk already does after all. If you want such information to make it to the codecs, then you also need a way to pass those things around, and that is where contexts come in.

delegating more responsibility and functionality to the Store

We don't want de/encoding to be done by the store, however. In such a model, a store and its unique internal implementations become the whole of zarr, for each store type.

@TomNicholas
Member Author

the main thing we would have to define at the Zarr level is the interface for getting and setting chunk-level metadata, rather than the exact format.

This seems reasonable to me.

That storage spec can evolve separately from the Zarr spec itself.

Where is the v3 storage spec? It just says "under construction".

I believe that our design addresses all of the scalability concerns identified above. We aim to have something to share soon. Sorry for being vague; our intention is to release this in a fully baked form.

Looking forward to discussing this once we hear the details!


var-chunks does not in general scale with the size of the data, but more slowly, because you have multiple dimensions.

It can scale more slowly, but it still scales with the data, which IMO means there is still a potential for scaling issues.

Arrays with millions of chunks may require at most hundreds of chunk-size values, which will easily fit in JSON.

May require, but only if none of the dimensions are much longer than the others.

In the pathological case of a 1D zarr store it scales just as badly as the chunk manifest does! Whilst that's the worst case, it doesn't seem unlikely that people will make quasi-1D stores to hold e.g. time series or genomics datasets. A store like that with millions of chunks may still have hundreds of thousands of chunk-length values.
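
A quick back-of-envelope comparison (illustrative shapes only) of a square chunk grid versus a quasi-1D grid with the same total number of chunks:

```python
# Illustrative chunk-grid shapes (number of chunks per dimension), not array shapes.
square_grid = (1000, 1000)      # 1,000,000 chunks -> 2,000 chunk-length values
quasi_1d_grid = (1_000_000, 1)  # 1,000,000 chunks -> 1,000,001 chunk-length values

for grid in (square_grid, quasi_1d_grid):
    n_chunks = grid[0] * grid[1]
    n_length_values = sum(grid)  # one length value per chunk along each dimension
    print(f"{grid}: {n_chunks:,} chunks, {n_length_values:,} chunk-length values")
```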
