A tool designed to convert IMOS NetCDF and CSV files into Cloud Optimised formats such as Zarr and Parquet.
Visit the documentation on ReadTheDocs for detailed information.
- Conversion of CSV/NetCDF to Cloud Optimised formats (Zarr/Parquet)
- YAML configuration approach, with parent and child YAML configurations when multiple datasets are very similar (e.g. Radar ACORN, GHRSST; see config). A sketch of this pattern follows the feature list
- Generic handlers for most datasets (GenericParquetHandler, GenericZarrHandler)
- Specific handlers can be written that inherit methods from a generic handler (Argo handler, Mooring Timeseries handler); see the inheritance sketch below
- Clustering capability (see the cluster-selection sketch below):
  - Local Dask cluster
  - Remote Coiled cluster
  - Driven by configuration and easily overridden
  - Zarr: gridded datasets are processed in batches and in parallel with xarray.open_mfdataset
  - Parquet: tabular files are processed in batches and in parallel as independent tasks, using futures
- Reprocessing:
  - Zarr: reprocessing is achieved by writing to specific regions with slices (sketched below); non-contiguous regions are handled
  - Parquet: reprocessing is done via pyarrow's internal overwrite mechanism, but can also be forced when an input file has changed significantly
- Chunking:
  - Parquet: to facilitate querying of geospatial data, polygon and timestamp slices are created as partitions (see the partitioning sketch below)
  - Zarr: chunking is set via the dataset configuration
- Metadata:
  - Parquet: metadata is written as a sidecar _metadata.parquet file (see the sketch below)
- Unit testing of the module: very close to integration testing; a local cluster is used to create Cloud Optimised files
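
As a rough illustration of the parent/child YAML pattern mentioned above, a shared parent file could hold common settings while a child file overrides only what differs. The keys and file names below are purely illustrative; the real schema lives in the repository's config directory.

```yaml
# parent YAML (illustrative keys only): settings shared by similar datasets
cloud_optimised_format: zarr
cluster:
  n_workers: 4
dimensions:
  time:
    name: TIME
    chunk: 1000
```

```yaml
# child YAML: points at the parent and overrides only dataset-specific values
parent_config: radar_common.yaml   # hypothetical parent file name
dataset_name: acorn_site_a         # hypothetical dataset name
dimensions:
  time:
    chunk: 500
```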
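The generic/specific handler split can be pictured as plain Python inheritance. The class and method names below are assumptions for illustration, not the package's actual API:

```python
class GenericParquetHandler:
    """Generic pipeline: open an input file, transform it, write Parquet."""

    def preprocess(self, ds):
        # transformations shared by every dataset
        return ds

    def to_cloud_optimised(self, input_path: str) -> None:
        # open -> preprocess -> write; details omitted in this sketch
        raise NotImplementedError


class ArgoHandler(GenericParquetHandler):
    """Dataset-specific handler: inherits the whole generic pipeline and
    overrides only the step that differs for Argo profiles."""

    def preprocess(self, ds):
        ds = super().preprocess(ds)
        # Argo-specific adjustments would go here
        return ds
```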
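Configuration-driven cluster selection might look like the sketch below. The config keys (`mode`, `n_workers`) are assumptions, while LocalCluster and coiled.Cluster are the libraries' real entry points:

```python
from dask.distributed import Client, LocalCluster


def make_client(cluster_config: dict) -> Client:
    """Return a Dask client backed by a local or a remote Coiled cluster.

    The config keys used here are illustrative, not the tool's real schema.
    """
    n_workers = cluster_config.get("n_workers", 4)
    if cluster_config.get("mode") == "coiled":
        import coiled  # optional dependency; requires a Coiled account

        cluster = coiled.Cluster(n_workers=n_workers)
    else:
        cluster = LocalCluster(n_workers=n_workers)
    return Client(cluster)
```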
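Region-based Zarr reprocessing relies on xarray's real `region` argument to `to_zarr`; the store path and dimension name below are illustrative. A non-contiguous update would simply be split into several contiguous writes like this one.

```python
import xarray as xr

store = "s3://example-bucket/dataset.zarr"  # illustrative path

# Open the existing store and take the slab being reprocessed
# (real code would recompute it from the updated input files).
ds = xr.open_zarr(store)
updated = ds.isel(TIME=slice(100, 200))

# Variables that do not carry the TIME dimension must be dropped
# before a region write.
updated = updated.drop_vars(
    [v for v in updated.variables if "TIME" not in updated[v].dims]
)

# Overwrite only indices 100-200 along TIME, leaving the rest untouched.
updated.to_zarr(store, region={"TIME": slice(100, 200)})
```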
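The polygon/timestamp partitioning can be sketched with pyarrow's write_to_dataset; the column names and values are illustrative, chosen to match the partition keys named above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table(
    {
        "TEMP": [12.3, 12.5, 13.1],
        "polygon": ["cell_a", "cell_a", "cell_b"],          # spatial cell id
        "timestamp": [1672531200, 1672531200, 1675209600],  # time-slice start
    }
)

# Each distinct (polygon, timestamp) pair becomes its own directory, so a
# spatial/temporal query can skip every partition outside its bounds.
pq.write_to_dataset(
    table, root_path="dataset.parquet", partition_cols=["polygon", "timestamp"]
)
```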
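Reading the sidecar metadata file could then be as simple as the snippet below (bucket path illustrative; S3 access assumes s3fs is installed):

```python
import pandas as pd

# The sidecar sits alongside the partitioned data and describes the dataset.
meta = pd.read_parquet("s3://example-bucket/dataset.parquet/_metadata.parquet")
print(meta)
```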
Requirements:
- Python >= 3.10.14
- AWS SSO to push files to S3
- An account on Coiled for remote clustering (Optional)
```bash
curl -s https://raw.githubusercontent.com/aodn/aodn_cloud_optimised/main/install.sh | bash
```
Otherwise, go to the releases page.
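If piping a script into bash is undesirable, installing straight from the repository should also work, assuming a standard Python package layout (pin a tag from the releases page for reproducibility):

```bash
pip install git+https://github.com/aodn/aodn_cloud_optimised.git
```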
A curated list of Jupyter Notebooks, ready to be loaded in Google Colab and Binder. Click on the badge above.