Merge pull request #89 from pangeo-forge/local-docs

Add documentation on how to run recipes locally

moradology authored Feb 13, 2024
2 parents 3360d5d + 09c4710 commit 0eda0ca

Showing 2 changed files with 145 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/index.md

@@ -14,6 +14,7 @@ feedstocks
```{toctree}
:maxdepth: 1
tutorial/local
tutorial/flink
```

144 changes: 144 additions & 0 deletions docs/tutorial/local.md

@@ -0,0 +1,144 @@
# Running a recipe locally

`pangeo-forge-runner` supports baking your recipes locally, primarily so you
can test the exact setup that will be used to bake your recipe on the cloud.
This allows for fast iteration on your recipe, while guaranteeing that the
behavior you see on your local system is what you will get when running
scaled out on the cloud.

## Clone a sample recipe repo to work on

This tutorial will work with any recipe, but to simplify things we will use
this pruned [GPCP Recipe](https://github.com/pforgetest/gpcp-from-gcs-feedstock/)
that pulls a subset of GPCP netCDF files from Google Cloud Storage and writes them
out as Zarr. The config we have set up for `pangeo-forge-runner` will fetch the
files from remote storage only once on your system, caching them so future runs
are faster.

This same setup would work for any recipe!

1. Clone a copy of the recipe to work on:

```bash
git clone https://github.com/pforgetest/gpcp-from-gcs-feedstock
cd gpcp-from-gcs-feedstock
```

You can make edits to this if you would like.

2. Set up a virtual environment that will contain `pangeo-forge-runner` and
any other dependencies this recipe will need. We use `venv` here,
but you may also use `conda` or any other Python package manager you
are familiar with.

```bash
python -m venv venv
source venv/bin/activate
```

3. Install `pangeo-forge-runner` into this environment.

```bash
pip install pangeo-forge-runner
```

Now you're ready to go!
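
You can sanity-check the install by asking the CLI for its help text
(exact output will vary by version):

```bash
pangeo-forge-runner --help
```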

## Setting up the config file

Construct a `local_config.py` file that describes where the output
data should go and where input files should be cached. Since we just
want to test locally, these can all point to the local filesystem!

```python
# Let's put all our data in the same directory as this config file
from pathlib import Path
import os
HERE = Path(__file__).parent

DATA_PREFIX = HERE / 'data'
os.makedirs(DATA_PREFIX, exist_ok=True)

# Target output should be partitioned by job id
c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job}}"

c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

# Input data cache should *not* be partitioned by job id, as we want to fetch
# each data file from the source only once
c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"

c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
# Metadata cache should be per job, as changing kwargs can change the metadata
c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job}}/cache/metadata"
```
This will create a directory called `data` next to this config file
and put all outputs and caches in there. To speed up repeated runs,
input files are cached under the `data/cache` directory.
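
After one pruned run with the job name `test1` (used below), the `data`
directory should look roughly like this (an illustrative sketch; exact
contents depend on the recipe):

```
data/
├── cache/
│   └── input/        # input files, shared across all runs
└── test1/            # everything specific to the job named "test1"
    ├── gpcp/         # the output Zarr store this recipe writes
    └── cache/
        └── metadata/ # per-job metadata cache
```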

## Run a pruned version of your recipe

You're all set to run your recipe now!
```bash
pangeo-forge-runner bake \
--config local_config.py \
--repo . \
--Bake.job_name=test1 \
--prune
```
This should run for a few seconds, and your output Zarr should now be
in `data/test1`! Let's explore the various parameters we passed.

1. `--config local_config.py` specifies the config file we want `pangeo-forge-runner`
to read. If we wanted to run this on GCP or AWS, we could have additional
`aws_config.py` or `gcp_config.py` files and just pass those instead - everything
else can remain the same. Putting most config into files also eases
collaboration - multiple people can be sure they're running the same config.
2. `--repo .` specifies that we want the current directory to be treated as a recipe
and run. This can instead point to a git repo, a Zenodo URI, etc. as needed.
3. `--Bake.job_name=test1` specifies a unique job name for this particular run.
In our `local_config.py`, we use this name to create the output directory. If
not specified, this would be autogenerated.
4. `--prune` specifies that we only want to run the recipe on about 2 input files, rather
than on everything. This makes for fast turnaround time and easy testing. A second
run with a different job name will reuse the cached inputs, as shown below.
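
Because the input cache is shared across jobs, a second pruned run under a
different job name (`test2` here is just an example) reuses the already-cached
input files and should finish faster:

```bash
pangeo-forge-runner bake \
    --config local_config.py \
    --repo . \
    --Bake.job_name=test2 \
    --prune
```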

You can test the created Zarr store by opening it with `xarray`:

```python
>>> import xarray as xr
>>> ds = xr.open_zarr("data/test1/gpcp")
>>> ds
<xarray.Dataset>
Dimensions: (latitude: 180, nv: 2, longitude: 360, time: 2)
Coordinates:
* latitude (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
* longitude (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
* time (time) datetime64[ns] 1996-10-01 1996-10-02
Dimensions without coordinates: nv
Data variables:
lat_bounds (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray>
lon_bounds (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray>
precip (time, latitude, longitude) float32 dask.array<chunksize=(1, 180, 360), meta=np.ndarray>
time_bounds (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/41)
Conventions: CF-1.6, ACDD 1.3
Metadata_Conventions: CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
acknowledgment: This project was supported in part by a grant...
cdm_data_type: Grid
cdr_program: NOAA Climate Data Record Program for satellit...
cdr_variable: precipitation
... ...
sensor: Imager, TOVS > TIROS Operational Vertical Sou...
spatial_resolution: 1 degree
standard_name_vocabulary: CF Standard Name Table (v41, 22 February 2017)
summary: Global Precipitation Climatology Project (GPC...
time_coverage_duration: P1D
title: Global Precipitation Climatatology Project (G...
>>>
```
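
For a quick programmatic check, you can confirm that only the two pruned time
steps made it into the store (the path matches the config above):

```python
import xarray as xr

ds = xr.open_zarr("data/test1/gpcp")

# --prune limited the run to two input files, so we expect two time steps
assert ds.sizes["time"] == 2
print(ds.time.values)
```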
