Merge pull request #89 from pangeo-forge/local-docs

Add documentation on how to run recipes locally

moradology authored Feb 13, 2024
2 parents 3360d5d + 09c4710 commit 0eda0ca

Showing 2 changed files with 145 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/index.md

@@ -14,6 +14,7 @@ feedstocks
```{toctree}
:maxdepth: 1
tutorial/local
tutorial/flink
```

144 changes: 144 additions & 0 deletions docs/tutorial/local.md

@@ -0,0 +1,144 @@
# Running a recipe locally

`pangeo-forge-runner` supports baking your recipes locally, primarily so you
can test the exact setup that will be used to bake your recipe on the cloud.
This allows for fast iteration on your recipe, while guaranteeing that the
behavior you see on your local system is what you will get when running
scaled out on the cloud.

## Clone a sample recipe repo to work on

This tutorial will work with any recipe, but to simplify things we will use
this pruned [GPCP Recipe](https://github.com/pforgetest/gpcp-from-gcs-feedstock/)
that pulls a subset of GPCP netCDF files from Google Cloud Storage and writes them
out as Zarr. The config we have set up for `pangeo-forge-runner` will fetch the
files from remote storage only once on your system, caching them so future runs
are faster.

This same setup would work for any recipe!

1. Clone a copy of the recipe to work on:

```bash
git clone https://github.com/pforgetest/gpcp-from-gcs-feedstock
cd gpcp-from-gcs-feedstock
```

You can make edits to this if you would like.

2. Set up a virtual environment that will contain `pangeo-forge-runner` and
any other dependencies this recipe will need. We use `venv` here,
but you may also use `conda` or any other Python package manager you
are familiar with.

```bash
python -m venv venv
source venv/bin/activate
```

3. Install `pangeo-forge-runner` into this environment.

```bash
pip install pangeo-forge-runner
```

Now you're ready to go!
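
You can sanity-check the install by asking the CLI for its help text
(exact output will vary by version):

```bash
pangeo-forge-runner --help
```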

## Setting up the config file

Construct a `local_config.py` file that describes where the output
data should go and where input files should be cached. Since we just
want to test locally, these can all point to the local filesystem!

```python
# Let's put all our data in the same directory as this config file
from pathlib import Path
import os
HERE = Path(__file__).parent

DATA_PREFIX = HERE / 'data'
os.makedirs(DATA_PREFIX, exist_ok=True)

# Target output should be partitioned by job id
c.TargetStorage.root_path = f"{DATA_PREFIX}/{{job}}"

c.InputCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.InputCacheStorage.fsspec_args = c.TargetStorage.fsspec_args

# Input data cache should *not* be partitioned by job id, as we want to fetch
# each data file from the source only once
c.InputCacheStorage.root_path = f"{DATA_PREFIX}/cache/input"

c.MetadataCacheStorage.fsspec_class = c.TargetStorage.fsspec_class
c.MetadataCacheStorage.fsspec_args = c.TargetStorage.fsspec_args
# Metadata cache should be per job, as changing kwargs can change the metadata
c.MetadataCacheStorage.root_path = f"{DATA_PREFIX}/{{job}}/cache/metadata"
```
This will create a directory called `data` next to this config file
and put all outputs and caches in there. To speed up repeated runs,
input files are cached under the `data/cache` directory.
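
After one pruned run with the job name `test1` (used below), the `data`
directory should look roughly like this (an illustrative sketch; exact
contents depend on the recipe):

```
data/
├── cache/
│   └── input/        # input files, shared across all runs
└── test1/            # everything specific to the job named "test1"
    ├── gpcp/         # the output Zarr store this recipe writes
    └── cache/
        └── metadata/ # per-job metadata cache
```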

## Run a pruned version of your recipe

You're all set to run your recipe now!
```bash
pangeo-forge-runner bake \
--config local_config.py \
--repo . \
--Bake.job_name=test1 \
--prune
```
This should run for a few seconds, and your output Zarr should now be
in `data/test1`! Let's explore the various parameters we passed.

1. `--config local_config.py` specifies the config file we want `pangeo-forge-runner`
to read. If we wanted to run this on GCP or AWS, we could have additional
`aws_config.py` or `gcp_config.py` files and just pass those instead - everything
else can remain the same. Putting most config into files also eases
collaboration - multiple people can be sure they're running the same config.
2. `--repo .` specifies that we want the current directory to be treated as a recipe
and run. This can instead point to a git repo, a Zenodo URI, etc. as needed.
3. `--Bake.job_name=test1` specifies a unique job name for this particular run.
In our `local_config.py`, we use this name to create the output directory. If
not specified, this would be autogenerated.
4. `--prune` specifies that we only want to run the recipe on about 2 input files, rather
than on everything. This makes for fast turnaround time and easy testing. A second
run with a different job name will reuse the cached inputs, as shown below.
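
Because the input cache is shared across jobs, a second pruned run under a
different job name (`test2` here is just an example) reuses the already-cached
input files and should finish faster:

```bash
pangeo-forge-runner bake \
    --config local_config.py \
    --repo . \
    --Bake.job_name=test2 \
    --prune
```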

You can test the created Zarr store by opening it with `xarray`:

```python
>>> import xarray as xr
>>> ds = xr.open_zarr("data/test1/gpcp")
>>> ds
<xarray.Dataset>
Dimensions: (latitude: 180, nv: 2, longitude: 360, time: 2)
Coordinates:
* latitude (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
* longitude (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
* time (time) datetime64[ns] 1996-10-01 1996-10-02
Dimensions without coordinates: nv
Data variables:
lat_bounds (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray>
lon_bounds (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray>
precip (time, latitude, longitude) float32 dask.array<chunksize=(1, 180, 360), meta=np.ndarray>
time_bounds (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
Attributes: (12/41)
Conventions: CF-1.6, ACDD 1.3
Metadata_Conventions: CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
acknowledgment: This project was supported in part by a grant...
cdm_data_type: Grid
cdr_program: NOAA Climate Data Record Program for satellit...
cdr_variable: precipitation
... ...
sensor: Imager, TOVS > TIROS Operational Vertical Sou...
spatial_resolution: 1 degree
standard_name_vocabulary: CF Standard Name Table (v41, 22 February 2017)
summary: Global Precipitation Climatology Project (GPC...
time_coverage_duration: P1D
title: Global Precipitation Climatatology Project (G...
>>>
```
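
For a quick programmatic check, you can confirm that only the two pruned time
steps made it into the store (the path matches the config above):

```python
import xarray as xr

ds = xr.open_zarr("data/test1/gpcp")

# --prune limited the run to two input files, so we expect two time steps
assert ds.sizes["time"] == 2
print(ds.time.values)
```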
