Add Config to Pass Around Injected Args #652
base: main
Conversation
This will only work on pangeo_forge_recipes, and not on other libraries that may be built on it (like https://github.com/yuvipanda/pangeo-forge-cmr). I also think either there should be a config object or we do dynamic injection; doing both is confusing. My preferred solution is still to just support defining injections in the […]. Also, I can't access the slack link :)
I know this 😆 IMHO the less injection code the better
Good point 🥳 I guess I don't understand what's going on here. This is a plugin of sorts? Why wouldn't this just be another recipe, since […]?
I also cannot see the slack link; can you post the example linked there publicly somewhere? I can comment more directly on the PR once I see the problem it is trying to solve.
@yuvipanda and @cisaacstern: I updated the description above with a problem statement; please refer there.
@ranchodeluxe thanks for the clarification. I understand the use case now, and think I am in favor. Curious to get Yuvi's take as he is more familiar with the design considerations here. As a bit of a tangent, in the linked example, I don't think I understand what the WriteReferences transform does that our built-in WriteCombinedReference transform does not?
yeah, I'm fine doing what folks think is best for this pattern
Yeah, where I linked is not clear. He uses that reference to pull out the […]
Right, got that. My question is: how is that custom transform any different from our built-in? (See pangeo_forge_recipes/transforms.py, lines 449 to 451 at 5367f63.)
And I think the answer is just that it passes through the written dataset? Which, if true, means that the custom transform is just our built-in plus #636? If so, then can this use case be addressed by just finishing #636 and using the built-in?
yeah, I think I'm following and agree
Cool. 😎 So actually this discussion has me leaning slightly against merging this PR now? Because I think we want to discourage custom, un-released transforms which are sufficiently generalizable as to require configurable injections? That is, if a transform is so foundational to the pipeline that it requires knowing the cache or target location, then maybe it's a code smell if it's not released somewhere? (Because such a significant piece of logic is presumably re-usable by other community members?) Not to discourage innovation/creativity, but I tend to think of custom transforms as best for data munging/preprocessing/etc., and caching/writing as generalizable operations which we should be collaborating on via some released package? Thoughts?
@cisaacstern We can talk about this in slack, but I wanted to consolidate the metadata in the references before writing them out.
I think it makes sense to try to upstream @sbquinlan's custom kerchunk writer here eventually, but regardless, I take the point that it's reasonable for recipe developers to access the injected config! (Including for cases where that's useful for prototyping things that will eventually be upstreamed! 😄 )
In terms of the PR itself, I think I prefer the simpler "Config" (instead of "RecipesConfig"). Although this package is "pangeo_forge_recipes", the word "recipes" is not generally used in the codebase itself currently, so the more generic "Config" makes sense to me.
pangeo_forge_recipes/config.py

```python
target_root: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = field(
    default_factory=RequiredAtRuntimeDefault
)
cache: Optional[Union[str, CacheFSSpecTarget]] = ""
```
While we're talking about naming, can you name this `input_cache`, and maybe add the `metadata_cache` field here as well?
> `metadata_cache`

will add

> `input_cache`

as long as folks are comfortable with this, but it's not backward compatible (or I haven't thought about how to achieve that yet)
I'm being silly. `metadata_cache` isn't a thing at all, and I think `cache` is covering both from what I see. Also, `MetadataCacheStore` in runner doesn't have a matching storage class in the recipes lib that I'm seeing, so this isn't being used 🤔 Let me look at this more later today.
MetadataCacheStorage in runner exists for backward compatibility with recipes < 0.10.0. You are correct that it does not exist in recipes >= 0.10.0, and fwiw this is not because InputCacheStorage is playing two roles, but rather because metadata caching is not part of the >= 0.10.0 paradigm.
Re: naming, there are (unfortunately) two naming conventions for (roughly) the same things here:

- names in the context of injections
- names in the context of kwargs on PTransforms

I suggest we choose one of the two and just adhere to it exactly, rather than introducing a third convention. Seems like the injection names are more generic-sounding, and verbose, so those are probably the right ones to use, meaning the fields on `Config` would be:

- `Config.target_storage`
- `Config.input_cache_storage`
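Putting that together, a minimal sketch of what the dataclass might look like with those names. The types and defaults mirror the config.py diff earlier in this thread; the import paths are assumptions, not the merged API:

```python
# Hypothetical sketch only: field names follow the injection-naming
# convention discussed above; import paths are assumed.
from dataclasses import dataclass, field
from typing import Optional, Union

from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget
from pangeo_forge_recipes.transforms import RequiredAtRuntimeDefault


@dataclass
class Config:
    target_storage: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = field(
        default_factory=RequiredAtRuntimeDefault
    )
    input_cache_storage: Optional[Union[str, CacheFSSpecTarget]] = ""
```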
I don't enjoy the way the word "recipe" rolls off my tongue, so this is a good change. I prefer referring to these workflows as "KitchenSequence"(s) and "BakeChain"(s) 😏
Getting close, @ranchodeluxe, thanks for pushing on this.
A few small notes on docs/docstring, and also a bigger-picture question re: if we can leverage this internally as well.
Looking forward to your thoughts!
"StoreToZarr": { | ||
"target_root": "TARGET_STORAGE", | ||
}, |
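For context, this snippet is part of the mapping that tells pangeo-forge-runner which transform kwargs receive injected values. Based on the three transforms named in the next comment, the full spec presumably looks roughly like this (a sketch inferred from the thread, not the verbatim module):

```python
# Sketch of the injection spec mapping (names inferred from this thread;
# the actual module/function layout in pangeo_forge_recipes may differ).
# Keys are transform class names; values map kwarg names to the
# runner-supplied config values.
def get_injection_specs():
    return {
        "StoreToZarr": {"target_root": "TARGET_STORAGE"},
        "WriteCombinedReference": {"target_root": "TARGET_STORAGE"},
        "OpenURLWithFSSpec": {"cache": "INPUT_CACHE_STORAGE"},
    }
```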
Should we take this a step further, and actually use this config object internally?
Meaning: delete the StoreToZarr, WriteCombinedReference, and OpenURLWithFSSpec injection specs here, and instead do something like:
```diff
# transforms.py
from .config import Config

c = Config()
...
class StoreToZarr(PTransform):
    ...
-    target_root: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = field(
-        default_factory=RequiredAtRuntimeDefault
-    )
+    target_root: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = c.target_storage
```
Without fiddling with it a bit I'm not sure exactly how this would work, but conceptually, what do we think about making `Config` the interface to all injections, including for internal transforms?
I do enjoy where you're going with this ✨ Given that dependency injection targets the recipe, this won't work AFAICT without more complicated acrobatics in the AST rewriter, which I think we should avoid. So my feedback here is that we just do what you are proposing in the recipes explicitly:
**benefits**

- recipes < 0.10.x (some version with this change) can easily be backward compatible and still leverage the existing dep injection
- recipes >= 0.10.x (some version with this change) are now explicit, and folks need less explanation about how this all works (except the explanation about dep injection). For example, with this change you probably don't even need to write documentation about how to write custom transforms, b/c the `StoreToZarr`, `WriteCombinedReference` and `OpenURLWithFSSpec` recipe examples already show how it works
- it's only a couple/few call sites we'd be passing `conf.target_storage` and `conf.input_cache_storage` to per recipe (see the sketch below)

**drawbacks**

- if the number of transforms we write requiring `conf.target_storage` and `conf.input_cache_storage` dramatically increases then, yes, it'd seem we might be doing an unnecessary amount of input passing
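To make that explicit style concrete, here is a sketch of a recipe passing injected config to the built-in transforms itself. The `Config` import path and field names are the ones proposed in this PR (assumptions, not a merged API), and `pattern` is a `FilePattern` assumed to be defined earlier in the recipe:

```python
# Sketch: recipe explicitly passes injected config to built-in transforms.
import apache_beam as beam

from pangeo_forge_recipes.config import Config  # module proposed in this PR
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

conf = Config()  # fields populated by the runner's injections at deploy time

recipe = (
    beam.Create(pattern.items())  # pattern: a FilePattern defined elsewhere
    | OpenURLWithFSSpec(cache=conf.input_cache_storage)
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=conf.target_storage,
        store_name="output.zarr",
        combine_dims=pattern.combine_dim_keys,
    )
)
```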
pangeo_forge_recipes/config.py

```python
"""a simple class that contains all possible injection spec values

folks can import it into their recipes to pass injection spec values
around to custom beam routines as the example below shows
```
Conventionally, our docstrings tend to be a bit less colloquial, could we formalize the language a bit here? Thanks!
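For instance, one possible more formal wording (just a suggestion, not a prescription):

```python
"""Container for deploy-time injection spec values.

Recipe authors may import this class to pass injected values (e.g.
target and cache storage locations) to custom Beam transforms, as
shown in the example below.
"""
```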
```python
@dataclass
class Config:
```
Let's add this to the API Reference section of the docs.
And possibly link to it in either/or:
- The discussion of "Deploy-time configurable keyword arguments" under **Recipe composition > Transforms**
- a new Advanced Topics section, e.g. "Custom Transforms"
thanks 💯 I'll get around to this after hearing back from Yuvi and other folks
I totally agree too! I have many strong opinions, particularly around #652 (comment):
I feel like there was a lot of context around why we settled on an injection-based workflow vs. various other options that were considered at that time. I know this happened during the meetings, but do you know if there is a solid written record of it somewhere, @cisaacstern? How urgent is this PR? I would like to provide a more solid explanation here, but I just spent far too long on airplanes, and don't want to write a half-assed response before my brain recovers.
From my perspective it is worth waiting to get your review, @yuvipanda. Hopefully the DevSeed folks can run anything they need from this branch until then!
@cisaacstern and @norlandrhagen: Rebasing this. Hoping we can tidy this up quickly and merge to address the limitations in #668, even though the discussion stalled about where/how much we should use this pattern.
Ignore this. Won't work given how the AST rewriter works 😞. Back to the drawing board. Sorry about the noise.
I've been staring at this issue (I'm convinced that a backdoor to arbitrary runtime config is probably useful) and trying to wrap my head around the rationale for the complex injection strategy defined in […]. Can someone talk me down? Convince me I'm just stupid, please! I will absolutely attempt to explain/document the reasoning for others similarly impaired.
@moradology let me try! These are all the design considerations I had when building out the dynamic injections. It's been a few months, and unfortunately many of these conversations happened in meetings without being written down in detail, so be gentle if I miss a few words here and there (or get things wrong!). You (and others!) probably already know all this information (and maybe more!), so think of this more as a historical explanation than an advocacy for the status quo.

**Separation of config from the recipe**

The same recipe / feedstock should be runnable on my local machine, on GCP, and on AWS. This was the core reason I started the runner project - prior to that, recipes were run by a heroku runner purely on dataflow. So this means that some config - particularly about where the data should go - must be defined by the bakery operator, via the traitlets config file, rather than by the recipe writer. But this information needed to be passed into the PTransforms somehow, so they actually know where to write the output to. This was the core of the problem that needed solving - how do we allow bakery operators to specify this information, rather than recipe writers?

**Parsing vs Running**

Code written in a recipe is distinctly executed in two 'phases' (there may be beam terminology for this I'm not aware of):

1. *Parse* time, when the recipe file itself is executed to construct the pipeline DAG, on whatever machine invokes the runner
2. *Run* time, when the constructed DAG actually executes, possibly much later and on remote workers (dataflow, flink, etc.)

This separation is hard to see sometimes - if you look at a function you write, is it going to get called where the runner is being called, at 'parse' phase? Or in flink remotely somewhere, 2 hours from now, at runtime? Sometimes the answer is both! There's also […]. But fundamentally, the goal is for the generated DAG to be as pure as possible, not making any references to variables or values that are only available at parse time, but not runtime. Here's an example of a problem:

```python
import apache_beam as beam
import json

# parse-time read: this file exists only where the pipeline is constructed
with open('/etc/target-config.json') as f:
    config = json.load(f)

def output_path(item):
    # runtime reference to a parse-time value
    with open(config["outputPath"], "a") as out:
        out.write(item + "\n")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create elements" >> beam.Create(["Hello", "World!"])
        | "Output elements" >> beam.Map(output_path)
    )
```

What happens when you run this locally? It works all fine, not a problem! But what happens if you run this on dataflow or flink? There, `output_path` runs on remote workers, where `/etc/target-config.json` does not exist and the parse-time `config` value is not reliably available. So a useful way to think about this is that Transforms really need to be pure functions, operating only on the parameters passed on to them. Even trying to use […]. The injections work here by treating all transforms as pure functions, and passing in config to them the only way it's possible to pass information into a pure function - by adding it as a parameter. An elaborate partial, done in such a way that when pickled, all the information needed for a transform to execute is already in that transform! This also means you can't inject things that can't be pickled, but we mostly inject strings and dicts, so it's ok. The alternative we considered was environment variables, but PTransforms looking for […]. I also didn't spend too many innovation points here, as this is mostly just the same system pytest uses.
**Separation of runner from […]**
@yuvipanda This was an excellent writeup and incredibly valuable historical motivation for pieces that would have taken hours of code archaeology to uncover. Thank you for taking the time to write it all out and expose the constraints you were working around. Plenty to mull over here. And regardless of how things shake out, it will provide some top-notch material to expose in comments and documentation.
Thanks, @moradology :) Glad to be of use. Personally, I would say that documenting very clearly this 'parse time' vs 'runtime' situation (in more detail than I have presented here) would be incredibly valuable.
As an example, in this PR, […]. And then the […]. As described in this comment in this PR, I think doing this via […].
At the risk of revealing ignorance, I'm wondering if a transition to lazy pipeline construction (#168 on runner) might help, i.e. expecting […]. So, the recipe would need to be something like […]. What do you think, @yuvipanda?
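Purely as a sketch of what that "lazy" shape could hypothetically look like (names invented for illustration; not an existing runner API), the recipe module would export a callable, and the runner would call it with the resolved config instead of rewriting the module's AST:

```python
# Hypothetical "lazy construction" recipe: the runner imports make_recipe
# and calls it at parse time with the operator's resolved config, so no
# AST rewriting or injection machinery is needed.
import apache_beam as beam


def make_recipe(config) -> beam.PTransform:
    # `config` would carry fields like target_storage / input_cache_storage
    return (
        beam.Create(["Hello", "World!"])
        | beam.Map(lambda item, root: f"{root}/{item}", root=config.target_storage)
    )
```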
Unfortunately @moradology I do not have the capacity now to help figure it out, although I'd have loved to :( So this will have to go on without me. Am still happy to provide any historical context, but I think y'all have a good handle on how to take this one forward!
💯 points for @yuvipanda for setting boundaries on uncompensated effort, a very difficult thing to do in open source!
@yuvipanda Understood! I appreciate the context you've laid out for us here and think we should be in a position to keep things rolling. Obviously your wisdom and experience here are very welcome, but as @cisaacstern points out, they are yours to give and not ours to demand. Thank you again.
You're welcome :) I'm very happy to see all the effort being poured into pangeo-forge - I think it's truly transformational technology, and I'm extremely excited to see where it goes. I'm very happy to still answer questions about specific historical choices made, but I also trust y'all to take the project forward :)
I do see how this could be one way of making the config available to arbitrary transforms, but am not convinced (nor are you suggesting) that this alone is a strong enough justification for merging the linked runner PR (this is a simple, but not the only, way to get the config where we need it). I have some other thoughts on the linked runner PR which I'll share over there now.
Problem
If someone wants to write a custom `beam.PTransform` or `beam.DoFn` in their recipe and they would like to leverage `target_root` or `cache`, the only option (which is a heavy lift) is to make their recipe into a package with an entrypoint. There should be a simple way to accommodate this. From this conversation on DevSeed's pangeo-forge-external channel:

@sbquinlan workaround: https://github.com/pangeo-forge/staged-recipes/pull/259/files#diff-c2ef5f081c41ff256ca9344512c7d7d8f369623a8b9483c6c1921cdafc2825deR220-R226
Goal
See the docstring in the diff for an example, or this example recipe for local testing: #651
I wanted to just inject the whole config dict into a simple variable at import time or during AST traversal using `visit_Assign`, but I see we'd still have to modify all places where it's used, so this seems to be the best option 🦐 🦃 (A toy sketch of the `visit_Assign` idea follows the companion PR link below.)

Runner Companion PR
pangeo-forge/pangeo-forge-runner#146
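For readers curious about the `visit_Assign` idea mentioned in the Goal section, here's a toy sketch of how an `ast.NodeTransformer` could rewrite a module-level assignment. This is purely illustrative (the variable name `CONFIG` and the class are invented), not how pangeo-forge-runner's actual rewriter works:

```python
import ast


class InjectConfig(ast.NodeTransformer):
    """Toy rewriter: replace the value of any top-level `CONFIG = ...`
    assignment with a literal dict supplied at deploy time."""

    def __init__(self, injected: dict):
        self.injected = injected

    def visit_Assign(self, node: ast.Assign) -> ast.Assign:
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if "CONFIG" in names:
            # parse the repr of the injected dict into an expression node
            node.value = ast.parse(repr(self.injected), mode="eval").body
        return node


source = "CONFIG = {}\nprint(CONFIG)\n"
tree = InjectConfig({"target_storage": "s3://bucket/prefix"}).visit(ast.parse(source))
ast.fix_missing_locations(tree)
exec(compile(tree, "<recipe>", "exec"))  # prints the injected dict
```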