-
It seems like the changes driving our need for more compute resources are:
A cloud-based Dagster deployment seems like it is intended to handle these kinds of workflows. Like it's their main commercial use-case. Things that make me apprehensive about moving in this direction:
- I can see the attraction of a hybrid system where some assets are just pulled from the most recent nightly build outputs and others are computed locally, but creating and maintaining our own system for managing these two different data sources seems like it could be a significant burden.
- One of Dagster's main selling points is supposed to be the ease with which different computing environments can be swapped in via different resource configurations, enabling a relatively seamless transition between local / testing / production setups (a rough sketch of that pattern is below). Do they have a canonical solution that isn't "Do all your compute on our service and pay a big markup!"?
- DuckDB's commercial service seems to be aimed at facilitating the use of hybrid local/remote data. A related post from Dagster.
- Are we inevitably headed towards a system where outside contribution is technically more difficult, and requires users to have cloud access and a budget (either theirs or ours)?
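For reference, the resource-swapping pattern referred to above looks roughly like this with Dagster's Pythonic resources API. This is only a sketch: the resource name, asset name, and paths are made up for illustration and aren't PUDL's actual resources.

```python
import os

from dagster import ConfigurableResource, Definitions, asset


class DatastoreResource(ConfigurableResource):
    """Hypothetical resource pointing at where raw inputs live."""

    root_path: str


@asset
def raw_plants(datastore: DatastoreResource) -> str:
    # The asset only sees the resource interface, not which environment backs it.
    return f"would read raw data from {datastore.root_path}"


# Swap in different resources per deployment without touching asset code.
RESOURCES = {
    "local": {"datastore": DatastoreResource(root_path="~/pudl_input")},
    "production": {"datastore": DatastoreResource(root_path="gs://example-pudl-inputs")},
}

deployment = os.getenv("DAGSTER_DEPLOYMENT", "local")
defs = Definitions(assets=[raw_plants], resources=RESOURCES[deployment])
```

The open question is whether that swap can point "production" at something other than Dagster's paid cloud offering without us building and maintaining the glue ourselves.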
-
Mostly regurgitating some high-level requirements for PUDL scaling:
I agree it isn't clear how we can keep a low barrier for contributions while also setting up infrastructure for larger datasets. I like Zach's idea of creating an IOManager that pulls upstream assets from the most recent nightly build. Egress could get expensive but would save time, and we wouldn't be as limited by local computing power. Things that are difficult to run locally:
Correct me if I'm wrong, but based on our goals with PUDL, most of the datasets we want to integrate, and where we expect the most contributions, are in bucket number 1. This makes me think @zaneselvans's idea of separating the DAG into two groups is a good idea: 1) code that can be executed locally and that we want external folks to contribute to (your smaller EIA, EPA, and FERC forms), and 2) code that requires cloud resources and that we don't expect external folks to contribute to because of complexity and access to cloud resources (your EQR, PDF OCRing, and beefy ML models). As Zane said, processes in group 1 can pull cached models and portions of datasets produced by processes in group 2 (see the sketch below).

We'll have to get comfortable with not being able to run the full ETL locally. For example, our local full ETL configuration could include all years of our existing datasets and one year of EQR, and either skip a super compute/storage-intensive PDF ML workflow or pull it from a remote cache. We can set up development VMs for folks working on big datasets and ML problems. This is a good reminder to include increases in cloud costs in our "How much does it cost to maintain PUDL?" budgeting exercise.

It is probably too in the weeds at this point, but Dagster supports the execution of assets on Dask clusters. We could have our nightly builds execute on a distributed Dask cluster set up using k8s. This will be helpful when we outgrow a single VM.
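For concreteness, here's a minimal sketch of what that two-group split could look like using Dagster asset groups. Asset and job names are hypothetical placeholders, not real PUDL assets.

```python
from dagster import AssetSelection, Definitions, asset, define_asset_job


@asset(group_name="local_friendly")
def core_eia_plants():
    """Smaller datasets we expect external contributors to work on."""
    ...


@asset(group_name="cloud_only")
def ferc_eqr_transactions():
    """Big or ML-heavy assets that only run on cloud infrastructure."""
    ...


# A job contributors can run on a laptop: just the local-friendly group.
local_etl = define_asset_job(
    name="local_etl",
    selection=AssetSelection.groups("local_friendly"),
)

# The nightly build materializes everything, including the cloud-only group.
full_etl = define_asset_job(name="full_etl", selection=AssetSelection.all())

defs = Definitions(
    assets=[core_eia_plants, ferc_eqr_transactions],
    jobs=[local_etl, full_etl],
)
```

Group 1 assets that depend on group 2 outputs would then load them from a cache of the nightly build rather than recomputing them locally.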
-
I found a Dagster discussion about how people deploy to GCP. It looks like some folks are using Cloud Run for the long-running services and for the run workers. I wonder if we could use Cloud Run for the long-running services and Batch for the run workers. This way, we might not need to muck with a full Kubernetes deployment.
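As a very rough sketch of the Batch half of that idea: the google-cloud-batch client can submit a containerized run worker as a one-off job. This isn't an existing Dagster run launcher; the project, region, image, command, and resource sizes below are all placeholders.

```python
from google.cloud import batch_v1

PROJECT = "example-project"  # placeholder
REGION = "us-central1"       # placeholder


def launch_run_worker(run_id: str) -> batch_v1.Job:
    """Submit a single containerized run worker as a GCP Batch job."""
    client = batch_v1.BatchServiceClient()

    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container(
        image_uri=f"gcr.io/{PROJECT}/pudl-etl:latest",  # placeholder image
        # Placeholder command; the real entrypoint would hand the run off to Dagster.
        commands=["echo", f"execute dagster run {run_id}"],
    )

    task = batch_v1.TaskSpec(
        runnables=[runnable],
        compute_resource=batch_v1.ComputeResource(cpu_milli=4000, memory_mib=16384),
    )
    job = batch_v1.Job(task_groups=[batch_v1.TaskGroup(task_spec=task, task_count=1)])

    return client.create_job(
        batch_v1.CreateJobRequest(
            parent=f"projects/{PROJECT}/locations/{REGION}",
            job_id=f"dagster-run-{run_id[:8]}",
            job=job,
        )
    )
```

The appeal is that Batch handles provisioning and teardown of the VM per run, so only the webserver and daemon would need to stay up on Cloud Run.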
-
Motivation
I've been thinking about this thread for a bit now, and more broadly about some of the limitations of our current infrastructure.
I think we're starting to see more and more signs that we're pushing the bounds of being able to usefully run PUDL locally. For example, @zaneselvans pointed out recently that total disk usage is getting really high, I'm seeing locked event log DB errors, I find memory usage can get quite high depending on which parts of the ETL are running together, and the time for a full ETL run is becoming prohibitive. To some extent there is likely optimization/tuning that we can do to improve some of these problems, but for how long is that going to be true? We know there's lots of new data integration we want to do, which will of course increase our resource requirements and total processing time. We're also doing more ML modeling, we know there's no shortage of record linkage we can do, and we've been exploring working with PDF-based data. It feels like a matter of time before we find ourselves wanting to use models that require GPUs, and models we can train and persist between runs. All of these struggles lead me to believe we're headed towards a world where no one should need to run the full ETL locally. I think this could be accomplished either by deploying Dagster or by further leveraging our current nightly build architecture.
Local development
If we move to only running PUDL in the cloud, this brings up the question of what our local development process will look like. I think the goal would be to make it very easy to run portions of the ETL locally in a way that reflects the production environment as well as possible. One way to enable this would be to use IOManagers that can materialize assets from a cloud bucket. This way, you could develop a new asset that depends on upstream assets and fetch those assets directly from the latest nightly build products (a rough sketch of such an IOManager is at the end of this comment).

Advantages of this setup:
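And here's a rough sketch of the kind of IOManager described above. It only illustrates the fallback idea, assuming upstream assets are published as parquet files in a nightly-build bucket; the bucket path and file naming convention are hypothetical, not PUDL's real layout.

```python
from pathlib import Path

import pandas as pd
from dagster import ConfigurableIOManager, InputContext, OutputContext

NIGHTLY_BUCKET = "gs://example-pudl-nightly-outputs"  # hypothetical path


class NightlyBuildIOManager(ConfigurableIOManager):
    """Write locally materialized assets to disk; read everything else
    from the most recent nightly build outputs."""

    local_dir: str = "pudl_output"

    def _local_path(self, asset_key) -> Path:
        return Path(self.local_dir) / f"{'__'.join(asset_key.path)}.parquet"

    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        path = self._local_path(context.asset_key)
        path.parent.mkdir(parents=True, exist_ok=True)
        obj.to_parquet(path)

    def load_input(self, context: InputContext) -> pd.DataFrame:
        local_path = self._local_path(context.asset_key)
        if local_path.exists():
            return pd.read_parquet(local_path)
        # Not materialized locally: fall back to the nightly build (needs gcsfs).
        name = "__".join(context.asset_key.path)
        return pd.read_parquet(f"{NIGHTLY_BUCKET}/{name}.parquet")
```

Locally you'd wire this in as the io_manager resource, while the production deployment would keep its existing IO managers and write everything itself.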