-
It seems like the changes driving our need for more compute resources are:
A cloud-based Dagster deployment seems like it is intended to handle these kinds of workflows. Like it's their main commercial use-case. Things that make me apprehensive about moving in this direction:
- I can see the attraction of a hybrid system where some assets are just pulled from the most recent nightly build outputs and others are computed locally, but creating and maintaining our own system for managing these two different data sources seems like it could be a significant burden.
- One of Dagster's main selling points is supposed to be the ease with which different computing environments can be swapped in via different resource configurations, enabling a relatively seamless transition between local / testing / production setups (a rough sketch of that pattern is below). Do they have a canonical solution that isn't "Do all your compute on our service and pay a big markup!"?
- DuckDB's commercial service seems to be aimed at facilitating the use of hybrid local/remote data. A related post from Dagster.
- Are we inevitably headed towards a system where outside contribution is technically more difficult, and requires users to have cloud access and a budget (either theirs or ours)?
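For reference, the resource-swapping pattern referred to above looks roughly like this with Dagster's Pythonic resources API. This is only a sketch: the resource name, asset name, and paths are made up for illustration and aren't PUDL's actual resources.

```python
import os

from dagster import ConfigurableResource, Definitions, asset


class DatastoreResource(ConfigurableResource):
    """Hypothetical resource pointing at where raw inputs live."""

    root_path: str


@asset
def raw_plants(datastore: DatastoreResource) -> str:
    # The asset only sees the resource interface, not which environment backs it.
    return f"would read raw data from {datastore.root_path}"


# Swap in different resources per deployment without touching asset code.
RESOURCES = {
    "local": {"datastore": DatastoreResource(root_path="~/pudl_input")},
    "production": {"datastore": DatastoreResource(root_path="gs://example-pudl-inputs")},
}

deployment = os.getenv("DAGSTER_DEPLOYMENT", "local")
defs = Definitions(assets=[raw_plants], resources=RESOURCES[deployment])
```

The open question is whether that swap can point "production" at something other than Dagster's paid cloud offering without us building and maintaining the glue ourselves.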
-
Mostly regurgitating some high-level requirements for PUDL scaling:
I agree it isn't clear how we can keep a low barrier for contributions while also setting up infrastructure for larger datasets. I like Zach's idea of creating an IOManager that pulls upstream assets from the most recent nightly build. Egress could get expensive but would save time, and we wouldn't be as limited by local computing power. Things that are difficult to run locally:
Correct me if I'm wrong, but based on our goals with PUDL, most of the datasets we want to integrate, and where we expect the most contributions, are in bucket number 1. This makes me think @zaneselvans's idea of separating the DAG into two groups is a good idea: 1) code that can be executed locally and that we want external folks to contribute to (your smaller EIA, EPA, and FERC forms), and 2) code that requires cloud resources and that we don't expect external folks to contribute to because of complexity and access to cloud resources (your EQR, PDF OCRing, and beefy ML models). As Zane said, processes in group 1 can pull cached models and portions of datasets produced by processes in group 2 (see the sketch below).

We'll have to get comfortable with not being able to run the full ETL locally. For example, our local full ETL configuration could include all years of our existing datasets and one year of EQR, and either skip a super compute/storage-intensive PDF ML workflow or pull it from a remote cache. We can set up development VMs for folks working on big datasets and ML problems. This is a good reminder to include increases in cloud costs in our "How much does it cost to maintain PUDL?" budgeting exercise.

It is probably too in the weeds at this point, but Dagster supports the execution of assets on Dask clusters. We could have our nightly builds execute on a distributed Dask cluster set up using k8s. This will be helpful when we outgrow a single VM.
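For concreteness, here's a minimal sketch of what that two-group split could look like using Dagster asset groups. Asset and job names are hypothetical placeholders, not real PUDL assets.

```python
from dagster import AssetSelection, Definitions, asset, define_asset_job


@asset(group_name="local_friendly")
def core_eia_plants():
    """Smaller datasets we expect external contributors to work on."""
    ...


@asset(group_name="cloud_only")
def ferc_eqr_transactions():
    """Big or ML-heavy assets that only run on cloud infrastructure."""
    ...


# A job contributors can run on a laptop: just the local-friendly group.
local_etl = define_asset_job(
    name="local_etl",
    selection=AssetSelection.groups("local_friendly"),
)

# The nightly build materializes everything, including the cloud-only group.
full_etl = define_asset_job(name="full_etl", selection=AssetSelection.all())

defs = Definitions(
    assets=[core_eia_plants, ferc_eqr_transactions],
    jobs=[local_etl, full_etl],
)
```

Group 1 assets that depend on group 2 outputs would then load them from a cache of the nightly build rather than recomputing them locally.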
-
I found a Dagster discussion about how people deploy to GCP. It looks like some folks are using Cloud Run for the long-running services and for the run workers. I wonder if we could use Cloud Run for the long-running services and Batch for the run workers. This way, we might not need to muck with a full Kubernetes deployment.
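As a very rough sketch of the Batch half of that idea: the google-cloud-batch client can submit a containerized run worker as a one-off job. This isn't an existing Dagster run launcher; the project, region, image, command, and resource sizes below are all placeholders.

```python
from google.cloud import batch_v1

PROJECT = "example-project"  # placeholder
REGION = "us-central1"       # placeholder


def launch_run_worker(run_id: str) -> batch_v1.Job:
    """Submit a single containerized run worker as a GCP Batch job."""
    client = batch_v1.BatchServiceClient()

    runnable = batch_v1.Runnable()
    runnable.container = batch_v1.Runnable.Container(
        image_uri=f"gcr.io/{PROJECT}/pudl-etl:latest",  # placeholder image
        # Placeholder command; the real entrypoint would hand the run off to Dagster.
        commands=["echo", f"execute dagster run {run_id}"],
    )

    task = batch_v1.TaskSpec(
        runnables=[runnable],
        compute_resource=batch_v1.ComputeResource(cpu_milli=4000, memory_mib=16384),
    )
    job = batch_v1.Job(task_groups=[batch_v1.TaskGroup(task_spec=task, task_count=1)])

    return client.create_job(
        batch_v1.CreateJobRequest(
            parent=f"projects/{PROJECT}/locations/{REGION}",
            job_id=f"dagster-run-{run_id[:8]}",
            job=job,
        )
    )
```

The appeal is that Batch handles provisioning and teardown of the VM per run, so only the webserver and daemon would need to stay up on Cloud Run.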
-
Motivation
I've been thinking about this thread for a bit now, and more broadly about some of the limitations of our current infrastructure.
I think we're starting to see more and more signs that we're pushing the bounds of being able to usefully run PUDL locally. For example, @zaneselvans pointed out recently that total disk usage is getting really high, I'm seeing locked event log DB errors, I find memory usage can get quite high depending on which parts of the ETL are running together, and the time for a full ETL run is becoming prohibitive. To some extent there is likely optimization/tuning that we can do to improve some of these problems, but for how long is that going to be true? We know there's lots of new data integration we want to do, which will of course increase our resource requirements and total processing time. We're also doing more ML modeling, we know there's no shortage of record linkage we can do, and we've been exploring working with PDF-based data. It feels like a matter of time before we find ourselves wanting to use models that require GPUs, and models we can train and persist between runs. All of these struggles lead me to believe we're headed towards a world where no one should need to run the full ETL locally. I think this could be accomplished either by deploying Dagster or by further leveraging our current nightly build architecture.
Local development
If we move to only running PUDL in the cloud, this brings up the question of what our local development process will look like. I think the goal would be to make it very easy to run portions of the ETL locally in a way that reflects the production environment as well as possible. One way to enable this would be to use IOManagers that can materialize assets from a cloud bucket. This way, you could develop a new asset that depends on upstream assets and fetch those assets directly from the latest nightly build products (a rough sketch of such an IOManager is at the end of this comment).

Advantages of this setup:
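And here's a rough sketch of the kind of IOManager described above. It only illustrates the fallback idea, assuming upstream assets are published as parquet files in a nightly-build bucket; the bucket path and file naming convention are hypothetical, not PUDL's real layout.

```python
from pathlib import Path

import pandas as pd
from dagster import ConfigurableIOManager, InputContext, OutputContext

NIGHTLY_BUCKET = "gs://example-pudl-nightly-outputs"  # hypothetical path


class NightlyBuildIOManager(ConfigurableIOManager):
    """Write locally materialized assets to disk; read everything else
    from the most recent nightly build outputs."""

    local_dir: str = "pudl_output"

    def _local_path(self, asset_key) -> Path:
        return Path(self.local_dir) / f"{'__'.join(asset_key.path)}.parquet"

    def handle_output(self, context: OutputContext, obj: pd.DataFrame) -> None:
        path = self._local_path(context.asset_key)
        path.parent.mkdir(parents=True, exist_ok=True)
        obj.to_parquet(path)

    def load_input(self, context: InputContext) -> pd.DataFrame:
        local_path = self._local_path(context.asset_key)
        if local_path.exists():
            return pd.read_parquet(local_path)
        # Not materialized locally: fall back to the nightly build (needs gcsfs).
        name = "__".join(context.asset_key.path)
        return pd.read_parquet(f"{NIGHTLY_BUCKET}/{name}.parquet")
```

Locally you'd wire this in as the io_manager resource, while the production deployment would keep its existing IO managers and write everything itself.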