PUDL now using Dagster, GitHub Projects, and Python 3.11 #2475

zaneselvans · 2023-03-31T18:01:49Z

zaneselvans
Mar 31, 2023
Maintainer

Dagster Orchestration

If you follow the repo at all, you've probably already noticed that we've made some big changes to the architecture. The biggest is the shift to using Dagster to orchestrate our data pipeline. You can find many more details in our release notes and the links from there to individual PRs. We've updated our development documentation with an overview of how Dagster works and how it's different from our old homebrew system. For more in-depth background, check out Dagster's getting started guide.

This is a big step toward our goal of distributing analysis-ready data rather than a complex data processing application. The software is still going to be here for anyone to build on or learn from, but we want day-to-day users to be able to grab all the clean data, and all the useful outputs we've already derived from that data, as easily as possible without being tied to any particular platform (e.g. from our nightly build outputs via the AWS Open Data Registry).

[Figure: EIA-860/923 sub-DAG]
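
For anyone new to the idea of a pipeline expressed as a DAG, here is a toy sketch of how an orchestrator can work out a valid execution order from declared dependencies. It uses only the Python standard library and is not Dagster's API. The asset names are hypothetical and do not correspond to real PUDL assets.

```python
from graphlib import TopologicalSorter

# Hypothetical assets, each mapped to the upstream assets it depends on.
# These names are illustrative only, not real PUDL asset names.
dependencies = {
    "raw_eia860": set(),
    "raw_eia923": set(),
    "clean_eia860": {"raw_eia860"},
    "clean_eia923": {"raw_eia923"},
    "generation_by_plant": {"clean_eia860", "clean_eia923"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())

for asset in order:
    print(asset)
```

Because each asset declares only its direct inputs, adding a new table means adding one entry rather than editing a hand-written sequence of steps. Dagster builds a graph like this from asset definitions and adds scheduling, caching, and a UI on top.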

Help with Dagster

If you currently run the PUDL data pipeline, or are interested in contributing to the project, and need help getting the new system set up, feel free to schedule some office hours or ask a question here in GitHub Discussions and we'll be happy to lend a hand! There's a bit of a learning curve, but our experiences so far have been very positive. Dagster makes working with the PUDL data pipeline much more enjoyable, and on a beefy laptop all the data can now be processed in about 10 minutes (except for EPA CEMS, which we are still parallelizing).

Output Tables in the PUDL DB

The next big step is converting all of our "output tables" into assets that are also managed by Dagster, and written directly into the databases and Apache Parquet files we distribute. That work is just getting started and is being coordinated through issue #1973, which contains many sub-issues, one for each collection of related tables. We're still hammering out exactly what it will look like to convert existing output tables into assets. It may include a mix of creating SQL views for simple denormalized tables, and Python assets for anything complex. Let us know if you're interested in helping with the conversion! Once we've got a few tables switched over, hopefully the pattern will be clear.
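
As a rough illustration of the "SQL view for a simple denormalized table" option, here is a minimal sketch using an in-memory SQLite database. The table, column, and view names are hypothetical and are not the real PUDL schema. The final pattern for PUDL is still being worked out, so treat this as one possible shape rather than the approach we've settled on.

```python
import sqlite3

# In-memory database standing in for the PUDL SQLite DB (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical normalized tables; not the real PUDL schema.
    CREATE TABLE plants (
        plant_id INTEGER PRIMARY KEY,
        plant_name TEXT
    );
    CREATE TABLE generators (
        plant_id INTEGER REFERENCES plants (plant_id),
        generator_id TEXT,
        capacity_mw REAL,
        PRIMARY KEY (plant_id, generator_id)
    );
    INSERT INTO plants VALUES (1, 'Example Plant');
    INSERT INTO generators VALUES (1, 'A', 50.0);
    INSERT INTO generators VALUES (1, 'B', 25.5);

    -- A denormalized "output table" expressed as a view: the plant name
    -- is joined onto each generator record at query time.
    CREATE VIEW denorm_generators AS
    SELECT g.plant_id, p.plant_name, g.generator_id, g.capacity_mw
    FROM generators AS g
    JOIN plants AS p ON g.plant_id = p.plant_id;
    """
)

rows = conn.execute(
    "SELECT * FROM denorm_generators ORDER BY generator_id"
).fetchall()
print(rows)
```

A view like this stores no extra data and always reflects the underlying tables, which is why it suits simple joins. Anything involving heavier computation would presumably be a Python asset instead.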

For the time being, we're going to maintain the PudlTabl output object interface and simply redirect its methods to read from the database rather than doing their own calculations. In a future release we plan to remove this software abstraction entirely and expect users to read tables directly from the database.
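
To give a sense of what reading a table directly from the database could look like, here is a small sketch using only the standard library `sqlite3` module. The `read_table` helper and the `example_output` table are hypothetical names invented for this example and are not part of PUDL. In practice you would more likely load the table into a dataframe.

```python
import sqlite3


def read_table(conn, table_name):
    """Return all rows of a table or view as a list of dicts.

    Hypothetical helper for illustration; not part of PUDL.
    """
    # Only allow names that actually exist in the database, since the
    # table name cannot be passed as a bound query parameter.
    known = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
        )
    }
    if table_name not in known:
        raise ValueError(f"Unknown table: {table_name!r}")
    cursor = conn.execute(f'SELECT * FROM "{table_name}"')
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Stand-in database with one made-up output table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_output (plant_id INTEGER, net_generation_mwh REAL)"
)
conn.execute("INSERT INTO example_output VALUES (1, 1234.5)")

records = read_table(conn, "example_output")
print(records)
```

The point is that once output tables live in the database, no PUDL-specific Python object is needed to get at them. Any tool that can query SQLite or read Parquet will do.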

GitHub Projects

We've started doing our project management in the open using the new GitHub Projects. This means you too can:

Track our progress on the current sprint
Browse our infrastructure backlog
Check out the project roadmap (we still haven't done our Q2 planning though...)

Python 3.11 (and other dependencies)

We are moving toward treating PUDL as an application rather than a library for other projects to depend on. One part of that is supporting a narrow set of versions for our own dependencies, and we hope to use a lockfile to pin them all to particular versions in the near future. PUDL now runs exclusively on Python 3.11. Pandas, SQLAlchemy, and Pydantic all have major version updates coming soon. See #2384.
