PUDL now using Dagster, GitHub Projects, and Python 3.11 #2475
zaneselvans announced in Announcements
Dagster Orchestration
If you follow the repo at all, you've probably already noticed that we've made some big changes to the architecture. The biggest is the shift to using Dagster to orchestrate our data pipeline. You can find many more details in our release notes and the links from there to individual PRs. We've updated our development documentation with an overview of how Dagster works and how it's different from our old homebrew system. For more in-depth background, check out Dagster's getting started guide.
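To make the idea of orchestration a little more concrete, here is a minimal sketch of dependency-ordered execution over a graph of assets. It deliberately does not use Dagster's API: it relies only on the standard library's graphlib module, and the asset names are hypothetical, invented for illustration rather than taken from the PUDL codebase.

```python
from graphlib import TopologicalSorter

# Hypothetical asset names (not real PUDL assets). Each key maps to the
# set of upstream assets that must be built before it.
ASSET_DEPS = {
    "raw_eia860": set(),
    "raw_eia923": set(),
    "clean_eia860": {"raw_eia860"},
    "clean_eia923": {"raw_eia923"},
    "plants_output": {"clean_eia860", "clean_eia923"},
}


def execution_order(deps):
    """Return asset names ordered so that each asset follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(ASSET_DEPS)
print(order)
```

An orchestrator like Dagster does far more than this (tracking materializations, parallel execution, I/O management), but the core idea is the same: each asset declares what it depends on, and the tool works out a valid execution order.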
This is a big step toward our goal of distributing analysis-ready data rather than a complex data processing application. The software is still going to be here for anyone to build on or learn from, but we want day-to-day users to be able to grab all the clean data and all the useful outputs we've already derived from that data as easily as possible, without being tied to any particular platform (e.g. from our nightly build outputs via the AWS Open Data Registry).
[Figure: EIA-860/923 sub-DAG]
Help with Dagster
If you currently run the PUDL data pipeline, or are interested in contributing to the project, and need help getting the new system set up, feel free to schedule some office hours or ask a question here in GitHub Discussions and we'll be happy to lend a hand! There's a little bit of a learning curve, but our experiences so far have been very positive. Dagster makes working with the PUDL data pipeline much more enjoyable, and on a beefy laptop all the data can now be processed in about 10 minutes (except for EPA CEMS, which we are still parallelizing).
Output Tables in the PUDL DB
The next big step is converting all of our "output tables" into assets that are also managed by Dagster, and written directly into the databases and Apache Parquet files we distribute. That work is just getting started and is being coordinated through issue #1973, which contains many sub-issues, one for each collection of related tables. We're still hammering out exactly what it will look like to convert existing output tables into assets. It may include a mix of creating SQL views for simple denormalized tables, and Python assets for anything complex. Let us know if you're interested in helping with the conversion! Once we've got a few tables switched over, hopefully the pattern will be clear.
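As a rough illustration of the "SQL view for a simple denormalized table" option, the sketch below builds a view in an in-memory SQLite database and reads from it directly. The table names, column names, and rows are all made up for this example, and the use of SQLite is an assumption; this is not the actual PUDL schema or the pattern the project has settled on.

```python
import sqlite3

# In-memory database with two hypothetical normalized tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants (plant_id INTEGER PRIMARY KEY, plant_name TEXT);
    CREATE TABLE generators (
        plant_id INTEGER REFERENCES plants(plant_id),
        generator_id TEXT,
        capacity_mw REAL
    );
    INSERT INTO plants VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO generators VALUES
        (1, 'G1', 50.0), (1, 'G2', 25.0), (2, 'G1', 100.0);

    -- A denormalized "output table" expressed as a view: no data is copied,
    -- the join is evaluated whenever the view is queried.
    CREATE VIEW denorm_generators AS
    SELECT g.plant_id, p.plant_name, g.generator_id, g.capacity_mw
    FROM generators AS g
    JOIN plants AS p ON p.plant_id = g.plant_id;
    """
)

# Users read the view exactly as they would read an ordinary table.
rows = conn.execute(
    "SELECT plant_name, generator_id, capacity_mw "
    "FROM denorm_generators ORDER BY plant_id, generator_id"
).fetchall()
conn.close()
print(rows)
```

A view like this suits tables that are just joins of existing ones; anything involving heavier computation would more plausibly live in a Python asset, as the paragraph above suggests.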
For the time being we're going to maintain the PudlTabl output object interface, and just redirect its methods to read from the database rather than doing its own calculations, but in a future release we plan to remove this software abstraction entirely, and expect users to read tables directly from the database.
GitHub Projects
We've started doing our project management in the open using the new GitHub Projects. This means you too can:
Python 3.11 (and other dependencies)
We are moving toward treating PUDL like an application, not a library to be depended upon by other projects, and one part of that is having a narrow set of versions for our own dependencies, hopefully using a lockfile to pin them all to particular versions in the near future. PUDL is now exclusively using Python 3.11. Pandas, SQLAlchemy, and Pydantic all have some major version updates coming in the near future. See #2384.