
Reduce peak memory use of VCE RARE assets #3959

Open · wants to merge 17 commits into base: main

Conversation

@zschira (Member) commented on Nov 14, 2024

Overview

This PR refactors VCE RARE transforms and the asset_check to reduce peak memory usage.

Closes #3925

Makes progress on #3926, but the row count asset check still fails on the fast ETL.

Approach

To reduce memory usage, I refactored the transforms to work on a single year of data at a time and write the outputs to parquet files. This writes directly to parquet without an IO manager, which doesn't feel ideal, but it works well to reduce memory usage.
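
A minimal sketch of the per-year pattern (not the actual PUDL code; the function names, parquet filenames, and the idea of passing the per-year transform in as a callable are all illustrative):

```python
from collections.abc import Callable
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def transform_by_year(
    raw_dfs_by_year: dict[int, pd.DataFrame],
    transform_one_year: Callable[[pd.DataFrame], pd.DataFrame],
    out_dir: Path,
) -> list[Path]:
    """Transform one year at a time, writing each result straight to parquet."""
    paths = []
    for year, raw_df in sorted(raw_dfs_by_year.items()):
        transformed = transform_one_year(raw_df)
        path = out_dir / f"vcerare_{year}.parquet"  # hypothetical filename pattern
        pq.write_table(pa.Table.from_pandas(transformed), str(path))
        paths.append(path)
        # Only one year of transformed data is held in memory at any point.
        del transformed
    return paths
```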

For the asset checks, I switched from loading the entire table into pandas to querying the parquet outputs with duckdb. This approach is also a bit messy, but it's fast and efficient, and we could probably build tooling around it and standardize it as part of the validation framework design.
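
As a rough illustration of the duckdb-over-parquet pattern (the file glob and expected row count below are made up for the example, not taken from this PR):

```python
import duckdb

EXPECTED_ROWS = 8_760 * 5  # hypothetical expected row count

def check_row_count(parquet_glob: str = "parquet/vcerare_*.parquet") -> bool:
    """Count rows across the per-year parquet files without loading them into pandas."""
    con = duckdb.connect()
    (n_rows,) = con.execute(
        f"SELECT COUNT(*) FROM read_parquet('{parquet_glob}')"
    ).fetchone()
    return n_rows == EXPECTED_ROWS
```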

Alternative approaches

Partitioned assets

I think using partitioned assets would be a good approach for this asset, as well as for other resource-intensive assets, but there are a couple of reasons why that's not easy or feasible at this point. Mainly, it seems to be best practice to maintain one consistent partitioning scheme per job. Splitting the ETL into multiple jobs with partitioned assets could be a good pattern to adopt, but it felt like too big a can of worms to tackle right now.
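
For context, a bare-bones sketch of what a yearly-partitioned Dagster asset could look like (the asset name and year range are illustrative only):

```python
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset

vcerare_years = StaticPartitionsDefinition([str(y) for y in range(2019, 2024)])

@asset(partitions_def=vcerare_years)
def vcerare_hourly_capacity_factor(context: AssetExecutionContext) -> None:
    """Each run of this asset would process only one year's worth of data."""
    year = int(context.partition_key)
    # ... load, transform, and write just this year's data ...
```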

Dynamic op graph

The other approach I investigated was using dynamic op graphs to process years in parallel, but I found this could still lead to significant memory spikes depending on which ops end up running together, and the unparallelized version doesn't take too long to run anyway.
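
A sketch of the dynamic-op-graph alternative, using Dagster's DynamicOut/DynamicOutput fan-out (op and graph names are illustrative, not from this PR):

```python
from dagster import DynamicOut, DynamicOutput, graph, op

@op(out=DynamicOut())
def fan_out_years():
    """Emit one dynamic output per year of data."""
    for year in range(2019, 2024):
        yield DynamicOutput(year, mapping_key=str(year))

@op
def transform_one_year(year: int) -> str:
    """Transform a single year; placeholder returns the parquet path it would write."""
    return f"vcerare_{year}.parquet"

@op
def combine_years(paths: list[str]) -> None:
    """Concatenate the per-year parquet files into the final output."""

@graph
def vcerare_transform():
    combine_years(fan_out_years().map(transform_one_year).collect())
```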

Tasks

@zaneselvans added labels on Nov 14, 2024: performance (Make PUDL run faster!), parquet (Issues related to the Apache Parquet file format which we use for long tables), dagster (Issues related to our use of the Dagster orchestrator), duckdb (Issues referring to duckdb, the embedded OLAP database), vcerare (Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets)
@zaneselvans (Member) left a comment
One major issue: this isn't producing a concatenated parquet file corresponding to the output table we want to distribute, and it isn't using the PyArrow schema defined by the resource metadata for that table.

Non-blocking, but if there's an easy way to avoid fast-failing the asset checks and provide more comprehensive feedback to users, that would be nicer for debugging. See specific notes.
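
For reference, one generic way to stitch the per-year parquet files into a single output file while enforcing an explicit PyArrow schema (the schema would come from the resource metadata; this is a sketch, not PUDL's actual writer code):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def concat_parquet(year_paths: list[str], out_path: str, schema: pa.Schema) -> None:
    """Append each per-year file to one output file, casting everything to a shared schema."""
    with pq.ParquetWriter(out_path, schema) as writer:
        for path in year_paths:
            writer.write_table(pq.read_table(path).cast(schema))
```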

Review threads (resolved):
src/pudl/etl/__init__.py
src/pudl/transform/vcerare.py (multiple threads)
@zschira (Member, Author) commented on Nov 15, 2024

@zaneselvans I've updated to produce a single monolithic parquet file, which definitely makes more sense. I've also split the one big asset check into a bunch of little ones. Memory usage can get a little high when running them all in parallel, but it's not too bad. If we encounter issues running the full ETL we might want to mark them all as high-memory.
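
Illustratively, splitting a monolithic check into small Dagster asset checks backed by duckdb might look something like this (the asset name, column name, and file glob are hypothetical, not from this PR):

```python
import duckdb
from dagster import AssetCheckResult, asset_check

PARQUET_GLOB = "parquet/vcerare_*.parquet"  # hypothetical output location

@asset_check(asset="vcerare_hourly_capacity_factor")  # hypothetical asset name
def no_null_capacity_factors() -> AssetCheckResult:
    """One small, focused check rather than a single monolithic one."""
    (n_nulls,) = duckdb.connect().execute(
        f"SELECT COUNT(*) FROM read_parquet('{PARQUET_GLOB}') "
        "WHERE capacity_factor IS NULL"  # hypothetical column name
    ).fetchone()
    return AssetCheckResult(passed=bool(n_nulls == 0), metadata={"null_rows": n_nulls})
```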

It looks like one remaining issue is a docs failure from adding the duckdb python dependency. Do you have any idea what's causing that?

@zaneselvans zaneselvans self-requested a review November 15, 2024 23:24
@zaneselvans zaneselvans added this pull request to the merge queue Nov 15, 2024
@zaneselvans zaneselvans removed this pull request from the merge queue due to a manual request Nov 16, 2024
@zaneselvans zaneselvans changed the title Vce memory fix Reduce peak memory use of VCE RARE assets Nov 16, 2024
@zaneselvans (Member) commented:

  • RTD docs build issue was because I'd used the conda rather than PyPI name for the package in pyproject.toml
  • Removed a couple of leftover | None return types for the asset checks.
  • Changed the nightly build machine specs back to 8CPU/64GB
  • Running a branch deployment now to see if it'll actually run on the smaller machine, how long it takes, and whether the Zenodo sandbox is still b0rked.

Labels
performance · parquet · dagster · duckdb · vcerare
Projects
Status: In review

Development
Successfully merging this pull request may close: Refactor VCE RARE transform to reduce peak memory usage

2 participants