Reduce peak memory use of VCE RARE assets #3959
base: main
Conversation
One major issue -- this isn't producing a concatenated parquet file that corresponds to the output table we want to distribute, and it isn't using the PyArrow schema that is defined by the resource metadata for this output table.
Non-blocking, but if there's an easy way to not fast-fail the asset checks and provide more comprehensive feedback to users, that would be nicer for debugging. See specific notes.
@zaneselvans I've updated to produce a single monolithic parquet file, which definitely makes more sense. I've also split the one big asset check into a bunch of little ones. Memory usage can get a little high when running them all in parallel, but it's not too bad. If we encounter issues running the full ETL, we might want to mark them all as high-memory. It looks like one remaining issue is a docs failure from adding the duckdb Python dependency. Do you have any idea what's causing that?
Overview
This PR refactors the VCE RARE transforms and the asset_check to reduce peak memory usage.
Closes #3925.
Makes progress on #3926, but the row count asset check still fails on the fast ETL.
Approach
To reduce memory usage, I refactored the transforms to work on a single year of data at a time and write outputs to parquet files. This approach writes directly to parquet without an IO manager, which doesn't feel ideal, but it does work well to reduce memory usage.
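For illustration, here is a minimal sketch of that year-at-a-time pattern. The schema and the transform_year() helper are placeholders, not the actual VCE RARE transform or the schema defined in the resource metadata:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema; the real one would come from the resource metadata.
SCHEMA = pa.schema(
    [
        ("report_year", pa.int32()),
        ("county_id_fips", pa.string()),
        ("capacity_factor_solar_pv", pa.float32()),
    ]
)


def transform_year(year: int) -> pd.DataFrame:
    """Stand-in for the per-year VCE RARE transform."""
    raise NotImplementedError


def write_vce_rare_parquet(years: list[int], out_path: str) -> None:
    """Transform one year at a time, appending each to a single parquet file."""
    with pq.ParquetWriter(out_path, SCHEMA) as writer:
        for year in years:
            # Only one year's dataframe is held in memory at a time.
            df = transform_year(year)
            writer.write_table(pa.Table.from_pandas(df, schema=SCHEMA))
```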
To deal with the asset checks, I refactored them from loading the entire table into pandas to using duckdb to query the parquet outputs. This approach is also a bit messy, but it's fast and efficient, and we could probably build tooling around it and standardize it as part of the validation framework design.
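As a rough sketch of what one of these duckdb-backed checks could look like (the asset name, column, and parquet glob below are illustrative placeholders, not the actual PUDL identifiers or paths):

```python
import duckdb
from dagster import AssetCheckResult, asset_check

# Placeholder glob; in practice this points at the parquet files the transform wrote.
VCE_PARQUET_GLOB = "out_vce__hourly_available_capacity_factor/*.parquet"


@asset_check(asset="out_vce__hourly_available_capacity_factor")
def capacity_factor_in_range() -> AssetCheckResult:
    """Query the parquet outputs directly rather than loading them into pandas."""
    n_bad = duckdb.query(
        f"""
        SELECT count(*)
        FROM read_parquet('{VCE_PARQUET_GLOB}')
        WHERE capacity_factor_solar_pv < 0 OR capacity_factor_solar_pv > 1
        """
    ).fetchone()[0]
    return AssetCheckResult(passed=(n_bad == 0), metadata={"n_bad_rows": n_bad})
```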
Alternative approaches
Partitioned assets
I think using partitioned assets would be a good approach for this asset, as well as for other resource-intensive assets, but I believe there are a couple of reasons why that's not easy or feasible at this point. Mainly, it seems like best practice is to maintain one consistent partitioning scheme per job. I think splitting the ETL into multiple jobs with partitioned assets could be a good pattern to adopt, but this felt like too big a can of worms to tackle right now.
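For reference, a minimal sketch of what per-year partitioning might look like in Dagster; the asset name and year range here are illustrative, not the actual ones:

```python
from dagster import StaticPartitionsDefinition, asset

# Illustrative only: one partition per report year.
vce_years = StaticPartitionsDefinition([str(y) for y in range(2019, 2024)])


@asset(partitions_def=vce_years)
def out_vce__hourly_available_capacity_factor(context) -> None:
    # Each run would materialize a single year instead of all years at once.
    year = int(context.partition_key)
    ...
```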
Dynamic op graph
The other approach I investigated was using dynamic op graphs to process years in parallel, but I found this could still lead to significant memory spikes depending on what ends up running together, and the unparallelized version doesn't take too long to run.
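For completeness, a rough sketch of the dynamic op graph alternative that was ruled out, using placeholder op names and a placeholder year range:

```python
from dagster import DynamicOut, DynamicOutput, graph, op


@op(out=DynamicOut(int))
def fan_out_years():
    # Placeholder year range.
    for year in (2019, 2020, 2021):
        yield DynamicOutput(year, mapping_key=str(year))


@op
def transform_one_year(year: int):
    """Per-year transform; the mapped copies can run concurrently,
    which is where the memory spikes come from."""
    raise NotImplementedError


@op
def concatenate_years(results: list):
    """Combine the per-year outputs into the final table."""
    raise NotImplementedError


@graph
def vce_rare_dynamic():
    years = fan_out_years()
    concatenate_years(years.map(transform_one_year).collect())
```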
Tasks