Reduce peak memory use of VCE RARE assets #3959
base: main
Conversation
One major issue -- this isn't producing a concatenated parquet file that corresponds to the output table we want to distribute, and it isn't using the PyArrow schema that is defined by the resource metadata for this output table.
Non-blocking, but if there's an easy way to not fast-fail the asset checks and provide more comprehensive feedback to users, that would be nicer for debugging. See specific notes.
@zaneselvans I've updated to produce a single monolithic parquet file, which definitely makes more sense. I've also split the one big asset check into a bunch of little ones. Memory usage can get a little high when running them all in parallel, but it's not too bad. If we encounter issues running the full ETL, we might want to mark them all as high-memory. It looks like one remaining issue is a docs failure from adding the duckdb Python dependency. Do you have any idea what's causing that?
Overview
This PR refactors the VCE RARE transforms and the asset_check to reduce peak memory usage.
Closes #3925.
Makes progress on #3926, but the row count asset check still fails on the fast ETL.
Approach
To reduce memory usage, I refactored the transforms to work on a single year of data at a time and write outputs to parquet files. This approach writes directly to parquet without an IO manager, which doesn't feel ideal, but it does work well to reduce memory usage.
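For illustration, here is a minimal sketch of that year-at-a-time pattern. The schema and the transform_year() helper are placeholders, not the actual VCE RARE transform or the schema defined in the resource metadata:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema; the real one would come from the resource metadata.
SCHEMA = pa.schema(
    [
        ("report_year", pa.int32()),
        ("county_id_fips", pa.string()),
        ("capacity_factor_solar_pv", pa.float32()),
    ]
)


def transform_year(year: int) -> pd.DataFrame:
    """Stand-in for the per-year VCE RARE transform."""
    raise NotImplementedError


def write_vce_rare_parquet(years: list[int], out_path: str) -> None:
    """Transform one year at a time, appending each to a single parquet file."""
    with pq.ParquetWriter(out_path, SCHEMA) as writer:
        for year in years:
            # Only one year's dataframe is held in memory at a time.
            df = transform_year(year)
            writer.write_table(pa.Table.from_pandas(df, schema=SCHEMA))
```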
To deal with the asset checks, I refactored them from loading the entire table into pandas to using duckdb to query the parquet outputs. This approach is also a bit messy, but it's fast and efficient, and we could probably build tooling around it and standardize it as part of the validation framework design.
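As a rough sketch of what one of these duckdb-backed checks could look like (the asset name, column, and parquet glob below are illustrative placeholders, not the actual PUDL identifiers or paths):

```python
import duckdb
from dagster import AssetCheckResult, asset_check

# Placeholder glob; in practice this points at the parquet files the transform wrote.
VCE_PARQUET_GLOB = "out_vce__hourly_available_capacity_factor/*.parquet"


@asset_check(asset="out_vce__hourly_available_capacity_factor")
def capacity_factor_in_range() -> AssetCheckResult:
    """Query the parquet outputs directly rather than loading them into pandas."""
    n_bad = duckdb.query(
        f"""
        SELECT count(*)
        FROM read_parquet('{VCE_PARQUET_GLOB}')
        WHERE capacity_factor_solar_pv < 0 OR capacity_factor_solar_pv > 1
        """
    ).fetchone()[0]
    return AssetCheckResult(passed=(n_bad == 0), metadata={"n_bad_rows": n_bad})
```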
Alternative approaches
Partitioned assets
I think using partitioned assets would be a good approach for this asset, as well as for other resource-intensive assets, but I believe there are a couple of reasons why that's not easy or feasible at this point. Mainly, it seems like best practice is to maintain one consistent partitioning scheme per job. I think splitting the ETL into multiple jobs with partitioned assets could be a good pattern to adopt, but this felt like too big a can of worms to tackle right now.
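For reference, a minimal sketch of what per-year partitioning might look like in Dagster; the asset name and year range here are illustrative, not the actual ones:

```python
from dagster import StaticPartitionsDefinition, asset

# Illustrative only: one partition per report year.
vce_years = StaticPartitionsDefinition([str(y) for y in range(2019, 2024)])


@asset(partitions_def=vce_years)
def out_vce__hourly_available_capacity_factor(context) -> None:
    # Each run would materialize a single year instead of all years at once.
    year = int(context.partition_key)
    ...
```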
Dynamic op graph
The other approach I investigated was using dynamic op graphs to process years in parallel, but I found this could still lead to significant memory spikes depending on what ends up running together, and the unparallelized version doesn't take too long to run.
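For completeness, a rough sketch of the dynamic op graph alternative that was ruled out, using placeholder op names and a placeholder year range:

```python
from dagster import DynamicOut, DynamicOutput, graph, op


@op(out=DynamicOut(int))
def fan_out_years():
    # Placeholder year range.
    for year in (2019, 2020, 2021):
        yield DynamicOutput(year, mapping_key=str(year))


@op
def transform_one_year(year: int):
    """Per-year transform; the mapped copies can run concurrently,
    which is where the memory spikes come from."""
    raise NotImplementedError


@op
def concatenate_years(results: list):
    """Combine the per-year outputs into the final table."""
    raise NotImplementedError


@graph
def vce_rare_dynamic():
    years = fan_out_years()
    concatenate_years(years.map(transform_one_year).collect())
```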
Tasks