generated from catalyst-cooperative/cheshire
Integrate second pass ex 21 model improvements and move to notebook based training #95
Open. zschira wants to merge 140 commits into main from prep_paragraph_classifier.
Commits
5767035 Initial dagster integration (zschira)
9d9fbfd Update validate integration test to dagster infra (zschira)
3da9659 Merge branch 'main' into dagster_integration (zschira)
ee77e7a Generalize mltools (zschira)
53d3354 Reorg repo to move towards generalized modelling repo (zschira)
014bcb1 Change library module structure (zschira)
5404148 Create turn experiment_tracking into sub-package (zschira)
886614f Remove unused function (zschira)
dec80b8 Gracefully handle mlflow run on failure (zschira)
e725f3d Fix variable name (zschira)
df44ed5 Change experiment tracker resource names (zschira)
93da052 Add mlflow artifact io-manager (zschira)
07713e9 Simplify pudl_models decorator (zschira)
5d89ec6 Split extraction logging into two funcs (zschira)
c57818a Add mlflow metrics io-manager (zschira)
625783b Change pudl_model to pudl_pipeline (zschira)
4f50a7b Add validation pipeline (zschira)
f6ab22c Streamline construction of dagster jobs for running/testing pudl models (zschira)
f20fb7d Remove old comment (zschira)
92e2e00 Add ex21 to dagster jobs (zschira)
520e6d1 Prep for multiple code locations (zschira)
e99ee1a Add top-level worksapce file (zschira)
559c0e6 Restructure docs (zschira)
93d02f3 Add train model job (zschira)
5190bf9 Log mlflow artifacts as parquet until csv is fixed (zschira)
ca9599e Fix ex21 extraction (zschira)
7e7a503 Add development section to docs (zschira)
61f48c3 Fix integration tests (zschira)
0fd8ffc Don't run ruff on notebooks (zschira)
97d5587 xfail ex21 integration test (zschira)
ace268b Add parquet upath io-manager (zschira)
fb1feeb Remove nb-output clear (zschira)
294ec72 Test docker deployment (zschira)
4de51b3 Chunk ex 21 extraction (zschira)
214e28f Fix asign copy (zschira)
c5736e0 Add job for testing ex21 resource usage (zschira)
4a81e88 Merge branch 'test_parquet_logging' into dagster_integration (zschira)
ec39633 Remove test docker files (zschira)
101ccf1 Remove complex asset factory (zschira)
7e0c5a5 Parallelize ex21 extraction (zschira)
080d790 Don't chunk in inference module (zschira)
44dfc52 Handle failures in converting to pdf (zschira)
6e24157 Delete cached pdfs early (zschira)
cd06d07 Add metadata to chunk_filings (zschira)
e3e8c45 Catch oom errors while extracting ex21 (zschira)
350defb Fix ex21 gcs io-manager (zschira)
3c80b72 Fix partitions for basic 10k extraction. (zschira)
31971b7 Cache layoutlm locally (zschira)
634a050 Fix caching model (zschira)
69ee4c0 Remove bad call (zschira)
63d6600 Test own_per conversion (zschira)
c8490d4 Add pandera types for output tables (zschira)
fa4f57d Add missing entities module (zschira)
35e917d Don't cache model, load with io manager (zschira)
a7b1c7f Remove float conversion (zschira)
f019117 Add hypothesis to deps (zschira)
d7d13d8 Make own_per str (zschira)
70f5293 Remove astype (zschira)
e406092 Validate ex21 return types (zschira)
f3835d9 Clean model download temp dir (zschira)
3c995cd Fix model return type (zschira)
ef55e4b Catch errors in creating ex 21 dataset (zschira)
b37450a Fix column name (zschira)
06b18ed Try to catch empty pdf errors (zschira)
abfc006 Print traceback in caught exception (zschira)
ff92a55 Fix empty pdf check (zschira)
8aa8c95 Actually fix empty pdf check? (zschira)
43600bc Use UPath in GCSArchive (zschira)
05ad82c Make _configure_mlflow a standalone function (zschira)
fddc3b2 Merge branch 'main' into error_handling_improvements (zschira)
99fc7ed Try to skip notebooks in ruff check (zschira)
b135500 Pull integration test fixes from main (zschira)
6e868f2 Fix typos in README.rst (zschira)
df4fd09 Cache downloaded layoutlm in dagster home (zschira)
74d237d Merge branch 'error_handling_improvements' of github.com:catalyst-coo… (zschira)
3642765 Fix broken test (zschira)
830bd74 fix rename filings (katie-lamb)
2cd1fe6 fix paths to cache training data (katie-lamb)
64dc8c5 update root dir path (katie-lamb)
226d91c Fix UPath initialization (zschira)
3c17d33 Fix path in test (zschira)
df69f42 Create huggingface dataset outside model execution (zschira)
2d3345c small fixes to path handling (katie-lamb)
46e7b40 Merge branch 'error_handling_improvements' into second-pass-ex21-impr… (katie-lamb)
6f9d34a Minor fixes (zschira)
07d500a Start migrating model training to notebook (zschira)
81813a7 Create dataset as dataframe for logging (zschira)
5174ed7 Modify dataset return type (zschira)
7a572c0 Fix dataset types for model signature (zschira)
5728026 Migrate ex 21 model training to a notebook (zschira)
5fbbfff Merge initial notebook migration (broken) (zschira)
37edd50 Split dataset loading into separate assets (zschira)
d6889e3 Minor notebook fixes (zschira)
d5e013a Fix import in notebook (zschira)
f9810db add device to pipeline (zschira)
2760881 Fix signature inference (zschira)
1dcacfa Fix notebook dagster config (zschira)
39bb45b Fix config param name (zschira)
cb83862 Partition training data (zschira)
c71593c Add partitions to notebook asset (zschira)
4efa515 Update ex21 labels (zschira)
581b2e3 Use run name for specifying training runs (zschira)
c67a1be Rework how notebook is configured (zschira)
b8a5b24 Finetune configuration (zschira)
45d5cf8 separate inference dataset creation from model prediction (zschira)
3e15b1f Remove deprecated inference module (zschira)
60a1260 Add notebook for training ex21 classifier (zschira)
4105110 Pull in model updates (zschira)
4d29037 Update classifier model (zschira)
85c44ff Fix set on copy pandas issue (zschira)
52e3580 Fix model uri's (zschira)
b709053 Fix indices in extraction model (zschira)
b8dad3c Fix typo (zschira)
e6b29ff Add asset factory for loading models (zschira)
3d11777 Catch layout classification NaN exception (zschira)
df5fe0d Use GCS pickle io-manager (zschira)
d6c41a2 Switch gcs pickle io manager to upath based (zschira)
ddd2263 Remove duplicate logger (zschira)
93bffcb Fix config warnings (zschira)
d717caa Test pin sphinx (zschira)
15be127 Catch errors while normalizing bounding boxes (zschira)
4117d0a Fix call to pandera example (zschira)
8c8dd60 Fix handle failures in converting to pdf (zschira)
ff821b5 Actually fix handle failures in converting to pdf (zschira)
a8eb359 Add model documentation to sec10k readme (zschira)
dc160ac Fix ex 21 validation integration test (zschira)
10b24a9 Improve classifier error handling (zschira)
ad54979 Fully broaden classifier errors (zschira)
672e123 add more docs on running the notebooks (zschira)
2dbdcaa clean up feature creation in paragraph classifier (katie-lamb)
cda3225 fix feature creation function (katie-lamb)
509b7a0 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
8855e5e small fixes to read in comments in tracking dataframe (katie-lamb)
f806ad8 Merge branch 'prep_paragraph_classifier' of https://github.com/cataly… (katie-lamb)
590ba60 updates to model pipeline (katie-lamb)
3db47d4 take out logging messages (katie-lamb)
5f23e1c update to exclude paragraph layout docs in labeled data tracking (katie-lamb)
e166c3d ad remove paragraph filenames from validation data (katie-lamb)
f529ca9 Split layoutlm training from inference model/validation (zschira)
7dcf5ac Merge branch 'prep_paragraph_classifier' of github.com:catalyst-coope… (zschira)
@@ -1,9 +1,13 @@
"""Implement tooling to interface with mlflow experiment tracking."""

from dagster import Config, asset
from pydantic import create_model

from .mlflow_io_managers import (
    MlflowBaseIOManager,
    MlflowMetricsIOManager,
    MlflowPandasArtifactIOManager,
    MlflowPyfuncModelIOManager,
)
from .mlflow_resource import (
    MlflowInterface,

@@ -12,6 +16,22 @@
)


def pyfunc_model_asset_factory(name: str, mlflow_run_uri: str):
    """Create asset for loading a model logged to mlflow."""
    PyfuncConfig = create_model(  # NOQA: N806
        f"PyfuncConfig{name}", mlflow_run_uri=(str, mlflow_run_uri), __base__=Config
    )

    @asset(
        name=name,
        io_manager_key="pyfunc_model_io_manager",
    )
    def _model_asset(config: PyfuncConfig):
        return config.mlflow_run_uri

    return _model_asset


def get_mlflow_io_manager(
    key: str,
    mlflow_interface: MlflowInterface | None = None,

Review comment on pyfunc_model_asset_factory:
This function will create an asset to load a model from mlflow. Using …
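The factory in this hunk builds a per-asset config class whose single field defaults to the URI passed in, then returns an asset that simply emits that URI; the IO manager registered under "pyfunc_model_io_manager" presumably does the actual model loading. The sketch below mirrors that pattern with the standard library only, so it runs without dagster or pydantic installed. It is an illustration, not the PR's code: `make_dataclass` stands in for `pydantic.create_model`, and `model_asset_factory`, the `config_cls` attribute, and the `runs:/abc123/model` URI are hypothetical names chosen for the example.

```python
from dataclasses import field, make_dataclass


def config_class_factory(name: str, mlflow_run_uri: str) -> type:
    """Build a config class whose only field defaults to the given URI.

    Stand-in for create_model(..., __base__=Config) in the diff above.
    """
    return make_dataclass(
        f"PyfuncConfig{name}",
        [("mlflow_run_uri", str, field(default=mlflow_run_uri))],
    )


def model_asset_factory(name: str, mlflow_run_uri: str):
    """Return a function that emits its configured URI.

    In the PR the returned function is a dagster asset and a separate
    IO manager consumes the URI; here it is a plain function.
    """
    config_cls = config_class_factory(name, mlflow_run_uri)

    def _model_asset(config=None):
        # Fall back to the default URI baked in at factory time.
        config = config if config is not None else config_cls()
        return config.mlflow_run_uri

    _model_asset.__name__ = name
    _model_asset.config_cls = config_cls  # exposed so callers can override
    return _model_asset


# Hypothetical asset name and run URI, for illustration only.
layout_model = model_asset_factory("layoutlm", "runs:/abc123/model")
default_uri = layout_model()
override_uri = layout_model(
    layout_model.config_cls(mlflow_run_uri="runs:/other/model")
)
```

Generating one config class per asset keeps a sensible default URI attached to each model while still letting a run override it through configuration, which appears to be the motivation for naming the class after the asset.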
I added this to save pickled asset outputs to GCS. This is needed because I separated the ex21 inference dataset creation from actually running the model, but the datasets take up too much space if they're saved locally.
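A minimal sketch of the idea in this comment, assuming a path-based design like the one the commit "Switch gcs pickle io manager to upath based" suggests: each asset output is pickled to a file under a base path, and downstream steps read it back instead of holding it locally. This is not the PR's implementation. `PicklePathIOManager` and the asset key are hypothetical, the method signatures are simplified relative to dagster's `IOManager` (which passes context objects), and `pathlib.Path` over a temporary directory stands in for a `UPath("gs://...")` bucket path, on the assumption that it exposes the same `/` and `open` operations.

```python
import pickle
import tempfile
from pathlib import Path


class PicklePathIOManager:
    """Store each asset output as a pickle file under a base path."""

    def __init__(self, base_path: Path):
        self.base_path = base_path

    def _path_for(self, asset_key: str) -> Path:
        # One file per asset; a partitioned asset would add the partition key.
        return self.base_path / f"{asset_key}.pickle"

    def handle_output(self, asset_key: str, obj) -> None:
        """Serialize an asset's output to its file."""
        path = self._path_for(asset_key)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(obj, f)

    def load_input(self, asset_key: str):
        """Deserialize a previously stored output for a downstream asset."""
        with self._path_for(asset_key).open("rb") as f:
            return pickle.load(f)


# Round trip through a temporary directory standing in for the bucket.
with tempfile.TemporaryDirectory() as tmp:
    manager = PicklePathIOManager(Path(tmp))
    dataset = {"filename": ["a.pdf"], "n_pages": [3]}
    manager.handle_output("ex21_inference_dataset", dataset)
    restored = manager.load_input("ex21_inference_dataset")
```

Writing against a path abstraction rather than a storage client means the same manager can target local disk in tests and remote storage in deployment by changing only the base path.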